CN113920372A - Data classification method, device, equipment and storage medium - Google Patents

Publication number
CN113920372A
Authority
CN
China
Prior art keywords
dimension
classification
initial
training
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111265054.7A
Other languages
Chinese (zh)
Inventor
刘延磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Puhui Enterprise Management Co Ltd
Original Assignee
Ping An Puhui Enterprise Management Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Puhui Enterprise Management Co Ltd
Priority to CN202111265054.7A
Publication of CN113920372A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03Credit; Loans; Processing thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Business, Economics & Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Biomedical Technology (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Technology Law (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of data processing, and discloses a data classification method, a device, equipment and a storage medium, wherein the method comprises the following steps: obtaining the dimension weight corresponding to each initial dimension; respectively judging whether each dimension weight meets a preset weight range, and taking an initial dimension corresponding to the dimension weight meeting the weight range as a reference dimension; calculating a training dimension parameter of each training sample under each reference dimension; carrying out classification calculation on the training dimension parameters through an initial classification model to obtain a training classification result of each training sample; obtaining a classification confidence corresponding to a training classification result according to a preset classification condition, adjusting calculation parameters in the initial classification model according to the classification confidence to obtain an adjusted classification model, and classifying target data through the adjusted classification model to obtain a target classification result; therefore, data classification is carried out by combining multi-dimensional influence factors, and the accuracy of data classification is improved.

Description

Data classification method, device, equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data classification method, apparatus, device, and storage medium.
Background
In customer-facing industries, particularly those involving financial transactions, customers generally need to undergo risk assessment. The traditional approach is tree-type classification, which combines and classifies factors level by level according to the importance of each dimension's influence, for example with the city as the first level, the industry as the second level, and so on. With this approach it is difficult to make comprehensive use of multi-dimensional influence factors when classifying risk levels, and the accuracy of risk level classification is therefore low.
Disclosure of Invention
The application mainly aims to provide a data classification method, a data classification device, data classification equipment and a storage medium, and aims to solve the problem that a tree type classification mode in the prior art is difficult to combine with multi-dimensional influence factors for data classification.
In order to achieve the above object, the present application provides a data classification method, including:
acquiring a plurality of initial dimensions, and respectively inputting each initial dimension into a preset weight calculation model to respectively obtain a dimension weight corresponding to each initial dimension;
respectively judging whether each dimension weight meets a preset weight range, and taking an initial dimension corresponding to the dimension weight meeting the weight range as a reference dimension;
acquiring training samples, and calculating a training dimension parameter of each training sample under each reference dimension;
establishing an initial classification model, and performing classification calculation on the training dimension parameters through the initial classification model to obtain a training classification result of each training sample;
obtaining a classification confidence corresponding to the training classification result according to a preset classification condition, and adjusting the calculation parameters in the initial classification model according to the classification confidence to obtain an adjusted classification model;
and acquiring target data, and classifying the target data through the adjusted classification model to obtain a target classification result.
Further, the performing classification calculation on the training dimension parameters through the initial classification model to obtain a training classification result of each training sample includes:
vectorizing calculation is carried out on the training dimension parameters through a vectorizing algorithm in the initial classification model to obtain a first attribute vector;
and acquiring a standard attribute vector corresponding to each classification grade, calculating a cosine value between the first attribute vector and each standard attribute vector through a cosine similarity algorithm in the initial classification model, acquiring the cosine similarity through the cosine value, and acquiring a training classification result of each training sample according to the cosine similarity.
Further, after obtaining the target classification result, the method further includes:
identifying behavior information of an initial user, taking the initial user of which the behavior information meets a preset behavior condition as a target user, and performing behavior evaluation on the target user through a 6 sigma algorithm;
and adjusting the target classification result of the target user according to the evaluation result of the behavior evaluation to obtain an adjusted classification result.
Further, before obtaining the plurality of initial dimensions, the method further includes:
acquiring dispersed dimensions, and calculating the dimension similarity of each dispersed dimension;
and merging the dispersed dimensions of which the dimension similarity meets a preset dimension similarity range into the same initial dimension.
Further, the obtaining target data and classifying the target data through the adjusted classification model includes:
identifying authority information corresponding to the target data, and identifying whether an unauthorized dimension exists in the reference dimension according to the authority information;
if so, obtaining all historical dimension parameters corresponding to the unauthorized dimension in the adjusted classification model, and calculating an average dimension parameter of the historical dimension parameters, so that the adjusted classification model classifies the target data according to the average dimension parameter.
Further, before calculating the training dimension parameter of each training sample in each reference dimension, the method further includes:
receiving a dimension supplement instruction, identifying a supplement dimension corresponding to the dimension supplement instruction, and taking the supplement dimension as the reference dimension.
Further, after obtaining the target classification result, the method further includes:
acquiring risk processing information from a preset information base according to the adjusted classification result;
and sending the risk processing information to the target user corresponding to the adjusted classification result.
The present application further provides a data classification apparatus, including:
the weight calculation module is used for acquiring a plurality of initial dimensions, inputting each initial dimension into a preset weight calculation model respectively, and obtaining a dimension weight corresponding to each initial dimension respectively;
the dimension selection module is used for respectively judging whether each dimension weight meets a preset weight range or not and taking an initial dimension corresponding to the dimension weight meeting the weight range as a reference dimension;
the parameter calculation module is used for acquiring training samples and calculating a training dimension parameter of each training sample under each reference dimension;
the first classification module is used for establishing an initial classification model, and performing classification calculation on the training dimension parameters through the initial classification model to obtain a training classification result of each training sample;
the model training module is used for obtaining a classification confidence coefficient corresponding to the training classification result according to a preset classification condition, and adjusting the calculation parameters in the initial classification model according to the classification confidence coefficient to obtain an adjusted classification model;
and the second classification module is used for acquiring target data and classifying the target data through the adjusted classification model to obtain a target classification result.
The present application further proposes a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of any of the above methods when executing the computer program.
The present application also proposes a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method of any of the above.
According to the data classification method, apparatus, device, and storage medium of the present application, a plurality of initial dimensions are obtained and a weight is assigned to each initial dimension through the weight calculation model, which realizes multi-dimensional comprehensive calculation and improves the comprehensiveness of the dimensions and the balance of the weight distribution; the initial dimensions are screened based on the dimension weights, which avoids the cumbersome calculation and low classification efficiency caused by too many initial dimensions with little influence; the training information is quantified by obtaining the training dimension parameters of the training samples under each reference dimension, which facilitates subsequent model training; the training samples are classified through the established initial classification model, the classification confidence of the training classification results is obtained through the classification condition, and the calculation parameters in the initial classification model are adjusted according to the classification confidence to obtain the adjusted classification model, which improves the classification effect and accuracy of the adjusted classification model on the target data.
Drawings
FIG. 1 is a schematic flow chart illustrating a data classification method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart illustrating a data classification method according to an embodiment of the present application;
FIG. 3 is a block diagram illustrating an exemplary data classification apparatus according to an embodiment of the present disclosure;
FIG. 4 is a block diagram illustrating a structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to FIG. 1, in order to achieve the above object, the present application proposes a data classification method, including:
S1: acquiring a plurality of initial dimensions, and respectively inputting each initial dimension into a preset weight calculation model to respectively obtain a dimension weight corresponding to each initial dimension;
S2: respectively judging whether each dimension weight meets a preset weight range, and taking an initial dimension corresponding to the dimension weight meeting the weight range as a reference dimension;
S3: acquiring training samples, and calculating a training dimension parameter of each training sample under each reference dimension;
S4: establishing an initial classification model, and performing classification calculation on the training dimension parameters through the initial classification model to obtain a training classification result of each training sample;
S5: obtaining a classification confidence corresponding to the training classification result according to a preset classification condition, and adjusting the calculation parameters in the initial classification model according to the classification confidence to obtain an adjusted classification model;
S6: acquiring target data, and classifying the target data through the adjusted classification model to obtain a target classification result.
In this embodiment, a plurality of initial dimensions are obtained and a weight is assigned to each initial dimension through the weight calculation model, which realizes multi-dimensional comprehensive calculation and improves the comprehensiveness of the dimensions and the balance of the weight distribution; the initial dimensions are screened based on the dimension weights, which avoids the cumbersome calculation and low classification efficiency caused by too many initial dimensions with little influence; the training information is quantified by obtaining the training dimension parameters of the training samples under each reference dimension, which facilitates subsequent model training; the training samples are classified through the established initial classification model, the classification confidence of the training classification results is obtained through the classification condition, and the calculation parameters in the initial classification model are adjusted according to the classification confidence to obtain the adjusted classification model, which improves the classification effect and accuracy of the adjusted classification model on the target data.
For step S1, the present embodiment applies to data classification, and in particular to the classification of customer data in the financial transaction industry. With the digitalization of the financial industry, risk assessment of customers has become more convenient. In practice, however, the risk level of a customer is not determined by a single index, so the various parameters that may affect the risk level need to be calculated comprehensively. Specifically, before the comprehensive calculation, the initial dimensions that may affect the risk level are acquired. The initial dimensions include city region, city tier, customer age, industry, occupation, post, educational background, marital status, income status, expenditure status, number of children, and the like. Each initial dimension can then be input into the weight calculation model to obtain the corresponding dimension weight. The weight calculation model includes a user perception factor calculation module, a dimension weight adjustment factor calculation module, a data inspection parameter module, an attention area identification module, and a dimension weight distribution calculation module. Specifically, the user perception factor calculation module represents, for a given dimension index, the user's original level of satisfaction with the evaluated dimension; the dimension weight adjustment factor calculation module calculates the weight adjustment factor of each dimension performance index; the data inspection parameter module judges the reasonableness and correctness of the dimension evaluation data obtained from users and from professionals, to obtain the degree of accuracy of the data; the attention area identification module pays special attention to performance indexes that both users and professionals rate extremely low, which helps to eliminate erroneous data and to improve individual performance; the dimension weight distribution calculation module combines the original importance data with the weight adjustment factor of each dimension performance index to obtain the weight distribution result after the dimension performance indexes are normalized. Illustratively, according to feedback from users and professionals, the weight calculation model reduces the weights of dimensions that are interrelated and close in meaning, such as city region and city tier, so that their sum falls within an appropriate weight range; this prevents dimensions with similar meanings from upsetting the balance of the overall weights. In this embodiment, the comprehensiveness of the dimensions and the balance of the weight distribution are improved by acquiring a plurality of initial dimensions and assigning a weight to each initial dimension through the weight calculation model.
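The final weight distribution step can be illustrated with a minimal sketch. It assumes that each dimension's original importance is multiplied by its adjustment factor and that the products are then normalized to sum to 1; the source does not give the formula, so the function name `distribute_weights`, the multiplicative combination, and the sample numbers are all illustrative assumptions.

```python
def distribute_weights(importance, adjustment):
    """Combine original importance with adjustment factors, then normalize.

    Assumption: the combination is a simple product; dimensions without an
    explicit adjustment factor keep a factor of 1.0.
    """
    adjusted = {dim: importance[dim] * adjustment.get(dim, 1.0) for dim in importance}
    total = sum(adjusted.values())
    if total <= 0:
        raise ValueError("adjusted importance must sum to a positive value")
    return {dim: value / total for dim, value in adjusted.items()}


# Hypothetical importance scores; the two closely related city dimensions
# are down-weighted so that together they do not dominate the result.
importance = {"city_region": 4.0, "city_tier": 4.0, "income_status": 8.0}
adjustment = {"city_region": 0.5, "city_tier": 0.5}
weights = distribute_weights(importance, adjustment)
```

With these sample numbers the two city dimensions together receive the weight a single dimension of the same importance would have received, which is the balancing effect described above.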
For step S2, the number of initial dimensions affecting the risk level may range from a few to several dozen. When there are many initial dimensions, some of them often have little influence on the current risk level calculation, which after quantification usually shows up as a low weight. To improve calculation efficiency, a weight range can be set for screening, so that a large number of low-influence dimensions do not interfere with the calculation. The weight range may be any weight greater than a minimum weight value; the minimum weight value may be, for example, 5%, and can be set according to actual requirements. In this embodiment, the initial dimensions are screened based on the dimension weights, which avoids the cumbersome calculation and low classification efficiency caused by too many initial dimensions with little influence.
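The screening in step S2 can be sketched as a simple filter. The 5% minimum comes from the example in the text; the function name, the dimension names, and the sample weights are hypothetical.

```python
MIN_WEIGHT = 0.05  # example minimum weight value from the text (5%)


def select_reference_dimensions(weights, min_weight=MIN_WEIGHT):
    """Keep the initial dimensions whose weight is greater than the minimum."""
    return [dim for dim, weight in weights.items() if weight > min_weight]


# Hypothetical dimension weights produced by the weight calculation model.
dimension_weights = {
    "customer_age": 0.30,
    "industry": 0.25,
    "income_status": 0.25,
    "expenditure_status": 0.12,
    "number_of_children": 0.06,
    "marital_status": 0.02,
}
reference_dimensions = select_reference_dimensions(dimension_weights)
```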
For step S3, once the reference dimensions have been determined, the training dimension parameters of the training samples under the reference dimensions can be obtained. The training samples may be preset customer information. For example, if the reference dimensions finally screened from the initial dimensions are customer age, industry, income status, expenditure status, and number of children, then the part of each piece of customer information that corresponds to these dimensions is taken as the training dimension parameters, for example "30 years old", "construction industry", and so on. In this embodiment, the training information is quantified by obtaining the training dimension parameters of the training samples under each reference dimension, which facilitates subsequent model training.
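As a minimal sketch of step S3, the training dimension parameters can be read as the projection of each customer record onto the reference dimensions. The record layout and field names below are assumptions made for illustration only.

```python
def training_dimension_parameters(sample, reference_dimensions):
    """Return the part of a customer record that falls under the reference dimensions.

    A dimension missing from the record is reported as None.
    """
    return {dim: sample.get(dim) for dim in reference_dimensions}


# Hypothetical customer record; only some fields are reference dimensions.
customer = {
    "customer_age": 30,
    "industry": "construction",
    "income_status": "stable",
    "marital_status": "married",
}
reference_dimensions = ["customer_age", "industry", "income_status", "number_of_children"]
parameters = training_dimension_parameters(customer, reference_dimensions)
```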
For step S4, establishing the initial classification model includes: establishing a standard attribute vector data set, initializing the number of networks and the network parameters of each neural layer, and establishing a feedback function. The neural layers include a data vectorization algorithm and a cosine similarity algorithm arranged in sequence. After the training dimension parameters are input into the initial classification model, they are vectorized through the neural network, and the similarity between the vector of the training dimension parameters and each standard attribute vector is calculated through the cosine similarity algorithm. The standard attribute vector data set includes a plurality of standard attribute vectors, one per classification level, and each classification level corresponds to a risk level. After the similarities have been calculated, the classification level whose standard attribute vector is most similar to the vector of the training dimension parameters is taken as the risk level of the corresponding training sample. This calculation is performed on each training sample in turn to obtain the training classification results. In this embodiment, establishing an initial classification model and obtaining the training classification results through it provides the data basis for the subsequent model adjustment steps.
For step S5, a correct classification result may be preset for each training sample, or the training classification results may be verified by manual review, to determine whether each classification is accurate; this yields the classification confidence. A low classification confidence indicates that the algorithm in the current classification model needs to be adjusted. In that case, the parameters of the initial classification model can be adjusted through the neural network, the training samples are input into the adjusted model to obtain the next round of classification results, and the process iterates until the classification confidence meets the preset condition, yielding an adjusted classification model with higher confidence.
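The iterate-until-confident loop of step S5 can be sketched as follows. The source does not specify how the parameters are adjusted, so this sketch makes two loudly stated assumptions: the classification confidence is taken to be the fraction of training samples classified correctly, and the adjustable parameters are taken to be the standard attribute vectors, which are nudged toward misclassified samples. Neither is confirmed as the patent's method.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(vector, standards):
    """Return the level whose standard vector is most similar to `vector`."""
    return max(standards, key=lambda level: cosine(vector, standards[level]))


def confidence(samples, standards):
    """Assumed confidence measure: share of samples matching their preset label."""
    correct = sum(classify(vec, standards) == label for vec, label in samples)
    return correct / len(samples)


def adjust(samples, standards, target=0.9, rate=0.5, max_rounds=20):
    """Iterate until the confidence meets the preset condition (or rounds run out).

    Assumed update rule: move the correct level's standard vector part of the
    way toward each misclassified sample.
    """
    standards = {level: list(vec) for level, vec in standards.items()}
    for _ in range(max_rounds):
        if confidence(samples, standards) >= target:
            break
        for vec, label in samples:
            if classify(vec, standards) != label:
                ref = standards[label]
                standards[label] = [r + rate * (v - r) for r, v in zip(ref, vec)]
    return standards


# Hypothetical training samples (vector, correct level) and a deliberately
# poor initial model, so that at least one adjustment round is needed.
samples = [
    ([1.0, 0.1], "low"),
    ([0.9, 0.2], "low"),
    ([0.1, 1.0], "high"),
    ([0.2, 0.9], "high"),
]
initial = {"low": [0.0, 1.0], "high": [1.0, 0.0]}
adjusted = adjust(samples, initial)
```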
For step S6, the target data may be user data awaiting risk assessment, containing parameters for several dimensions. After the target data is obtained, it is classified with the adjusted classification model trained on the basis of the classification confidence, and the classification level with the highest similarity to the target data is obtained, which improves the accuracy with which the target data is classified.
In an embodiment, the performing, by the initial classification model, a classification calculation on the training dimension parameters to obtain a training classification result of each training sample includes:
S41: performing vectorization calculation on the training dimension parameters through a vectorization algorithm in the initial classification model to obtain a first attribute vector;
S42: acquiring a standard attribute vector corresponding to each classification level, calculating a cosine value between the first attribute vector and each standard attribute vector through a cosine similarity algorithm in the initial classification model, obtaining the cosine similarity from the cosine value, and obtaining a training classification result of each training sample according to the cosine similarity.
In this embodiment, vectorization calculation is performed on the training dimensional parameters through the initial classification model, and cosine similarity calculation is performed on the first attribute vector and the standard attribute vector by using a cosine similarity calculation method to obtain a similarity between the training dimensional parameters and the preset standard attribute vector, so as to obtain an accurate risk level of each training sample.
For step S41, the vectorization algorithm may include at least one of bag-of-words model vectorization, TF-IDF model vectorization, and LSI model vectorization.
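Of the vectorization options listed for step S41, the bag-of-words model is the simplest to sketch. The example below assumes the training dimension parameters have been joined into a whitespace-separated text description; the tokenization and the sample descriptions are illustrative assumptions.

```python
from collections import Counter


def build_vocabulary(documents):
    """Sorted list of every distinct lower-cased token in the documents."""
    return sorted({token for doc in documents for token in doc.lower().split()})


def bag_of_words(document, vocabulary):
    """Count how often each vocabulary token occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[token] for token in vocabulary]


# Hypothetical textual training dimension parameters for two customers.
descriptions = [
    "construction industry stable income",
    "retail industry irregular income",
]
vocabulary = build_vocabulary(descriptions)
first_attribute_vector = bag_of_words(descriptions[0], vocabulary)
```

TF-IDF and LSI vectorization would start from the same token counts and then reweight or project them.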
For step S42, the standard attribute vectors include classification level vectors for several risk levels, and each standard attribute vector corresponds to one risk classification level.
Specifically, the cosine value between the first attribute vector and a standard attribute vector can be obtained from the Euclidean dot product formula:
$$a \cdot b = |a|\,|b|\cos\theta$$
Substituting the first attribute vector A and the standard attribute vector B for a and b in the above formula gives the cosine value cos θ between them. The cosine similarity is thus given by the dot product and the vector lengths, calculated as follows:
$$\cos\theta = \frac{A \cdot B}{|A|\,|B|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$
The closer the cosine value is to 1, the closer the included angle is to 0 and the more similar the two vectors are; the closer the cosine value is to 0, the closer the included angle is to 90 degrees and the more dissimilar they are. The similarity between the training dimension parameters and each standard attribute vector can therefore be obtained from the above formula, and the classification level corresponding to the standard attribute vector with the highest cosine similarity is taken as the class of the training dimension parameters, which gives the risk level of each training sample.
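The cosine similarity classification of step S42 can be sketched in a few lines. The risk level names and the vectors are hypothetical; only the dot-product-over-lengths calculation and the pick-the-highest rule come from the text.

```python
import math


def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| |b|) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    if length_a == 0 or length_b == 0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (length_a * length_b)


def classify(first_attribute_vector, standard_vectors):
    """Return (level, similarity) for the most similar standard attribute vector."""
    scores = {
        level: cosine_similarity(first_attribute_vector, standard)
        for level, standard in standard_vectors.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical standard attribute vectors, one per risk level.
standard_vectors = {
    "low risk": [1.0, 0.0, 0.0],
    "medium risk": [1.0, 1.0, 0.0],
    "high risk": [0.0, 1.0, 1.0],
}
level, score = classify([2.0, 1.0, 0.0], standard_vectors)
```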
In one embodiment, referring to FIG. 2, after step S6 of obtaining the target classification result, the method further includes:
S71: identifying behavior information of an initial user, taking the initial user whose behavior information meets a preset behavior condition as a target user, and performing behavior evaluation on the target user through a 6 sigma algorithm;
S72: adjusting the target classification result of the target user according to the evaluation result of the behavior evaluation to obtain an adjusted classification result.
In this embodiment, the 6 sigma algorithm is used to count the number of times a user fails to meet standards such as on-time repayment, which ensures that the samples comprehensively cover industry customers and improves the accuracy of risk classification.
For step S71, the behavior condition may be that a loan or a repayment has occurred; for users who have taken out loans, risk assessment can be performed with the 6 sigma algorithm across different loan scenarios. Specifically, 6 sigma is a statistical measure corresponding to a quality level of only 3.4 defects per million opportunities. In terms of risk classification, the sigma level measures the ability of a particular process to perform defect-free work: the higher the number of standard deviations, the fewer defects occur. In this algorithm, "6" represents the standard value. For example, suppose a user wants a temperature controller to keep the room temperature at around 70 degrees. If the temperature stays between 67 and 73 degrees, this is considered an acceptable standard; if it fluctuates between 55 and 85 degrees, the variation is too large to meet the required standard. In this embodiment, the algorithm is used to count the number of times a user fails to meet standards such as on-time repayment, which ensures that the samples comprehensively cover industry customers and improves the accuracy of risk classification.
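One way to turn counts of missed standards into a sigma level is sketched below. It uses the conventional Six Sigma calculation (defects per million opportunities, converted through the normal distribution with the customary 1.5 sigma shift, under which 3.4 defects per million corresponds to 6 sigma). Treating each scheduled repayment as one opportunity and each late repayment as one defect is an assumption, as are the function names; the source does not spell out its calculation.

```python
from statistics import NormalDist


def dpmo(defects, opportunities):
    """Defects per million opportunities."""
    if opportunities <= 0:
        raise ValueError("opportunities must be positive")
    return defects / opportunities * 1_000_000


def sigma_level(dpmo_value, shift=1.5):
    """Sigma level for a defect rate, using the conventional 1.5 sigma shift.

    Requires 0 < dpmo_value < 1_000_000 so that the yield is a valid probability.
    """
    if not 0 < dpmo_value < 1_000_000:
        raise ValueError("dpmo_value must be strictly between 0 and 1,000,000")
    yield_fraction = 1 - dpmo_value / 1_000_000
    return NormalDist().inv_cdf(yield_fraction) + shift


# Hypothetical user: 2 late repayments out of 48 scheduled repayments.
user_dpmo = dpmo(defects=2, opportunities=48)
user_sigma = sigma_level(user_dpmo)
```

A user whose sigma level falls well below the chosen standard could then have the target classification result adjusted in step S72.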
In one embodiment, before obtaining the plurality of initial dimensions, the method further includes:
S101: acquiring dispersion dimensions, and calculating the dimension similarity between the dispersion dimensions;
S102: merging the dispersion dimensions whose dimension similarity falls within a preset dimension similarity range into the same initial dimension.
According to the embodiment, the dispersion dimensions are merged according to their semantic similarity, which avoids the reduction in calculation efficiency caused by duplicated dimensions.
For step S101, the dispersion dimensions are aggregated in advance, before the initial dimensions are obtained. The dispersion dimensions may be all dimensions related to risk rating assessment; that is, besides the above-mentioned initial dimensions such as urban territory, urban level, customer age, industry, occupation, post, educational background, marital status, income status, expenditure status, and number of children, the dispersion dimensions may also include year of birth, college attended, and the like. It can be understood that year of birth has essentially the same practical meaning as customer age, and college attended has essentially the same practical meaning as educational background.
For step S102, to avoid the loss of calculation efficiency caused by such repeated dimensions, this embodiment merges the dispersion dimensions according to their semantic similarity before classification. Specifically, the semantic similarity may be calculated by constructing a neural network: word embedding is performed through GloVe or Word2vec, and each dispersion dimension is then encoded by a sentence encoding model to obtain a corresponding vector. The semantic similarity is calculated from the distance between these vectors and is used as the dimension similarity, according to which the dispersion dimensions are finally merged.
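The merging of steps S101 and S102 can be illustrated with a short Python sketch. It assumes that each dispersion dimension has already been encoded as a vector; the three-component vectors in `TOY_VECTORS` are made-up stand-ins for the output of a GloVe or Word2vec based sentence encoder, and the function names, the greedy grouping strategy and the threshold of 0.9 are likewise assumptions rather than details given in this application.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_dimensions(vectors: dict, threshold: float = 0.9) -> list:
    """Greedily group dimensions whose similarity to the first member
    of an existing group reaches `threshold`; each resulting group
    plays the role of one initial dimension."""
    groups = []
    for name, vec in vectors.items():
        for group in groups:
            if cosine(vec, vectors[group[0]]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Hypothetical embeddings, for illustration only.
TOY_VECTORS = {
    "customer age": [0.90, 0.10, 0.00],
    "year of birth": [0.85, 0.15, 0.05],
    "educational background": [0.10, 0.90, 0.10],
    "college attended": [0.15, 0.85, 0.20],
    "income status": [0.00, 0.10, 0.95],
}

merged = merge_dimensions(TOY_VECTORS)
```

With these toy vectors, year of birth is grouped with customer age and college attended with educational background, while income status remains a separate initial dimension.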
In one embodiment, the obtaining target data and classifying the target data through the adjusted classification model includes:
S61: identifying authority information corresponding to the target data, and identifying, according to the authority information, whether an unauthorized dimension exists among the reference dimensions;
S62: if so, obtaining all historical dimension parameters corresponding to the unauthorized dimension in the adjusted classification model, and calculating an average dimension parameter of the historical dimension parameters, so that the adjusted classification model classifies the target data according to the average dimension parameter.
In this embodiment, the unauthorized dimensions of different target data are identified and the missing data are filled with the average dimension parameter, which mitigates the excessive classification error caused by missing parameters for unauthorized dimensions.
For step S61, in a specific application, different service types have different requirements for obtaining users' private data. Therefore, before classification, the dimension authority is used to check whether every reference dimension is a valid dimension of the target data.
For step S62, when the authority check identifies an unauthorized dimension in some target data, this embodiment uses the average dimension parameter of that dimension as filling data, so as to avoid a large error in the classification result caused by the missing data and thereby reduce the classification error of the target data. Specifically, the historical dimension parameters, under the unauthorized dimension, of all historical target data classified by the adjusted classification model are obtained, and their average value is calculated to obtain more appropriate filling data.
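Steps S61 and S62 can be sketched in Python as a simple mean fill. The function name `fill_unauthorized`, the dictionary-based data layout and the representation of the authority information as a set of authorized dimension names are assumptions chosen for this illustration.

```python
from statistics import fmean


def fill_unauthorized(target: dict, authorized: set,
                      history: list, reference_dims: list) -> dict:
    """Return one parameter per reference dimension for `target`.

    Authorized dimensions keep the value supplied in `target`.
    Unauthorized (or missing) dimensions are filled with the mean of
    the historical parameters previously classified under that dimension.
    """
    filled = {}
    for dim in reference_dims:
        if dim in authorized and target.get(dim) is not None:
            filled[dim] = target[dim]
            continue
        past = [row[dim] for row in history if dim in row]
        if not past:
            raise ValueError(f"no historical parameters for {dim!r}")
        filled[dim] = fmean(past)
    return filled


history = [
    {"customer age": 30.0, "income status": 10.0},
    {"customer age": 40.0, "income status": 20.0},
]
# The service type of this target data is not authorized to read income.
filled = fill_unauthorized(
    target={"customer age": 35.0},
    authorized={"customer age"},
    history=history,
    reference_dims=["customer age", "income status"],
)
```

Here the unauthorized income dimension receives the historical mean, so the adjusted classification model still gets a complete parameter set for the target data.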
In one embodiment, before the calculating the training dimension parameter of each of the training samples in each of the reference dimensions, the method further includes:
S301: receiving a dimension supplement instruction, identifying a supplement dimension corresponding to the dimension supplement instruction, and taking the supplement dimension as a reference dimension.
In this embodiment, receiving the dimension supplement instruction allows omissions in the reference dimensions to be checked and filled, which improves the accuracy of data classification.
For step S301, a necessary dimension with a low dimension weight may have been screened out, or the obtained initial dimensions may be incomplete. To avoid these problems, this embodiment supplements the dimensions by receiving reference dimensions manually entered by an administrator, thereby improving the accuracy of data classification.
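A minimal Python sketch of the weight-based screening followed by the supplement of step S301 is given below. The weights, the weight range of 0.05 to 1.0 and the function names `select_reference_dimensions` and `apply_supplement` are hypothetical and serve only to show how a screened-out dimension could be restored by an administrator.

```python
def select_reference_dimensions(weights: dict, low: float, high: float) -> list:
    """Keep the initial dimensions whose weight lies in [low, high]."""
    return [dim for dim, weight in weights.items() if low <= weight <= high]


def apply_supplement(reference_dims: list, supplement_dims: list) -> list:
    """Add manually supplied dimensions, skipping any already present."""
    merged = list(reference_dims)
    for dim in supplement_dims:
        if dim not in merged:
            merged.append(dim)
    return merged


# Hypothetical output of the weight calculation model.
weights = {"income status": 0.30, "occupation": 0.25, "marital status": 0.04}
reference = select_reference_dimensions(weights, low=0.05, high=1.0)
# The administrator restores a dimension that the screening removed.
reference = apply_supplement(reference, ["marital status", "income status"])
```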
In one embodiment, after step S6 of obtaining the target classification result, the method further includes:
S73: acquiring risk processing information from a preset information base according to the adjusted classification result;
S74: sending the risk processing information to the target user corresponding to the adjusted classification result.
According to the embodiment, the risk processing information is obtained according to the adjusted classification result and sent to the user, so that the user can correct his or her risk behavior according to the received risk processing information, which improves the timeliness of information transmission.
For step S73, the information base may store a plurality of different pieces of risk processing information in advance, where different risk processing information corresponds to different adjusted classification results. For example, if the adjusted classification result indicates a high risk level, the corresponding risk processing information may include clearing overdue loans within 30 working days, reducing the number of credit cards used for arbitrage, and the like; if the risk level is low, the corresponding risk processing information may be to maintain the current repayment status, and the like.
For step S74, sending the risk processing information to the corresponding target user makes it easier for the user to correct the risk behavior and give feedback in time, and improves the timeliness of information transmission.
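Steps S73 and S74 amount to a lookup followed by a delivery, which might be sketched in Python as follows. The contents of `INFO_BASE`, the grade labels, and the `send` callback standing in for an actual message channel are assumptions for illustration; the application does not specify how the information base is stored or how messages are delivered.

```python
from typing import Callable

# Hypothetical preset information base keyed by adjusted classification result.
INFO_BASE = {
    "high": [
        "Clear overdue loans within 30 working days.",
        "Reduce the number of credit cards used for arbitrage.",
    ],
    "low": ["Maintain the current repayment status."],
}


def risk_processing_info(adjusted_result: str, info_base: dict) -> list:
    """S73: look up the messages stored for an adjusted classification result."""
    return list(info_base.get(adjusted_result, []))


def notify(user_id: str, adjusted_result: str, info_base: dict,
           send: Callable[[str, str], None]) -> int:
    """S74: send each message to the target user; return how many were sent."""
    messages = risk_processing_info(adjusted_result, info_base)
    for message in messages:
        send(user_id, message)
    return len(messages)


# Collect messages in a list instead of calling a real delivery service.
outbox = []
sent_count = notify("user-001", "high", INFO_BASE,
                    lambda uid, msg: outbox.append((uid, msg)))
```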
Referring to fig. 3, the present application also proposes a data classification apparatus, including:
a weight calculation module 100, configured to obtain a plurality of initial dimensions, and input each of the initial dimensions into a preset weight calculation model to obtain a dimension weight corresponding to each of the initial dimensions;
a dimension selection module 200, configured to determine whether each of the dimension weights falls within a preset weight range, and take each initial dimension whose dimension weight falls within the weight range as a reference dimension;
a parameter calculation module 300, configured to obtain training samples, and calculate a training dimension parameter of each training sample in each reference dimension;
a first classification module 400, configured to establish an initial classification model, and perform classification calculation on the training dimension parameters through the initial classification model to obtain a training classification result of each training sample;
a model training module 500, configured to obtain a classification confidence corresponding to the training classification result according to a preset classification condition, and adjust the calculation parameters in the initial classification model according to the classification confidence to obtain an adjusted classification model;
and a second classification module 600, configured to obtain target data, and classify the target data through the adjusted classification model to obtain a target classification result.
In this embodiment, a plurality of initial dimensions are obtained and a weight is assigned to each initial dimension through the weight calculation model, which realizes multi-dimensional comprehensive calculation and improves the comprehensiveness of the dimensions and the balance of the weight distribution; the initial dimensions are screened based on the dimension weights, which avoids the cumbersome calculation and low classification efficiency caused by too many initial dimensions that have little influence; the training dimension parameters of the training samples under each reference dimension are obtained so that the training information is quantized, which facilitates subsequent model training; the training samples are classified through the established initial classification model, the classification confidence of the training classification results is calculated according to the classification condition, and the calculation parameters in the initial classification model are adjusted according to the classification confidence to obtain the adjusted classification model, which improves the classification effect and accuracy of the adjusted classification model on the target data.
In one embodiment, the first classification module 400 is further configured to:
perform vectorization calculation on the training dimension parameters through a vectorization algorithm in the initial classification model to obtain a first attribute vector;
and acquire a standard attribute vector corresponding to each classification grade, calculate a cosine value between the first attribute vector and each standard attribute vector through a cosine similarity algorithm in the initial classification model, obtain the cosine similarity from the cosine value, and obtain the training classification result of each training sample according to the cosine similarity.
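The cosine-similarity classification performed by the first classification module 400 can be illustrated with the following Python sketch. It assumes that the training dimension parameters have already been vectorized into a first attribute vector and that the grade with the highest cosine similarity is taken as the classification result; the grade names, the values in `STANDARD_VECTORS` and the function names are hypothetical.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(attribute_vector: list, standard_vectors: dict) -> tuple:
    """Return the classification grade whose standard attribute vector
    is most similar to `attribute_vector`, with the similarity itself."""
    best_grade = max(
        standard_vectors,
        key=lambda grade: cosine_similarity(attribute_vector,
                                            standard_vectors[grade]),
    )
    return best_grade, cosine_similarity(attribute_vector,
                                         standard_vectors[best_grade])


# Hypothetical standard attribute vectors, one per classification grade.
STANDARD_VECTORS = {
    "low risk": [1.0, 0.2, 0.1],
    "medium risk": [0.5, 0.5, 0.5],
    "high risk": [0.1, 0.3, 1.0],
}

grade, similarity = classify([0.9, 0.3, 0.2], STANDARD_VECTORS)
```

The returned similarity could also serve as an input when the classification confidence is evaluated, although how the confidence is derived from it is not specified here.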
In one embodiment, the second classification module 600 is further configured to:
identify behavior information of initial users, take each initial user whose behavior information meets a preset behavior condition as a target user, and perform behavior evaluation on the target user through a 6 sigma algorithm;
and adjust the target classification result of the target user according to the evaluation result of the behavior evaluation to obtain an adjusted classification result.
In one embodiment, the weight calculation module 100 is further configured to:
acquire dispersion dimensions, and calculate the dimension similarity between the dispersion dimensions;
and merge the dispersion dimensions whose dimension similarity falls within a preset dimension similarity range into the same initial dimension.
In one embodiment, the second classification module 600 is further configured to:
identify authority information corresponding to the target data, and identify, according to the authority information, whether an unauthorized dimension exists among the reference dimensions;
if so, obtain all historical dimension parameters corresponding to the unauthorized dimension in the adjusted classification model, and calculate an average dimension parameter of the historical dimension parameters, so that the adjusted classification model classifies the target data according to the average dimension parameter.
In one embodiment, the parameter calculation module 300 is further configured to:
receive a dimension supplement instruction, identify a supplement dimension corresponding to the dimension supplement instruction, and take the supplement dimension as a reference dimension.
In one embodiment, the second classification module 600 is further configured to:
acquire risk processing information from a preset information base according to the adjusted classification result;
and send the risk processing information to the target user corresponding to the adjusted classification result.
Referring to fig. 4, an embodiment of the present application also provides a computer device, which may be a server and whose internal structure may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device stores the data used by the data classification method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a data classification method.
The data classification method comprises the following steps: acquiring a plurality of initial dimensions, and respectively inputting each initial dimension into a preset weight calculation model to respectively obtain a dimension weight corresponding to each initial dimension; respectively judging whether each dimension weight meets a preset weight range, and taking an initial dimension corresponding to the dimension weight meeting the weight range as a reference dimension; acquiring training samples, and calculating a training dimension parameter of each training sample under each reference dimension; establishing an initial classification model, and performing classification calculation on the training dimension parameters through the initial classification model to obtain a training classification result of each training sample; obtaining a classification confidence corresponding to the training classification result according to a preset classification condition, and adjusting the calculation parameters in the initial classification model according to the classification confidence to obtain an adjusted classification model; and acquiring target data, and classifying the target data through the adjusted classification model to obtain a target classification result.
An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing a data classification method, including the steps of: acquiring a plurality of initial dimensions, and respectively inputting each initial dimension into a preset weight calculation model to respectively obtain a dimension weight corresponding to each initial dimension; respectively judging whether each dimension weight meets a preset weight range, and taking an initial dimension corresponding to the dimension weight meeting the weight range as a reference dimension; acquiring training samples, and calculating a training dimension parameter of each training sample under each reference dimension; establishing an initial classification model, and performing classification calculation on the training dimension parameters through the initial classification model to obtain a training classification result of each training sample; obtaining a classification confidence corresponding to the training classification result according to a preset classification condition, and adjusting the calculation parameters in the initial classification model according to the classification confidence to obtain an adjusted classification model; and acquiring target data, and classifying the target data through the adjusted classification model to obtain a target classification result.
In the data classification method, a plurality of initial dimensions are obtained and a weight is assigned to each initial dimension through the weight calculation model, which realizes multi-dimensional comprehensive calculation and improves the comprehensiveness of the dimensions and the balance of the weight distribution; the initial dimensions are screened based on the dimension weights, which avoids the cumbersome calculation and low classification efficiency caused by too many initial dimensions that have little influence; the training dimension parameters of the training samples under each reference dimension are obtained so that the training information is quantized, which facilitates subsequent model training; the training samples are classified through the established initial classification model, the classification confidence of the training classification results is calculated according to the classification condition, and the calculation parameters in the initial classification model are adjusted according to the classification confidence to obtain the adjusted classification model, which improves the classification effect and accuracy of the adjusted classification model on the target data.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, apparatus, article, or method that includes the element.
The above description is only a preferred embodiment of the present application and is not intended to limit the scope of the present application. All equivalent structures or equivalent process modifications made using the contents of the specification and the drawings of the present application, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of protection of the present application.

Claims (10)

1. A method of data classification, the method comprising:
acquiring a plurality of initial dimensions, and respectively inputting each initial dimension into a preset weight calculation model to respectively obtain a dimension weight corresponding to each initial dimension;
respectively judging whether each dimension weight meets a preset weight range, and taking an initial dimension corresponding to the dimension weight meeting the weight range as a reference dimension;
acquiring training samples, and calculating a training dimension parameter of each training sample under each reference dimension;
establishing an initial classification model, and performing classification calculation on the training dimension parameters through the initial classification model to obtain a training classification result of each training sample;
obtaining a classification confidence corresponding to the training classification result according to a preset classification condition, and adjusting the calculation parameters in the initial classification model according to the classification confidence to obtain an adjusted classification model;
and acquiring target data, and classifying the target data through the adjusted classification model to obtain a target classification result.
2. The data classification method according to claim 1, wherein the performing classification calculation on the training dimension parameters through the initial classification model to obtain a training classification result of each training sample includes:
performing vectorization calculation on the training dimension parameters through a vectorization algorithm in the initial classification model to obtain a first attribute vector;
and acquiring a standard attribute vector corresponding to each classification grade, calculating a cosine value between the first attribute vector and each standard attribute vector through a cosine similarity algorithm in the initial classification model, obtaining the cosine similarity from the cosine value, and obtaining the training classification result of each training sample according to the cosine similarity.
3. The data classification method according to claim 1, further comprising, after obtaining the target classification result:
identifying behavior information of an initial user, taking the initial user of which the behavior information meets a preset behavior condition as a target user, and performing behavior evaluation on the target user through a 6 sigma algorithm;
and adjusting the target classification result of the target user according to the evaluation result of the behavior evaluation to obtain an adjusted classification result.
4. The data classification method according to claim 1, wherein before obtaining the plurality of initial dimensions, the method further comprises:
acquiring dispersion dimensions, and calculating the dimension similarity of each dispersion dimension;
and merging the dispersion dimensions whose dimension similarity falls within a preset dimension similarity range into the same initial dimension.
5. The data classification method according to claim 1, wherein the obtaining target data and classifying the target data through the adjusted classification model comprises:
identifying authority information corresponding to the target data, and identifying, according to the authority information, whether an unauthorized dimension exists among the reference dimensions;
if so, obtaining all historical dimension parameters corresponding to the unauthorized dimension in the adjusted classification model, and calculating an average dimension parameter of the historical dimension parameters, so that the adjusted classification model classifies the target data according to the average dimension parameter.
6. The data classification method according to claim 1, wherein before the calculating of the training dimension parameter of each of the training samples in each of the reference dimensions, the method further comprises:
receiving a dimension supplement instruction, identifying a supplement dimension corresponding to the dimension supplement instruction, and taking the supplement dimension as the reference dimension.
7. The data classification method according to claim 3, further comprising, after obtaining the target classification result:
acquiring risk processing information from a preset information base according to the adjusted classification result;
and sending the risk processing information to the target user corresponding to the adjusted classification result.
8. A data classification apparatus, comprising:
the weight calculation module is used for acquiring a plurality of initial dimensions, inputting each initial dimension into a preset weight calculation model respectively, and obtaining a dimension weight corresponding to each initial dimension respectively;
the dimension selection module is used for respectively judging whether each dimension weight meets a preset weight range or not and taking an initial dimension corresponding to the dimension weight meeting the weight range as a reference dimension;
the parameter calculation module is used for acquiring training samples and calculating a training dimension parameter of each training sample under each reference dimension;
the first classification module is used for establishing an initial classification model, and performing classification calculation on the training dimension parameters through the initial classification model to obtain a training classification result of each training sample;
the model training module is used for obtaining a classification confidence corresponding to the training classification result according to a preset classification condition, and adjusting the calculation parameters in the initial classification model according to the classification confidence to obtain an adjusted classification model;
and the second classification module is used for acquiring target data and classifying the target data through the adjusted classification model to obtain a target classification result.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202111265054.7A 2021-10-28 2021-10-28 Data classification method, device, equipment and storage medium Pending CN113920372A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111265054.7A CN113920372A (en) 2021-10-28 2021-10-28 Data classification method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111265054.7A CN113920372A (en) 2021-10-28 2021-10-28 Data classification method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113920372A true CN113920372A (en) 2022-01-11

Family

ID=79243669

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111265054.7A Pending CN113920372A (en) 2021-10-28 2021-10-28 Data classification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113920372A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114357037A (en) * 2022-03-22 2022-04-15 苏州浪潮智能科技有限公司 Time sequence data analysis method and device, electronic equipment and storage medium
CN115022014A (en) * 2022-05-30 2022-09-06 平安银行股份有限公司 Login risk identification method, device, equipment and storage medium
CN115022014B (en) * 2022-05-30 2023-07-14 平安银行股份有限公司 Login risk identification method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN108876133B (en) Risk assessment processing method, device, server and medium based on business information
US10963791B2 (en) Optimizing neural networks for risk assessment
CN109345374B (en) Risk control method and device, computer equipment and storage medium
CN109858737B (en) Grading model adjustment method and device based on model deployment and computer equipment
WO2020000688A1 (en) Financial risk verification processing method and apparatus, computer device, and storage medium
CN113920372A (en) Data classification method, device, equipment and storage medium
WO2014055238A1 (en) System and method for building and validating a credit scoring function
CN112102073A (en) Credit risk control method and system, electronic device and readable storage medium
CN111553390A (en) User classification method and device, computer equipment and storage medium
CN113139876B (en) Risk model training method, risk model training device, computer equipment and readable storage medium
CN111062444A (en) Credit risk prediction method, system, terminal and storage medium
CN111738762A (en) Method, device, equipment and storage medium for determining recovery price of poor assets
CN113327164A (en) Risk control method and device for futures trading and computer equipment
CN114187125A (en) Claims case distribution method, device, equipment and storage medium
CN111340365A (en) Enterprise data processing method and device, computer equipment and storage medium
CN113011961B (en) Method, device, equipment and storage medium for monitoring risk of company-related information
CN112990989B (en) Value prediction model input data generation method, device, equipment and medium
CN112037005B (en) Fusion method and device of score cards, computer equipment and storage medium
CN110765351A (en) Target user identification method and device, computer equipment and storage medium
CN112035775B (en) User identification method and device based on random forest model and computer equipment
CN117575773A (en) Method, device, computer equipment and storage medium for determining service data
CN116128339A (en) Client credit evaluation method and device, storage medium and electronic equipment
US20220414125A1 (en) Systems and Methods for Computer Modeling Using Incomplete Data
O’Hagan et al. Model-based and nonparametric approaches to clustering for data compression in actuarial applications
CN115170271A (en) Clustering method, device, equipment and storage medium for risk associated enterprises

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination