CN112232774B - Account clearing and backing and memory allocation prediction method for office automation system - Google Patents

Account clearing and backing and memory allocation prediction method for office automation system

Info

Publication number
CN112232774B
Authority
CN
China
Prior art keywords
mailbox
prediction
memory
office automation
account
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011125936.9A
Other languages
Chinese (zh)
Other versions
CN112232774A (en)
Inventor
承春明
赵欣慧
王永翔
赵东坡
刘思远
陈恩权
张瑞
王金珂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Luohe Power Supply Company State Grid Henan Electric Power Co
Original Assignee
Luohe Power Supply Company State Grid Henan Electric Power Co
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Luohe Power Supply Company State Grid Henan Electric Power Co
Priority to CN202011125936.9A
Publication of CN112232774A
Application granted
Publication of CN112232774B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/105 Human resources
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/107 Computer-aided management of electronic mailing [e-mailing]

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Strategic Management (AREA)
  • Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Engineering & Computer Science (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Game Theory and Decision Science (AREA)
  • Development Economics (AREA)
  • Computer Hardware Design (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an account clearing and memory allocation prediction method for an office automation system, comprising a zombie account screening and clearing module and a mailbox size prediction and allocation module. The zombie account screening and clearing steps are: 11) data comparison: comparing the personnel information table with the address book, and then comparing the address book with the background data files; 12) filtering and analyzing the activity log records; 13) comparing the activity results and, once the list is determined, issuing a notice and cleaning up on a regular schedule. The mailbox size prediction and allocation steps are: 21) data preprocessing; 22) classification training and testing; 23) regression training and testing; 24) application of the method. The method clears out zombie accounts and their memory in a systematic way and, using the OA office automation server's log records from a past period, predicts a reasonable mailbox memory allocation for different kinds of staff, thereby reducing the server load, keeping the office automation server running stably, and making work more convenient for staff.

Description

Account clearing and backing and memory allocation prediction method for office automation system
Technical Field
The invention belongs to the field of office automation system management, and particularly relates to an account clearing and memory allocation prediction method for an office automation system.
Background
At present, the office automation systems used by the city- and county-level branches of some large companies are old. As mail forwarding and storage keep growing, and as the temporary accounts opened in past years for city and county job rotations, secondments, third-party companies and the like accumulate, the risk to stable server operation increases. It has therefore become important to make sound use of the human-resources personnel information, the address book, the background personnel data files and tens of millions of background log records, so as to clear out zombie accounts and data files systematically and to predict from the log records the memory that should be allocated.
Disclosure of Invention
To remedy the deficiencies of the prior art, the invention aims to provide an account clearing and memory allocation prediction method for an office automation system. Using existing Python office automation technology, information extraction technology and mainstream big data technology, zombie accounts and their memory can be cleared out systematically; combined with the existing OA office automation server log records, a reasonable mailbox memory allocation for different kinds of employees is predicted, which reduces the server load and makes work more convenient for employees while keeping the office automation server running stably.
In order to achieve the purpose, the invention adopts the technical scheme that:
an account number clearing and memory allocation prediction method for an office automation system comprises two processing modules: a zombie account screening and clearing module and a mailbox size prediction and distribution module;
the processing steps of the zombie account screening and clearing module comprise:
11) data comparison: comparing the personnel information table with the address book, and then comparing the address book with the background data file;
12) filtering and analyzing activity log records;
13) comparing the results of step 12), and after the list is determined, issuing a notice and cleaning up on a regular schedule;
wherein the personnel information table refers to the list of currently employed staff provided by the human resources department; the address book refers to the personnel data in the office automation (OA) management page; the background data file refers to the background mapping file used for managing data in the page;
the mailbox size predicting and distributing module comprises the following processing steps:
21) the data preprocessing specifically comprises the following steps:
21-1) log acquisition and filtering: exporting the OA office automation log file (nsf), and writing different regular expressions to filter out all daily operation records such as receiving, sending and settings;
21-2) feature value extraction and calculation: designing all the feature fields in light of the actual work content, and progressively eliminating feature fields during training; wherein the feature fields include: total memory, used memory, memory utilization, monthly memory change rate, mail receiving and sending frequency, attachment size threshold, mailbox local backup degree, mailbox local backup space size, self-cleaning degree, number of mailbox size adjustment records, mailbox adjustment change value, and the classification prediction result (whether the size is adjusted);
21-3) normalization: after the feature values of the feature fields are calculated, their magnitudes differ widely; the proportion of each feature value to the sum of all values is therefore calculated and used as that feature value, which rescales the value range and increases discrimination;
22) carrying out classification training test;
23) carrying out regression training test: according to the classification result of the step 22), carrying out quantitative regression prediction on the data needing to redistribute the mailbox sizes, thereby achieving the goal of scientifically distributing the mailbox sizes;
24) for the input of daily operation record data of the account, classification prediction is firstly completed, and then prediction of the size of the mailbox is scientifically realized after the classification prediction is completed.
Further, step 12 specifically includes the following steps:
12-1) exporting the office automation activity log; 12-2) analyzing the log by sorting out its composition, function categories and attribute values; 12-3) writing regular expressions to further extract the dates, personnel and operations from activity records containing 'X deliver to Y' and 'Replicate'; 12-4) judging whether the data contains all activity records; if it does, calculating the activity degree from the receiving and sending records, the operation records and the backup records; if not, returning to step 12-3) to screen again.
Further, the activity calculation in step 12-4) is specifically: based on the log filtering results, counting per month how often each company mailbox is checked, sent from, deleted from and configured; if the checking, receiving and sending frequencies are all 0 for two consecutive months, the activity value is 0; otherwise it is 1.
Further, the method for calculating the total amount of the existing memory, the existing used memory and the memory usage rate in the step 21-2) specifically comprises the following steps:
21-2-1) calculating the total memory amount, the existing used memory and the memory utilization rate in a month unit since the creation of the mailbox application;
21-2-2) respectively calculating mode values, average values and weighted average values of the three types of attributes, respectively using the three types of values as feature vectors, and subsequently selecting corresponding values according to the prediction accuracy.
Further, step 22) the classification training test is implemented by the following steps:
22-1) selecting a logistic regression algorithm, and finishing initialization of corresponding parameters by combining a logistic regression principle and application requirements:
suppose that: $y_1 = 1$ is the positive class of the two classes, i.e. the mailbox size needs to be adjusted;
$y_2 = 0$ is the negative class of the two classes, i.e. the mailbox size does not need to be adjusted;
the first step: the hypothesis function:
first, a Sigmoid function is defined:
Figure GDA0003742866510000041
in the linear regression algorithm, the hypothesis function is defined as $h_\theta(x) = \theta^T x$, where $\theta$ is the parameter vector; the range of this hypothesis function is $(-\infty, +\infty)$; in binary classification the output y can only take the value 0 or 1, so $\theta^T x$ is wrapped in a Sigmoid function to bring its range into (0, 1), giving the following definition:
Figure GDA0003742866510000042
wherein P represents the probability that y takes the value 0 or 1;
if $h_\theta(x) = 0.8$, there is an 80% probability that $y = y_1$, i.e. that y is 1 when the input is x; accordingly, the probability that $y = y_2$, i.e. that y is 0, is 20%;
the second step: the decision boundary:
since the hypothesis function represents a probability, it follows that:
if
Figure GDA0003742866510000043
if
Figure GDA0003742866510000044
to make predictions more accurate in practice, y = 1 is predicted when $h_\theta(x) \ge 0.7$; otherwise y = 0;
the third step: calculating the cost function:
in linear regression, the cost function is defined as:
Figure GDA0003742866510000051
according to maximum likelihood estimation, the cost function is modified as follows:
Figure GDA0003742866510000052
wherein $x^{(i)}$ represents the i-th feature vector, $y^{(i)}$ represents the value to be predicted, and m represents the number of sample records; if $y = y_1$, then as $h_\theta(x) \to 0$ the cost approaches infinity, which shows that the hypothesis is wrong in that case, and vice versa;
the cost function can also be written as follows:
Figure GDA0003742866510000053
this cost function is convex, and the global optimum is found by gradient descent;
22-2) feature screening: preliminarily determining candidate characteristics according to various candidate characteristic attributes and histograms and statistical graphs marked by results;
22-3) normalization processing: the normalization processing is put into a module of classification training;
22-4) recursive feature elimination: training an estimator on the initial feature set, and obtaining the importance of each feature through the coef_ attribute or the feature_importances_ attribute; deleting the least important features from the current feature set; repeating this process recursively on the pruned set until the desired number of features is reached;
22-5) executing the model: calling the model and calculating the prediction accuracy; this accuracy is that of the original model, before any treatment for over-fitting or under-fitting;
22-6) performing 10-fold cross-validation to guard against over-fitting, until the average accuracy is close to the prediction accuracy obtained before cross-validation;
22-7) model verification: after the fitting treatment, the model is verified mainly on the following indicators: prediction precision, recall, and the ROC curve.
Further, in step 23), the candidate regression models for quantitatively predicting the mailbox allocation size are: Decision Trees, Random Forest and Polynomial Regression; the algorithm with the higher accuracy in cross training and testing on the training data is selected for actual use.
The invention has the following beneficial effects:
(1) through data comparison, OA personnel classification and mailbox allocation size prediction, the workload of information operation and maintenance staff is effectively reduced, the server load is reduced in practice, and the goal of reducing the burden on front-line staff is met;
(2) big data technology (classification and regression) is applied to office automation (OA mailboxes), solving a practical problem in the electric power sector with emerging technology and improving work efficiency, while also providing a reference for other applications of big data technology.
Drawings
FIG. 1 is a flow chart of zombie account screening and clearing in accordance with the present invention;
FIG. 2 is a flow chart of mailbox size prediction and allocation in accordance with the present invention;
FIG. 3 is a graph of the Sigmoid function of the present invention;
FIG. 4 is a graph of the $\mathrm{Cost}(h_\theta(x), y)$ function of the present invention.
Detailed Description
The invention provides an account clearing and memory allocation prediction method for an office automation system, comprising two processing modules: a zombie account screening and clearing module and a mailbox size prediction and allocation module. Using Python office automation technology, information extraction technology and mainstream big data technology, the zombie account screening and clearing module clears out zombie accounts and their memory, and the OA office automation server log records are used to predict a reasonable mailbox memory allocation for different kinds of staff. This reduces the server load and makes work more convenient for staff while keeping the office automation server running stably.
As shown in fig. 1, the processing steps of the zombie account screening and clearing module include:
11) data comparison: the personnel information table is first compared with the address book, and the address book is then compared with the background data file.
12) The filtering and analyzing of the activity log record specifically comprises the following steps:
12-1) exporting an automated office activity log;
12-2) performing log analysis by combing log composition, function classification and attribute values;
12-3) writing regular expressions to further extract the dates, personnel and operations from activity records containing 'X deliver to Y' and 'Replicate';
12-4) judging whether the data contains all activity records; if it does, calculating the activity degree from the receiving and sending records, the operation records and the backup records; if not, returning to step 12-3) to screen again.
Wherein, the activity calculation is specifically: based on the log filtering results, counting per month how often each company mailbox is checked, sent from, deleted from and configured; if the checking, receiving and sending frequencies are all 0 for two consecutive months, the activity value is 0; otherwise it is 1.
13) Comparing the activity results of step 12), and after the list is determined, issuing a notice and cleaning up on a regular schedule.
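Steps 12-3) and 12-4) can be sketched as follows. This is a minimal illustration, not the patented implementation: the log line layout, the regular expression, the sample records and the function names are all assumptions, and "two consecutive months" is read here as any two adjacent months within the inspected window.

```python
import re
from collections import Counter

# Hypothetical log line layout: "YYYY-MM-DD HH:MM:SS <user> <operation ...>"
LOG_PATTERN = re.compile(
    r"^(?P<month>\d{4}-\d{2})-\d{2} \S+ (?P<user>\S+) (?P<op>deliver to|Replicate)\b"
)

def monthly_counts(lines):
    """Count matching operations per (user, month)."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            counts[(match.group("user"), match.group("month"))] += 1
    return counts

def activity_value(user, months, counts):
    """Return 0 if the user has two consecutive months with no operations, else 1."""
    idle_run = 0
    for month in months:
        # A Counter returns 0 for keys it has never seen.
        idle_run = idle_run + 1 if counts[(user, month)] == 0 else 0
        if idle_run >= 2:
            return 0
    return 1

# Made-up sample records for illustration only.
sample_log = [
    "2020-01-05 09:00:00 zhang deliver to li",
    "2020-02-10 10:00:00 zhang Replicate",
    "2020-03-01 08:30:00 zhang deliver to wang",
    "2020-01-07 11:00:00 temp01 deliver to li",
    "2020-03-02 12:00:00 system noise line",
]
months = ["2020-01", "2020-02", "2020-03"]
counts = monthly_counts(sample_log)
```

Here `activity_value("temp01", months, counts)` flags the temporary account as inactive because it has no matching records in February and March.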
Wherein the personnel information table refers to the list of currently employed staff provided by the human resources department; the address book refers to the personnel data in the office automation (OA) management page; the background data file refers to the background mapping file used to manage the data in the page (similar to a database: one person may have several accounts in the management interface, and these may map to the same nsf file in the background).
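The two comparisons of step 11) amount to set differences over the three data sources. The sketch below assumes hypothetical in-memory structures (a roster of staff IDs, an account-to-owner address book, and an account-to-nsf-file mapping); the real system's export formats are not described in the source.

```python
def compare_records(hr_roster, address_book, background_files):
    """Return (accounts whose owner is not on the roster, data files no live account uses)."""
    roster = set(hr_roster)
    # Comparison 1: address-book accounts whose owner is missing from the HR roster.
    unmatched_accounts = sorted(
        account for account, owner in address_book.items() if owner not in roster
    )
    live_accounts = set(address_book) - set(unmatched_accounts)
    # Comparison 2: background files that are not mapped to any remaining account.
    files_in_use = {
        background_files[account]
        for account in live_accounts
        if account in background_files
    }
    unmatched_files = sorted(set(background_files.values()) - files_in_use)
    return unmatched_accounts, unmatched_files

# Made-up sample data: one person ("E001") holds two accounts sharing one nsf file.
hr_roster = ["E001", "E002"]
address_book = {"zhang": "E001", "zhang_tmp": "E001", "li": "E002", "vendor1": "X900"}
background_files = {
    "zhang": "zhang.nsf",
    "zhang_tmp": "zhang.nsf",
    "li": "li.nsf",
    "vendor1": "vendor1.nsf",
    "ghost": "ghost.nsf",
}
accounts_to_review, files_to_review = compare_records(
    hr_roster, address_book, background_files
)
```

The output is only a candidate list; per step 13) it would still be cross-checked against the activity results before any notice is issued.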
As shown in fig. 2, the processing steps of the mailbox size prediction and allocation module include:
21) the data preprocessing specifically comprises the following steps:
21-1) log acquisition and filtering: the OA office automation log file (nsf) is exported, and different regular expressions are written to filter out all daily operation records such as receiving, sending and settings.
21-2) feature value extraction and calculation: all the feature fields are designed in light of the actual work content, and feature fields are progressively eliminated during training. The feature fields include: total memory, used memory, memory utilization, monthly memory change rate, mail receiving and sending frequency, attachment size threshold, mailbox local backup degree, mailbox local backup space size, self-cleaning degree, number of mailbox size adjustment records, mailbox adjustment change value, and the classification prediction result (whether the size is adjusted).
In addition, the calculation method of the total amount of the existing memory, the existing used memory and the memory utilization rate is as follows:
21-2-1) calculating the total memory amount, the existing used memory and the memory utilization rate in a month unit since the creation of the mailbox application;
21-2-2) respectively calculating mode values, average values and weighted average values of the three types of attributes, respectively using the three types of values as feature vectors, and subsequently selecting corresponding values according to the prediction accuracy.
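Step 21-2-2) can be illustrated with the standard library. The source does not say how the weighted average is weighted, so the linearly increasing weights below (recent months count more) are an assumption, as are the function name and the sample figures.

```python
from statistics import mean, mode

def summarize_monthly(values):
    """Return the mode, mean and weighted mean of a monthly series (oldest first)."""
    # Assumed weighting: month k (1-based, oldest first) gets weight k.
    weights = range(1, len(values) + 1)
    weighted = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return {"mode": mode(values), "mean": mean(values), "weighted_mean": weighted}

# Made-up used-memory figures in MB for four consecutive months.
summary = summarize_monthly([800, 1000, 1000, 1200])
```

Each of the three summaries can then be tried as the feature value, keeping whichever gives the better prediction accuracy, as the step describes.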
The monthly memory change rate is calculated per month for each pair of adjacent months as (M2 - M1)/M1, where M1 and M2 are the memory utilization of the earlier and the later month respectively.
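A minimal sketch of the (M2 - M1)/M1 calculation over a monthly series follows; the function name, the zero-division guard and the sample values are assumptions added for illustration.

```python
def monthly_change_rates(usage_by_month):
    """Return (M2 - M1) / M1 for each pair of adjacent months, oldest first."""
    rates = []
    for m1, m2 in zip(usage_by_month, usage_by_month[1:]):
        # Guard against an empty mailbox in the earlier month (assumed convention: 0.0).
        rates.append((m2 - m1) / m1 if m1 else 0.0)
    return rates

# Made-up usage figures for three consecutive months.
rates = monthly_change_rates([400, 500, 450])
```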
The number of mails received and the number of mails sent are counted from the log operation records.
The attachment size threshold is the average attachment size, calculated per month.
Mailbox local backup saves the mailbox content to the local disk, so that the network mailbox space is used sensibly. This feature value is obtained mainly from the backup records in the log together with the sudden drops in the memory utilization series.
The mailbox local backup space size equals the total memory multiplied by Δt, where Δt is the change in memory utilization per unit time.
The self-cleaning degree, i.e. how often the user proactively deletes mail, is calculated per month from the mail deletion frequency.
The number of times the mailbox size has been adjusted since the mailbox was created, and the amount of space added, are counted.
21-3) normalization: after the feature values of the feature fields are calculated, their magnitudes differ widely; the proportion of each feature value to the sum of all values is therefore calculated and used as that feature value, which rescales the value range and increases discrimination.
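The proportion-of-total normalization described in step 21-3) can be sketched as below. The function name and the all-zero fallback are assumptions; the source only states that each value is replaced by its share of the sum.

```python
def proportion_normalize(values):
    """Replace each value with its share of the column total."""
    total = sum(values)
    if total == 0:
        # Assumed convention: an all-zero column stays all zero.
        return [0.0 for _ in values]
    return [value / total for value in values]

# Made-up raw values of one feature across three mailboxes.
normalized = proportion_normalize([2, 3, 5])
```

After this step every feature column sums to 1, so features with very different raw scales become comparable.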
In actual operation, each mailbox is allocated a fixed amount of memory by default, such as 2 GB. The classification training and testing therefore selects the user mailboxes whose predicted probability exceeds a given threshold, separating mailbox users into those whose mailbox size needs adjusting and those whose does not; prediction of the mailbox allocation size is then completed.
22) Carrying out classification training test, and specifically comprising the following steps:
22-1) selecting a logistic regression algorithm, and finishing initialization of corresponding parameters by combining a logistic regression principle and application requirements:
suppose that:
$y_1 = 1$ is the positive class of the two classes, i.e. the mailbox size needs to be adjusted;
$y_2 = 0$ is the negative class of the two classes, i.e. the mailbox size does not need to be adjusted.
The first step: the hypothesis function:
first, a Sigmoid function is defined:
Figure GDA0003742866510000091
The graph of the function is shown in FIG. 3.
In the linear regression algorithm, the hypothesis function is defined as $h_\theta(x) = \theta^T x$, where $\theta$ is the parameter vector; the range of this hypothesis function is $(-\infty, +\infty)$. In binary classification the output y can only take the value 0 or 1, so $\theta^T x$ is wrapped in a Sigmoid function to bring its range into (0, 1), giving the following definition:
Figure GDA0003742866510000101
wherein P represents the probability that y takes the value 0 or 1; if $h_\theta(x) = 0.8$, there is an 80% probability that $y = y_1$, i.e. that y is 1 when the input is x; accordingly, the probability that $y = y_2$, i.e. that y is 0, is 20%.
The threshold may be adjusted to suit practical needs; if it is set to 0.9, y is assigned to the class only when the confidence exceeds 90%. In this way the binary classification problem is converted into a probability problem.
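For reference, the textbook logistic-regression definitions that fit the description above are written out below. This is a sketch of the standard forms, on the assumption that the equation images show them; the symbol $g$ for the Sigmoid function is a name introduced here.

```latex
\begin{align}
g(z) &= \frac{1}{1 + e^{-z}} \\
h_\theta(x) &= g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}} \\
P(y = 1 \mid x; \theta) &= h_\theta(x), \qquad
P(y = 0 \mid x; \theta) = 1 - h_\theta(x)
\end{align}
```

With these forms, $h_\theta(x) = 0.8$ gives $P(y = 1 \mid x; \theta) = 0.8$ and $P(y = 0 \mid x; \theta) = 0.2$, which agrees with the 80% / 20% example in the text.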
The second step: the decision boundary:
since the hypothesis function represents a probability, it follows that:
if
Figure GDA0003742866510000102
if
Figure GDA0003742866510000103
Let
Figure GDA0003742866510000104
Then $\theta^T x = 0$ is the decision boundary.
To make predictions more accurate in practice, y = 1 is predicted when $h_\theta(x) \ge 0.7$; otherwise y = 0.
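The effect of raising the decision threshold from the default 0.5 to 0.7 can be sketched as follows. The parameter values and feature vectors are made up for illustration, and `predict` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    """Standard logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x, threshold=0.7):
    """Return (label, probability); label is 1 only if the probability reaches the threshold."""
    z = sum(t * xi for t, xi in zip(theta, x))
    probability = sigmoid(z)
    return (1 if probability >= threshold else 0), probability

theta = [0.0, 2.0]  # made-up parameters; x[0] is the bias term
confident_label, confident_p = predict(theta, [1.0, 1.0])      # z = 2.0
borderline_label, borderline_p = predict(theta, [1.0, 0.25])   # z = 0.5
```

The borderline sample has a probability above 0.5 but below 0.7, so it is labelled 1 under a 0.5 threshold and 0 under the stricter 0.7 threshold.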
The third step: calculating the cost function (the optimization objective):
in linear regression, the cost function is defined as:
Figure GDA0003742866510000105
Since this is a convex function, it can be minimized directly by gradient descent, and any local minimum is the global minimum. Here $x^{(i)}$ denotes the i-th feature vector, $y^{(i)}$ denotes the value to be predicted, and m denotes the number of sample records.
In logistic regression, however, $h_\theta(x)$ is a nonlinear function, so this cost becomes non-convex, and applying gradient descent directly may get trapped in a local minimum.
According to maximum likelihood estimation (MLE), the cost function is modified as follows:
Figure GDA0003742866510000111
If $y = y_1$, i.e. y = 1, the graph of $\mathrm{Cost}(h_\theta(x), y)$ is as shown in FIG. 4: as $h_\theta(x) \to 0$ (i.e. y would be judged to be 0), the cost approaches infinity, which shows that the hypothesis is wrong in that case, and vice versa;
the cost function can also be written as follows:
Figure GDA0003742866510000112
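For reference, the standard forms of the three cost expressions discussed in this step are given below. They are a reconstruction under the assumption that the equation images show the usual textbook definitions; the name $J(\theta)$ for the total cost is introduced here.

```latex
\begin{align}
J_{\text{linear}}(\theta) &= \frac{1}{2m} \sum_{i=1}^{m}
  \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \\
\mathrm{Cost}(h_\theta(x), y) &=
  \begin{cases}
    -\log h_\theta(x) & \text{if } y = 1 \\
    -\log\left(1 - h_\theta(x)\right) & \text{if } y = 0
  \end{cases} \\
J(\theta) &= -\frac{1}{m} \sum_{i=1}^{m}
  \left[ y^{(i)} \log h_\theta(x^{(i)})
       + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]
\end{align}
```

The piecewise form matches the behaviour described in the text: for $y = 1$, $-\log h_\theta(x) \to \infty$ as $h_\theta(x) \to 0$, and the single-line form reduces to the same two cases when $y^{(i)}$ is 0 or 1.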
This cost function is convex, and the global optimum is found by gradient descent.
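A minimal batch gradient descent on the log-loss cost can be sketched in plain Python as below. The learning rate, step count, clamping constant and toy data are assumptions chosen for illustration; a production system would more likely rely on a library solver.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = sigmoid(theta^T x)."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)))

def cost(theta, X, y):
    """Average log-loss over all samples."""
    total = 0.0
    for xi, yi in zip(X, y):
        # Clamp to avoid log(0) when the model is extremely confident.
        h = min(max(hypothesis(theta, xi), 1e-12), 1 - 1e-12)
        total += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return -total / len(X)

def gradient_descent(X, y, learning_rate=0.5, steps=2000):
    """Batch gradient descent starting from theta = 0."""
    theta = [0.0] * len(X[0])
    m = len(X)
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for xi, yi in zip(X, y):
            error = hypothesis(theta, xi) - yi
            for j, value in enumerate(xi):
                grad[j] += error * value / m
        theta = [t - learning_rate * g for t, g in zip(theta, grad)]
    return theta

# Toy data: first column is the bias term, second a made-up memory utilization.
X = [[1.0, 0.1], [1.0, 0.2], [1.0, 0.8], [1.0, 0.9]]
y = [0, 0, 1, 1]
theta = gradient_descent(X, y)
```

Because the log-loss is convex, the starting point does not affect which minimum is approached; only the learning rate and step count affect how closely it is reached.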
After data preprocessing and normalization, the logistic regression workflow proceeds as follows:
22-2) feature screening: candidate features are preliminarily determined from histograms and statistical plots of each candidate feature attribute against the result label.
22-3) normalization: the normalization processing should be placed inside the classification training module.
22-4) recursive feature elimination (pruning the feature vectors, i.e. filtering for the model): an estimator is trained on the initial feature set, and the importance of each feature is obtained through the coef_ attribute or the feature_importances_ attribute; the least important features are removed from the current feature set; this process is repeated recursively on the pruned set until the desired number of features is reached.
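The `coef_` and `feature_importances_` attributes named in step 22-4) are the ones scikit-learn's `RFE` reads, so a sketch with that library is shown below. The choice of scikit-learn, the synthetic data set, the 12 input features and the target of 5 retained features are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the 12 candidate feature fields.
X, y = make_classification(
    n_samples=200, n_features=12, n_informative=4, n_redundant=0, random_state=0
)

# Drop one feature per round (step=1) until 5 remain; importance comes from coef_.
selector = RFE(
    estimator=LogisticRegression(max_iter=1000), n_features_to_select=5, step=1
)
selector.fit(X, y)

kept_columns = [index for index, keep in enumerate(selector.support_) if keep]
```

`selector.ranking_` gives rank 1 to every retained feature and higher ranks to features eliminated earlier, which helps when deciding how many features to keep.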
22-5) executing the model (checking the prediction accuracy of the current model): the model is called and the prediction accuracy is calculated; this accuracy is that of the original model, before any treatment for over-fitting or under-fitting.
22-6) performing 10-fold cross-validation: this guards against over-fitting, and is continued until the average accuracy is close to the prediction accuracy obtained before cross-validation.
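The mechanics of 10-fold cross-validation can be sketched without any library as below. The `fit` and `score` callbacks, the majority-class toy model and the sample labels are hypothetical stand-ins; in practice the logistic regression model from step 22-1) would be plugged in.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]

def cross_validate(samples, labels, fit, score, k=10):
    """Train on k-1 folds, score on the held-out fold, and average the k scores."""
    scores = []
    for train, test in k_fold_indices(len(samples), k):
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        scores.append(
            score(model, [samples[i] for i in test], [labels[i] for i in test])
        )
    return sum(scores) / len(scores)

# Toy model: always predict the majority label seen during training.
def fit_majority(train_samples, train_labels):
    return max(set(train_labels), key=train_labels.count)

def accuracy(model, test_samples, test_labels):
    return sum(1 for label in test_labels if label == model) / len(test_labels)

samples = list(range(40))
labels = [1] * 30 + [0] * 10
average_accuracy = cross_validate(samples, labels, fit_majority, accuracy)
```

Comparing `average_accuracy` with the accuracy measured on the training data is the check the step describes: a large gap suggests over-fitting.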
22-7) model verification: after the fitting treatment, the model is verified mainly on the following indicators: prediction precision (the rate at which test samples are predicted correctly after being run through the model), recall (which intuitively reflects how well the classifier identifies the mailboxes whose size needs adjusting, i.e. y = 1), and the ROC curve (which shows the classifier's performance graphically; the dashed line represents the ROC curve of a purely random classifier, and the curve of a good classifier should lie as far from that line as possible, toward the upper-left corner).
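The three indicators can be computed directly from labels and predicted probabilities, as in the sketch below. It uses the conventional definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), ROC as true-positive rate against false-positive rate); the sample labels and scores are made up, and the 0.7 cut-off follows the threshold used earlier.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def roc_points(y_true, scores):
    """(false-positive rate, true-positive rate) at each distinct score threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under the ROC points; 0.5 corresponds to random guessing."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:])
    )

# Made-up labels and predicted probabilities for six mailboxes.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.75, 0.6, 0.4, 0.2]
y_pred = [1 if s >= 0.7 else 0 for s in scores]

precision, recall = precision_recall(y_true, y_pred)
auc = area_under_curve(roc_points(y_true, scores))
```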
23) Carrying out regression training test: and according to the classification result of the step 22), carrying out quantitative regression prediction on the data needing to redistribute the mailbox sizes, thereby achieving the goal of scientifically distributing the mailbox sizes.
The candidate regression models for quantitatively predicting the mailbox allocation size are: Decision Trees, Random Forest and Polynomial Regression; the algorithm with the higher accuracy in cross training and testing on the training data is selected for actual use.
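The model comparison can be sketched with scikit-learn as below. The library choice, the synthetic target, the hyperparameters, the 5-fold split and the R² scoring are all assumptions; the source only names the three candidate models and says the more accurate one is chosen.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in: three normalized features and a made-up mailbox size in GB.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(120, 3))
y = 2.0 + 3.0 * X[:, 0] + 1.5 * X[:, 1] ** 2 + rng.normal(0, 0.05, size=120)

candidates = {
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    # Polynomial regression = polynomial feature expansion + linear regression.
    "polynomial": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
}

# Mean cross-validated R^2 per candidate; the highest score wins.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}
best_model_name = max(cv_scores, key=cv_scores.get)
```

Which model wins depends entirely on the data; on real mailbox logs the ranking may differ from what this synthetic example produces.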
24) Applying the method: for the daily operation record data of an input account, classification prediction is completed first, and the mailbox size is then predicted.
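The two-stage application can be sketched as a small dispatcher: classify first, and run the regression only for mailboxes that need adjusting. The function names, the toy classifier and regressor, and the feature keys are hypothetical; the 2 GB default and the 0.7 threshold are taken from the description above.

```python
def allocate_mailbox(features, classify, regress, current_gb=2.0, threshold=0.7):
    """Return the mailbox size in GB: unchanged unless the classifier says to adjust."""
    probability = classify(features)
    if probability < threshold:
        return current_gb
    return regress(features)

# Toy stand-ins for the trained models.
def toy_classifier(features):
    return features["usage_rate"]

def toy_regressor(features):
    return round(features["used_gb"] * 1.5, 2)

heavy_user = allocate_mailbox(
    {"usage_rate": 0.95, "used_gb": 1.9}, toy_classifier, toy_regressor
)
light_user = allocate_mailbox(
    {"usage_rate": 0.30, "used_gb": 0.6}, toy_classifier, toy_regressor
)
```

Keeping the regression behind the classification gate means mailboxes that do not need adjusting never receive a new size prediction.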

Claims (6)

1. An account number clearing and memory allocation prediction method for an office automation system is characterized by comprising the following steps: the device comprises two processing modules: a zombie account screening and clearing module and a mailbox size prediction and distribution module;
the processing steps of the zombie account screening and clearing module comprise:
11) data comparison: comparing the personnel list with the communication record table, and then comparing the communication record table with the background data file;
12) filtering and analyzing activity log records;
13) comparing the results of step 12) and, once the list is determined, issuing a notice and cleaning up at regular intervals;
wherein the personnel list refers to the list of in-service personnel provided by the human resources department; the communication record table refers to the personnel data in the office automation (OA) management page; the background data file refers to the background mapping file used to manage the data in that page;
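The data comparison of step 11) amounts to set differences between the three sources. The sketch below is one possible reading, with hypothetical account identifiers and result keys; how the real tables are loaded and keyed is not specified in the text.

```python
from typing import Dict, Iterable, List


def screen_candidates(
    personnel_list: Iterable[str],
    communication_table: Iterable[str],
    background_file: Iterable[str],
) -> Dict[str, List[str]]:
    """Compare the three sources and list accounts that do not line up."""
    on_staff = set(personnel_list)
    in_oa = set(communication_table)
    in_background = set(background_file)
    return {
        # Listed in the OA page but no longer on the HR list of in-service personnel.
        "not_on_staff": sorted(in_oa - on_staff),
        # Present in the background mapping file but not shown in the OA page.
        "only_in_background": sorted(in_background - in_oa),
        # Shown in the OA page but missing from the background mapping file.
        "missing_from_background": sorted(in_oa - in_background),
    }


# Hypothetical identifiers for illustration only.
personnel = ["user_a", "user_b", "user_c"]
oa_page = ["user_a", "user_b", "user_x"]
background = ["user_a", "user_b", "user_x", "user_y"]

candidates = screen_candidates(personnel, oa_page, background)
```

The resulting lists are only candidates; per steps 12) and 13), they would still be checked against the activity log before any notice is issued.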
the processing steps of the mailbox size prediction and allocation module comprise:
21) data preprocessing, specifically comprising the following steps:
21-1) log acquisition and filtering: exporting the OA office automation log file (nsf) and writing separate regular expressions to filter out all daily operation records, such as receiving, sending and settings;
21-2) feature value extraction and calculation: designing all feature fields in light of the actual working content, and progressively eliminating feature fields during training;
wherein the feature fields include: the existing total memory, the existing used memory, the memory utilization rate, the monthly memory change rate, the mail receiving and sending frequency, the attachment size threshold, the degree of local mailbox backup, the size of the local mailbox backup space, the degree of self-cleaning, the number of mailbox size records, the mailbox adjustment change value, and the classification prediction result of whether to adjust;
21-3) normalization: because the computed feature values differ greatly in magnitude, the proportion of each feature value to the sum of all values of that feature is used as the feature value, which rescales the value range and increases discrimination;
22) carrying out classification training test;
23) carrying out regression training test: carrying out quantitative regression prediction on the data needing to reallocate the mailbox sizes according to the classification result of the step 22);
24) for the input daily operation record data of an account, first completing classification prediction and then completing prediction of the mailbox size.
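The normalization of step 21-3) in claim 1, where each value is replaced by its share of the column total, can be sketched as below. The function name and the sample values are illustrative assumptions; the handling of an all-zero column is a choice made for this sketch and is not stated in the text.

```python
from typing import List, Sequence


def normalize_by_proportion(values: Sequence[float]) -> List[float]:
    """Replace each value by its proportion of the sum of all values of the feature."""
    total = sum(values)
    if total == 0:
        # Assumption for this sketch: an all-zero feature stays all-zero.
        return [0.0 for _ in values]
    return [value / total for value in values]


# Made-up used-memory figures (MB) for three mailboxes.
used_memory_mb = [200.0, 300.0, 500.0]
normalized = normalize_by_proportion(used_memory_mb)
```

After this step every feature lies in the same range and sums to 1, so a feature measured in megabytes no longer dominates one measured as a rate.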
2. The account clearing and memory allocation prediction method for the office automation system as set forth in claim 1, wherein: the step 12) specifically comprises the following steps:
12-1) exporting the office automation activity log;
12-2) analyzing the log by sorting out its composition, function categories and attribute values;
12-3) writing regular expressions to further screen out the date, personnel and operation records in activity records containing 'Xdeliverer to Y' and 'replate';
12-4) judging whether the data contains all activity records; if so, calculating the activity degree from the sending and receiving records, the operation records and the backup records; if not, returning to step 12-3) for re-screening.
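The regular-expression screening of step 12-3) might look like the following. The log line layout used here is entirely hypothetical, since the text does not give the real format of the exported log; only the idea of extracting date, personnel and operation from each activity record is taken from the claim.

```python
import re
from typing import Iterable, List, NamedTuple


class ActivityRecord(NamedTuple):
    date: str
    user: str
    operation: str


# Hypothetical line layout: "<YYYY-MM-DD> <HH:MM:SS> <user> <operation>".
LINE_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) \d{2}:\d{2}:\d{2} "
    r"(?P<user>\S+) (?P<operation>receive|send|delete|setting|backup)\b"
)


def filter_activity(lines: Iterable[str]) -> List[ActivityRecord]:
    """Keep only lines that describe a mailbox operation and split them into fields."""
    records = []
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match:
            records.append(
                ActivityRecord(match["date"], match["user"], match["operation"])
            )
    return records


sample_lines = [
    "2020-09-03 08:15:22 user_a send",
    "2020-09-03 08:20:00 system heartbeat",
    "2020-09-04 17:02:41 user_b setting",
    "malformed line",
]
activity = filter_activity(sample_lines)
```

Lines that do not match, such as system noise or malformed entries, are simply dropped, which corresponds to the re-screening loop in step 12-4) when the pattern has to be refined.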
3. The account clearing and memory allocation prediction method for an office automation system as set forth in claim 2, wherein: the activity degree calculation in step 12-4) is specifically as follows: based on the log filtering results, counting the checking, sending, deleting and setting frequencies of each mailbox of the company on a monthly basis; if the checking, sending and setting frequencies are all 0 for two consecutive months, the activity value is 0; otherwise it is 1.
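The activity rule of claim 3 can be sketched as a monthly count followed by a zero test. The record layout, the operation names and the choice of which two consecutive months to inspect (here, the two passed in by the caller) are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple

# Operations whose monthly frequency decides the activity value (assumed names).
TRACKED_OPERATIONS = {"check", "send", "set"}


def monthly_counts(records: Iterable[Tuple[str, str, str]]) -> Counter:
    """Count tracked operations per (mailbox, "YYYY-MM").

    Each record is a (date "YYYY-MM-DD", mailbox, operation) tuple.
    """
    counts: Counter = Counter()
    for date, mailbox, operation in records:
        if operation in TRACKED_OPERATIONS:
            counts[(mailbox, date[:7])] += 1
    return counts


def activity_value(counts: Counter, mailbox: str, months: Sequence[str]) -> int:
    """Return 0 if the mailbox has no tracked operation in any given month, else 1."""
    return 0 if all(counts[(mailbox, month)] == 0 for month in months) else 1


# Made-up filtered log records.
records = [
    ("2020-08-14", "user_a", "send"),
    ("2020-09-02", "user_a", "check"),
    ("2020-06-30", "user_b", "send"),
]
counts = monthly_counts(records)
window = ["2020-08", "2020-09"]  # two consecutive months under inspection

activity_a = activity_value(counts, "user_a", window)
activity_b = activity_value(counts, "user_b", window)
```

A mailbox with activity value 0 is a zombie-account candidate; it is not removed automatically but goes through the notice and periodic cleaning described in step 13).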
4. The account clearing and memory allocation prediction method for an office automation system as set forth in claim 1, wherein: the method for calculating the existing total memory, the existing used memory and the memory utilization rate in step 21-2) specifically comprises the following steps:
21-2-1) calculating the total memory, the existing used memory and the memory utilization rate month by month since the mailbox was created;
21-2-2) calculating the mode, the mean and the weighted mean of each of the three attributes, using each of the three statistics as a feature vector, and subsequently selecting the appropriate statistic according to the prediction accuracy.
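The three statistics of step 21-2-2) can be computed with the Python standard library as sketched below. The weighting scheme is not given in the text; the recency weights 1, 2, ..., n used here are purely an assumption, as are the function name and the sample series.

```python
import statistics
from typing import Dict, Optional, Sequence


def summarize(
    monthly_values: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> Dict[str, float]:
    """Return the mode, mean and weighted mean of one attribute's monthly series."""
    if weights is None:
        # Assumed weighting: later months count more (1 for the first month, n for the last).
        weights = range(1, len(monthly_values) + 1)
    weighted_sum = sum(value * weight for value, weight in zip(monthly_values, weights))
    return {
        "mode": statistics.mode(monthly_values),
        "mean": statistics.mean(monthly_values),
        "weighted_mean": weighted_sum / sum(weights),
    }


# Made-up monthly memory utilization of one mailbox since its creation.
utilization = [0.5, 0.5, 0.75, 1.0]
summary = summarize(utilization)
```

Each of the three statistics would then be tried as the feature value in turn, keeping whichever gives the better prediction accuracy. For continuous measurements the mode is only meaningful if values repeat, so rounding before taking the mode may be needed in practice.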
5. The account clearing and memory allocation prediction method for an office automation system as set forth in claim 1, wherein: step 22) the classification training test is realized by the following steps:
22-1) selecting the logistic regression algorithm and initializing its parameters in light of the logistic regression principle and the application requirements:
suppose that: $y_1 = 1$ is the positive class of the binary classification, i.e. the mailbox size needs to be adjusted;
$y_2 = 0$ is the negative class of the binary classification, i.e. the mailbox size does not need to be adjusted;
the first step: the hypothesis function:
first, a Sigmoid function is defined:
Figure FDA0003742866500000031
in the linear regression algorithm, the hypothesis function is defined as $h_\theta(x) = \theta^T x$, where $\theta$ is the parameter vector; the range of this hypothesis function is $(-\infty, +\infty)$; in binary classification the output y can only take the value 0 or 1, so $\theta^T x$ is wrapped in a Sigmoid function, which restricts the range to (0, 1), and the following definition is given:
Figure FDA0003742866500000032
wherein P represents the probability that the output y is 0 or 1, respectively;
if $h_\theta(x) = 0.8$, there is an 80% probability that $y = y_1$, i.e. that y is 1 when the input is x; correspondingly, the probability that $y = y_2$, i.e. that y is 0, is 20%;
the second step:
since the hypothesis function above represents a probability, it can be deduced that:
if
Figure FDA0003742866500000041
and if
Figure FDA0003742866500000042
in view of actual requirements and to improve the prediction rate, the decision boundary is set as follows: y is 1 when $h_\theta(x) \ge 0.7$; otherwise y is 0;
the third step: calculating a cost function:
in linear regression, the cost function is defined as:
Figure FDA0003742866500000043
according to the maximum likelihood estimation, the cost function is modified as follows:
Figure FDA0003742866500000044
wherein $x^{(i)}$ represents the i-th feature vector, $y^{(i)}$ represents the corresponding value to be predicted, and m represents the number of sample records; if $y = y_1$, it is easy to see that as $h_\theta(x) \to 0$ the cost approaches infinity, which indicates that the hypothesis is wrong in this case, and vice versa;
the cost function can also be written as follows:
Figure FDA0003742866500000045
the cost function is convex, and its global optimum is found by gradient descent;
22-2) feature screening: preliminarily determining the candidate features from histograms and statistical charts of each candidate feature attribute plotted against the result label;
22-3) normalization: the normalization step is placed inside the classification training module;
22-4) recursive feature elimination: training an estimator on the initial feature set and obtaining the importance of each feature through its coef_ attribute or its feature_importances_ attribute; deleting the least important features from the current feature set; repeating this process recursively on the pruned set until the desired number of features is reached;
22-5) executing the model: calling the model and calculating the prediction precision; at this point the prediction precision reflects the original model state, before any treatment for overfitting or underfitting;
22-6) performing 10-fold cross-validation: guarding against overfitting, until the average precision is close to the prediction precision obtained before cross-validation;
22-7) model verification: after the fitting treatment, the model is verified mainly on the following indicators: prediction precision, recall and the ROC curve.
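Step 22-1) of claim 5 can be illustrated with a small pure-Python logistic regression trained by batch gradient descent. The formulas in the comments are the standard textbook forms and are assumed, not confirmed, to match the formula images referenced in the claim; the learning rate, iteration count and training data are made up for the example.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """g(z) = 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def hypothesis(theta: Sequence[float], x: Sequence[float]) -> float:
    """h_theta(x) = g(theta^T x), read as the probability that y = 1."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def cost(theta: Sequence[float], X: Sequence[Sequence[float]], y: Sequence[int]) -> float:
    """J(theta) = -(1/m) * sum(y*log(h) + (1-y)*log(1-h)) (standard form, assumed)."""
    eps = 1e-12  # keeps log() away from 0
    total = 0.0
    for xi, yi in zip(X, y):
        h = min(max(hypothesis(theta, xi), eps), 1.0 - eps)
        total += yi * math.log(h) + (1 - yi) * math.log(1.0 - h)
    return -total / len(X)


def gradient_descent(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    alpha: float = 0.5,
    iterations: int = 2000,
) -> List[float]:
    """Minimize J(theta) with batch updates theta_j -= alpha * (1/m) * sum((h - y) * x_j)."""
    theta = [0.0] * len(X[0])
    m = len(X)
    for _ in range(iterations):
        errors = [hypothesis(theta, xi) - yi for xi, yi in zip(X, y)]
        gradient = [
            sum(e * xi[j] for e, xi in zip(errors, X)) / m for j in range(len(theta))
        ]
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return theta


def predict(theta: Sequence[float], x: Sequence[float], threshold: float = 0.7) -> int:
    """Apply the claim's decision rule: y = 1 when h_theta(x) >= 0.7, otherwise 0."""
    return 1 if hypothesis(theta, x) >= threshold else 0


# Made-up data: [bias term, memory utilization]; 1 = mailbox size needs adjusting.
X = [[1.0, 0.1], [1.0, 0.2], [1.0, 0.3], [1.0, 0.8], [1.0, 0.9], [1.0, 0.95]]
y = [0, 0, 0, 1, 1, 1]

theta = gradient_descent(X, y)
```

Raising the threshold from the usual 0.5 to 0.7 makes the classifier more reluctant to flag a mailbox for resizing, trading some recall for fewer false positives.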
6. The account clearing and memory allocation prediction method for an office automation system as set forth in claim 1, wherein: in step 23), when performing quantitative regression prediction, the candidate regression models for quantitatively predicting the mailbox allocation size include: Decision Tree, Random Forest and Polynomial Regression; the algorithm with the highest accuracy in cross-training tests on the training data is selected as the algorithm for actual use.
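The selection rule of claim 6 can be sketched as a k-fold comparison of candidate regressors. To stay self-contained, the sketch uses two deliberately simple stand-in models (a mean baseline and a one-variable least-squares line) in place of the Decision Tree, Random Forest and Polynomial Regression named in the claim, and it uses the lowest mean squared error as the proxy for "highest accuracy"; both choices are assumptions.

```python
from typing import Callable, Dict, Sequence, Tuple

Predictor = Callable[[float], float]
Fitter = Callable[[Sequence[float], Sequence[float]], Predictor]


def fit_mean(xs: Sequence[float], ys: Sequence[float]) -> Predictor:
    """Baseline stand-in: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Predictor:
    """Stand-in regressor: ordinary least squares on a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def cross_val_mse(fit: Fitter, xs: Sequence[float], ys: Sequence[float], k: int) -> float:
    """Average test-fold mean squared error over k folds (fold = index modulo k)."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        mse = sum((model(xs[i]) - ys[i]) ** 2 for i in test_idx) / len(test_idx)
        fold_errors.append(mse)
    return sum(fold_errors) / k


def select_model(
    candidates: Dict[str, Fitter], xs: Sequence[float], ys: Sequence[float], k: int = 5
) -> Tuple[str, Dict[str, float]]:
    """Return the candidate with the lowest cross-validated error, plus all scores."""
    scores = {name: cross_val_mse(fit, xs, ys, k) for name, fit in candidates.items()}
    return min(scores, key=scores.get), scores


# Made-up data: used memory (GB) versus the mailbox size that was later allocated (GB).
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]

best_name, cv_scores = select_model({"mean": fit_mean, "line": fit_line}, xs, ys)
```

With real data, each entry in the candidates dictionary would wrap one of the three models named in the claim, and the same harness would pick whichever generalizes best.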
CN202011125936.9A 2020-10-20 2020-10-20 Account clearing and backing and memory allocation prediction method for office automation system Active CN112232774B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011125936.9A CN112232774B (en) 2020-10-20 2020-10-20 Account clearing and backing and memory allocation prediction method for office automation system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011125936.9A CN112232774B (en) 2020-10-20 2020-10-20 Account clearing and backing and memory allocation prediction method for office automation system

Publications (2)

Publication Number Publication Date
CN112232774A CN112232774A (en) 2021-01-15
CN112232774B true CN112232774B (en) 2022-09-09

Family

ID=74118117

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011125936.9A Active CN112232774B (en) 2020-10-20 2020-10-20 Account clearing and backing and memory allocation prediction method for office automation system

Country Status (1)

Country Link
CN (1) CN112232774B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113068067B (en) * 2021-03-19 2022-08-12 北京达佳互联信息技术有限公司 Account recalling method and device

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101730879A (en) * 2007-02-21 2010-06-09 阿瓦亚公司 Voicemail filtering and transcribing
CN103580919A (en) * 2013-11-04 2014-02-12 复旦大学 Method and system for marking mail user by utilizing mail server blog
CN103973666A (en) * 2013-08-13 2014-08-06 哈尔滨理工大学 Spam botnet host detection method and device
CN104901847A (en) * 2015-05-27 2015-09-09 国家计算机网络与信息安全管理中心 Social network zombie account detection method and device
CN106682118A (en) * 2016-12-08 2017-05-17 华中科技大学 Social network site false fan detection method achieved on basis of network crawler by means of machine learning
CN107800607A (en) * 2016-08-31 2018-03-13 腾讯科技(深圳)有限公司 A kind of method and device for clearing up mailbox data
CN108540431A (en) * 2017-03-03 2018-09-14 阿里巴巴集团控股有限公司 The recognition methods of account type, device and system
CN109783632A (en) * 2019-02-15 2019-05-21 腾讯科技(深圳)有限公司 Customer service information-pushing method, device, computer equipment and storage medium
CN110766427A (en) * 2018-07-25 2020-02-07 阿里巴巴集团控股有限公司 Advertisement bidding method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6515621B2 (en) * 2015-03-25 2019-05-22 富士通株式会社 Mail processing server, mail processing method, and mail processing program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
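Visual Analysis of Abnormal Behavior Based on Multidimensional Time-Series Logs (基于多维时序日志的异常行为可视分析); Zhang Wenqi et al.; Computer Engineering and Applications (《计算机工程与应用》); 20200515; full text *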

Also Published As

Publication number Publication date
CN112232774A (en) 2021-01-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant