CN112232774B

CN112232774B - Account clearing and memory allocation prediction method for office automation system

Info

Publication number: CN112232774B
Application number: CN202011125936.9A
Authority: CN
Inventors: 承春明; 赵欣慧; 王永翔; 赵东坡; 刘思远; 陈恩权; 张瑞; 王金珂
Original assignee: Luohe Power Supply Company State Grid Henan Electric Power Co
Current assignee: Luohe Power Supply Company State Grid Henan Electric Power Co
Priority date: 2020-10-20
Filing date: 2020-10-20
Publication date: 2022-09-09
Anticipated expiration: 2040-10-20
Also published as: CN112232774A

Abstract

The invention discloses an account clearing and memory allocation prediction method for an office automation system, comprising a zombie account screening and clearing module and a mailbox size prediction and allocation module. The processing steps of zombie account screening and clearing comprise: 11) data comparison: comparing the personnel information table with the address book, and then comparing the address book with the background data files; 12) filtering and analyzing activity log records; 13) comparing the activity results and, once the list is determined, publishing a notice and cleaning up on a regular schedule. The processing steps of mailbox size prediction and allocation comprise: 21) data preprocessing; 22) classification training and testing; 23) regression training and testing; 24) application. The method clears zombie accounts and their memory in a scientific and reasonable way and, using the log records of the OA (office automation) server over a past period, predicts the reasonable mailbox memory that should be allocated to different categories of staff, thereby reducing the load on the server, keeping the office automation server running stably, and making work more convenient for staff.

Description

Account clearing and memory allocation prediction method for office automation system

Technical Field

The invention belongs to the field of office automation system management, and particularly relates to an account clearing and memory allocation prediction method for an office automation system.

Background

At present, the office automation systems used by the city- and county-level branches of some large companies are old. With ever more mail being forwarded and stored, and with the growing number of temporary accounts opened in past years for city- and county-level rotations, secondments, third-party companies and the like, the risk to the stable operation of the servers increases. It has therefore become important to make reasonable use of the human resources personnel information, the address book, the background personnel data files and the tens of millions of background log records in order to clear zombie accounts and data files scientifically and intelligently, and to predict from the log records the memory that should be allocated.

Disclosure of Invention

To overcome the shortcomings of the prior art, the invention aims to provide an account clearing and memory allocation prediction method for an office automation system. By applying existing Python office automation technology, information extraction technology and mainstream big data technology, zombie accounts and memory can be cleared in a scientific and reasonable way, and, in combination with the existing OA office automation server log records, the reasonable mailbox memory that should be allocated to different categories of employees can be predicted, which reduces the server load, keeps the office automation server running stably and makes work more convenient for employees.

In order to achieve the purpose, the invention adopts the technical scheme that:

an account number clearing and memory allocation prediction method for an office automation system comprises two processing modules: a zombie account screening and clearing module and a mailbox size prediction and distribution module;

the processing steps of the zombie account screening and clearing module comprise:

11) data comparison: comparing the personnel information table with the address book, and then comparing the address book with the background data file;

12) filtering and analyzing activity log records;

13) comparing the activity results from step 12) and, once the list is determined, publishing a notice and cleaning up on a regular schedule;

wherein the personnel information table refers to the list of current employees of the main business provided by the human resources department; the address book refers to the personnel data in the office automation (OA) management page; and the background data file refers to the background mapping file used for managing the data in that page;
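As an illustration of step 11), the two comparisons can be expressed as set operations. The following Python sketch is a minimal example under assumed data shapes: the function name, the dictionary layouts and the sample identifiers are hypothetical and are not specified by the invention.

```python
def find_cleanup_candidates(hr_roster, address_book, backend_files):
    """Return (orphan_accounts, orphan_files) for manual review.

    hr_roster:     set of employee IDs on the personnel information table
    address_book:  dict of account name -> employee ID (OA management page)
    backend_files: dict of data file name -> account name (background mapping)
    All three layouts are assumptions made for this sketch.
    """
    # First comparison: accounts whose owner is not on the personnel table.
    orphan_accounts = {acct for acct, emp in address_book.items()
                       if emp not in hr_roster}
    # Second comparison: data files mapped to an orphan or unknown account.
    orphan_files = {name for name, acct in backend_files.items()
                    if acct in orphan_accounts or acct not in address_book}
    return orphan_accounts, orphan_files


hr_roster = {"E001", "E002"}
address_book = {"zhang.san": "E001", "li.si": "E002", "temp.vendor": "E900"}
backend_files = {"zhangsan.nsf": "zhang.san", "vendor.nsf": "temp.vendor",
                 "old.nsf": "gone.user"}
accounts, files = find_cleanup_candidates(hr_roster, address_book, backend_files)
```

The result is only a candidate list; in the method described here it is still cross-checked against the activity results of step 12) before any notice is published.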

the mailbox size predicting and distributing module comprises the following processing steps:

21) the data preprocessing specifically comprises the following steps:

21-1) log acquisition and filtering: exporting the OA office automation log file (nsf), and writing different regular expressions to filter all daily operation records such as receiving, sending and settings;

21-2) feature value extraction and calculation: designing all the feature fields in combination with the actual work content, and continually eliminating feature fields during training; wherein the feature fields include: total memory amount, used memory, memory utilization rate, monthly memory change rate, mail receiving and sending frequency, attachment size threshold, degree of local mailbox backup, size of the local mailbox backup space, self-cleaning degree, number of recorded mailbox size adjustments, mailbox adjustment change value, and whether adjustment is needed (the classification prediction result);

21-3) normalization: after the feature vectors of the feature fields are calculated, their values differ greatly in magnitude; the proportion of each feature vector value in the sum of all values is therefore calculated and used as that feature vector value, which scales the value range and increases the discrimination;

22) carrying out classification training test;

23) carrying out regression training test: according to the classification result of the step 22), carrying out quantitative regression prediction on the data needing to redistribute the mailbox sizes, thereby achieving the goal of scientifically distributing the mailbox sizes;

24) application: for the input daily operation record data of an account, classification prediction is completed first, and the mailbox size is then predicted.

Further, step 12 specifically includes the following steps:

12-1) exporting the office automation activity log; 12-2) analyzing the log by sorting out its composition, function classification and attribute values; 12-3) writing regular expressions to further screen the dates, personnel and operation records in activity records containing 'X deliver to Y' and 'Replicate'; 12-4) judging whether the data contains all activity records; if so, calculating the activity degree from the receiving and sending records, the operation records and the backup records; if not, returning to step 12-3) for re-screening.

Further, the activity calculation in step 12-4) is specifically: using the log filtering results, counting the viewing, sending, deleting and setting frequency of each company mailbox on a monthly basis; if the counted frequencies are 0 for two consecutive months, the activity value is 0; otherwise it is 1.
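The activity rule can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, each list entry is assumed to be the combined monthly count of viewing, sending, deleting and setting operations, and the rule is read as "any two consecutive months in the supplied window", which the text does not state explicitly.

```python
def activity_value(monthly_counts):
    """Return 0 if two consecutive months both have zero operations, else 1.

    monthly_counts: chronological list of per-month operation totals
    (viewing + sending + deleting + settings), an assumed input layout.
    """
    for previous, current in zip(monthly_counts, monthly_counts[1:]):
        if previous == 0 and current == 0:
            return 0
    return 1


dormant = activity_value([3, 0, 0, 5])   # two silent months in a row
active = activity_value([3, 0, 2, 0])    # silent months are not adjacent
```

If only the most recent two months should count, the caller can pass just the last two entries of the list.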

Further, the method for calculating the total amount of the existing memory, the existing used memory and the memory usage rate in the step 21-2) specifically comprises the following steps:

21-2-1) calculating the total memory amount, the existing used memory and the memory utilization rate in a month unit since the creation of the mailbox application;

21-2-2) respectively calculating mode values, average values and weighted average values of the three types of attributes, respectively using the three types of values as feature vectors, and subsequently selecting corresponding values according to the prediction accuracy.
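A minimal Python sketch of step 21-2-2) follows. The function name is hypothetical, and the default weights (later months weigh more) are an assumption, since the weighting scheme is not given in the text.

```python
from collections import Counter


def summarize_monthly(values, weights=None):
    """Return the mode, mean and weighted mean of a monthly series.

    values:  chronological monthly values (e.g. memory utilization rate)
    weights: optional per-month weights; defaults to 1, 2, ..., n so that
             recent months count more (an assumption of this sketch)
    """
    if not values:
        raise ValueError("values must not be empty")
    weights = list(weights) if weights is not None else list(range(1, len(values) + 1))
    mode = Counter(values).most_common(1)[0][0]   # first most frequent value
    mean = sum(values) / len(values)
    weighted_mean = sum(v * w for v, w in zip(values, weights)) / sum(weights)
    return {"mode": mode, "mean": mean, "weighted_mean": weighted_mean}


usage_rate = [0.5, 0.5, 0.75, 1.0]
summary = summarize_monthly(usage_rate)
```

Each of the three statistics can then be tried as the feature value in turn, keeping whichever gives the better prediction accuracy, as the step describes.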

Further, step 22) the classification training test is implemented by the following steps:

22-1) selecting a logistic regression algorithm, and initializing the corresponding parameters in accordance with the logistic regression principle and the application requirements:

suppose that: $y = y_1 = 1$ is the positive class of the binary classification, i.e., the mailbox size needs to be adjusted;

$y = y_2 = 0$ is the negative class of the binary classification, i.e., the mailbox size does not need to be adjusted;

the first step: the hypothesis function:

first, the Sigmoid function is defined:

$$g(z) = \frac{1}{1 + e^{-z}}$$

in the linear regression algorithm, the hypothesis function is defined as $h_\theta(x) = \theta^T x$, where $\theta$ is the parameter vector; the range of this hypothesis function is $(-\infty, +\infty)$; in binary classification, the output y can only take the value 0 or 1, so $\theta^T x$ is wrapped in a Sigmoid function to bring its range into (0, 1), which gives the following definition:

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

$$P(y = 1 \mid x; \theta) = h_\theta(x), \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$

wherein P represents the corresponding probability when y outputs 0 or 1;

if $h_\theta(x) = 0.8$, there is an 80% probability that $y = y_1$, i.e., that y is 1 when the input is x; accordingly, the probability that $y = y_2$, i.e., that y is 0, is 20%;

the second step:

since the hypothesis function represents a probability, it can be deduced that:

if $h_\theta(x) \ge 0.5$, i.e. $\theta^T x \ge 0$, then $y = 1$;

if $h_\theta(x) < 0.5$, i.e. $\theta^T x < 0$, then $y = 0$;

in combination with actual requirements, to make the prediction more accurate, $y = 1$ is assigned here when $h_\theta(x) \ge 0.7$; otherwise the decision result is set to 0;

the third step: calculating the cost function:

in linear regression, the cost function is defined as:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

according to maximum likelihood estimation, the cost function is modified as follows:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right), \qquad \mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)), & y = 1 \\ -\log(1 - h_\theta(x)), & y = 0 \end{cases}$$

wherein $x^{(i)}$ represents the i-th feature vector, $y^{(i)}$ represents the corresponding value to be predicted, and m represents the number of sample records; if $y = y_1$, it is easy to see that as $h_\theta(x) \to 0$ the cost approaches infinity, which indicates that the hypothesis is wrong, and vice versa;

the cost function can also be written as follows:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

this cost function is a convex function, and the global optimal solution is found by gradient descent;

22-2) feature screening: preliminarily determining the candidate features from the histograms and statistical charts of the candidate feature attributes against the result labels;

22-3) normalization: the normalization processing is placed inside the classification training module;

22-4) recursive feature elimination: training an estimator on the initial feature set, and obtaining the importance of each feature through the coef_ attribute or through the feature_importances_ attribute; deleting the least important features from the current feature set; repeating this process recursively on the pruned set until the desired number of features is reached;

22-5) executing the model: calling the model and calculating the prediction accuracy; the accuracy calculated here is that of the original model, before any treatment for overfitting or underfitting;

22-6) performing 10-fold cross-validation to guard against overfitting, until the average accuracy is close to the prediction accuracy obtained before cross-validation;

22-7) model verification: the verification of the model after the fitting treatment mainly covers the following indexes: prediction precision, recall and the ROC curve.
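The 10-fold cross-validation of step 22-6) can be sketched in plain Python as follows. The helper names are hypothetical, the folds are contiguous and unshuffled for simplicity, and the threshold classifier is only a stand-in for the logistic regression model.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) for k contiguous, near-equal folds."""
    if n_samples < k:
        raise ValueError("need at least k samples")
    start = 0
    for fold in range(k):
        # The first (n_samples % k) folds receive one extra sample.
        size = n_samples // k + (1 if fold < n_samples % k else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_val_accuracy(X, y, fit, predict, k=10):
    """Average accuracy over k folds for a classifier given as callables."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(1 for i in test_idx if predict(model, X[i]) == y[i])
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)


# Hypothetical stand-in classifier: threshold at the mean training input.
def fit_threshold(X_train, y_train):
    return sum(X_train) / len(X_train)


def predict_threshold(threshold, x):
    return 1 if x > threshold else 0


X = list(range(20))
y = [0] * 10 + [1] * 10
mean_accuracy = cross_val_accuracy(X, y, fit_threshold, predict_threshold, k=10)
```

Comparing `mean_accuracy` with the accuracy measured in step 22-5) gives the check the step describes: a large gap between the two suggests overfitting. In practice the samples would normally be shuffled before splitting.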

Further, in step 23), when performing quantitative regression prediction, the candidate regression models for quantitatively predicting the mailbox allocation size include: Decision Trees, Random Forest and Polynomial Regression; through cross training and testing on the training data, the algorithm with the higher accuracy is selected as the algorithm for actual application.

The invention has the following beneficial effects:

(1) through data comparison, OA personnel classification and mailbox allocation size prediction, the invention effectively reduces the workload of information operation and maintenance personnel and genuinely reduces the load on the server, in line with the goal of reducing the burden on grassroots staff;

(2) big data technology (classification and regression) is applied to office automation (the OA mailbox), so that a practical problem of the electric power company is solved with emerging technology and working efficiency is improved, while also providing a reference for other applications of big data technology.

Drawings

FIG. 1 is a flow chart of zombie account screening and clearing in accordance with the present invention;

FIG. 2 is a flow chart of mailbox size prediction and allocation in accordance with the present invention;

FIG. 3 is a graph of the Sigmoid function of the present invention;

FIG. 4 is a graph of the $\mathrm{Cost}(h_\theta(x), y)$ function of the present invention.

Detailed Description

The invention provides an account clearing and memory allocation prediction method for an office automation system, which comprises two processing modules: a zombie account screening and clearing module and a mailbox size prediction and allocation module. By applying Python office automation technology, information extraction technology and mainstream big data technology, the zombie account screening and clearing module clears zombie accounts and their memory, and the reasonable mailbox memory that should be allocated to different categories of staff is predicted from the OA office automation server log records, which reduces the server load, keeps the office automation server running stably and makes work more convenient for staff.

As shown in fig. 1, the processing steps of the zombie account screening and clearing module include:

11) Data comparison: the personnel information table is first compared with the address book, and the address book is then compared with the background data files.

12) The filtering and analyzing of the activity log record specifically comprises the following steps:

12-1) exporting an automated office activity log;

12-2) performing log analysis by combing log composition, function classification and attribute values;

12-3) writing regular expressions to further screen the dates, personnel and operation records in activity records containing 'X deliver to Y' and 'Replicate';

12-4) judging whether the data contains all activity records; if all the activity records are contained, calculating the activity degree according to the receiving and sending records, the operation records and the backup records; and if not, returning to the step 12-3) for re-screening.

Wherein, the activity calculation is specifically: using the log filtering results, counting the viewing, sending, deleting and setting frequency of each company mailbox on a monthly basis; if the counted frequencies are 0 for two consecutive months, the activity value is 0; otherwise it is 1.

13) Comparing the activity results from step 12) and, once the list is determined, publishing a notice and cleaning up on a regular schedule.

Wherein the personnel information table refers to the list of current employees provided by the human resources department; the address book refers to the personnel data in the office automation (OA) management page; the background data files refer to the background mapping files used for managing the data in that page (similar to a database: one person may have several accounts in the management interface, and these may map to the same nsf file in the background).
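The regular-expression screening of step 12-3) might look like the following Python sketch. The log line layout, the pattern names and the sample lines are all hypothetical, since the actual OA log format is not reproduced in this document; only the 'deliver to' and 'Replicate' keywords come from the text.

```python
import re

# Assumed line layout: "<date> <time> <text>". Only the two keywords are
# taken from the description; everything else is illustrative.
DELIVER_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) .*?(?P<sender>\S+) deliver to (?P<recipient>\S+)",
    re.IGNORECASE,
)
REPLICATE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) .*?Replicate\s+(?P<account>\S+)",
    re.IGNORECASE,
)


def parse_activity(lines):
    """Extract (kind, date, person, counterpart) tuples from log lines."""
    records = []
    for line in lines:
        match = DELIVER_RE.search(line)
        if match:
            records.append(("deliver", match.group("date"),
                            match.group("sender"), match.group("recipient")))
            continue
        match = REPLICATE_RE.search(line)
        if match:
            records.append(("replicate", match.group("date"),
                            match.group("account"), None))
    return records


sample_log = [
    "2020-03-01 09:15:02 Router: alice deliver to bob",
    "2020-03-01 09:20:11 Replicate carol",
    "2020-03-01 09:21:00 Server heartbeat",
]
records = parse_activity(sample_log)
```

Lines that match neither pattern are dropped, which corresponds to the filtering described in step 12-3); the extracted tuples can then be grouped by person and month for the activity calculation.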

As shown in fig. 2, the processing steps of the mailbox size prediction and allocation module include:

21) the data preprocessing specifically comprises the following steps:

21-1) log acquisition and filtering: exporting the OA office automation log file (nsf), and writing different regular expressions to filter all daily operation records such as receiving, sending and settings.

21-2) feature value extraction and calculation: designing all the feature fields in combination with the actual work content, and continually eliminating feature fields during training. The feature fields include: total memory amount, used memory, memory utilization rate, monthly memory change rate, mail receiving and sending frequency, attachment size threshold, degree of local mailbox backup, size of the local mailbox backup space, self-cleaning degree, number of recorded mailbox size adjustments, mailbox adjustment change value, and whether adjustment is needed (the classification prediction result).

In addition, the calculation method of the total amount of the existing memory, the existing used memory and the memory utilization rate is as follows:

The monthly memory change rate is calculated for each pair of adjacent months from the memory utilization rate, taking one month as the unit: monthly change rate = (M2 - M1)/M1, where M1 and M2 are the memory utilization rates of the earlier and the later month, respectively.

The mail receiving and sending frequency is obtained by counting the numbers of received and sent mails in the log operation records.

The attachment size threshold is the average attachment size, calculated on a monthly basis.

Local mailbox backup copies the content of the mailbox to the local disk, so that the network mailbox space is used sensibly. This feature value is obtained and calculated mainly from the backup records in the log together with the sudden drops in the memory utilization rate array.

The size of the local mailbox backup space is calculated from the total memory amount and Δt, where Δt is the change in the memory utilization rate per unit time.

The self-cleaning degree, i.e. the frequency with which the user actively deletes mail, is calculated on a monthly basis.

The number of times the mailbox size has been adjusted since the mailbox was created, and the amount of space added, are counted.
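As a worked illustration of the monthly memory change rate, (M2 - M1)/M1 for each pair of adjacent months, the following Python sketch may help. The function name is hypothetical, and returning None when the earlier month is zero is an assumption of this sketch, since the text does not cover that case.

```python
def monthly_change_rates(usage_rates):
    """Return (M2 - M1) / M1 for every pair of adjacent months.

    usage_rates: chronological monthly memory utilization rates.
    A pair whose earlier month is 0 yields None, because the ratio is
    undefined there (handling chosen for this sketch only).
    """
    rates = []
    for m1, m2 in zip(usage_rates, usage_rates[1:]):
        rates.append((m2 - m1) / m1 if m1 else None)
    return rates


rates = monthly_change_rates([0.5, 0.75, 0.75])
```

Here utilization grows by half in the first interval and is flat in the second. A sharply negative rate is the kind of sudden drop that the local backup feature looks for.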

21-3) normalization: after the feature vectors of the feature fields are calculated, their values differ greatly in magnitude; the proportion of each feature vector value in the sum of all values is therefore calculated and used as that feature vector value, which scales the value range and increases the discrimination.
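The proportion-based normalization of step 21-3) can be sketched as follows. The function name is hypothetical, and mapping an all-zero column to zeros is an assumption made to avoid division by zero.

```python
def normalize_by_proportion(values):
    """Scale one feature column so that each entry is its share of the total.

    values: non-negative raw values of one feature across all mailboxes.
    An all-zero column is returned as zeros (handling chosen for this sketch).
    """
    total = sum(values)
    if total == 0:
        return [0.0 for _ in values]
    return [value / total for value in values]


raw_mail_counts = [2, 6, 12]
normalized = normalize_by_proportion(raw_mail_counts)
```

After scaling, every feature lies in [0, 1] and the entries of each column sum to 1, so features with very different raw magnitudes become comparable.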

In actual operation, each mailbox is allocated a fixed amount of memory by default, for example 2 GB. The classification training test therefore selects the user mailboxes whose predicted probability exceeds a given threshold, separating mailbox users into those whose mailbox size needs to be adjusted and those whose mailbox size does not, after which the mailbox allocation size is predicted.

22) Carrying out classification training test, and specifically comprising the following steps:

suppose that:

$y = y_1 = 1$ is the positive class of the binary classification, i.e., the mailbox size needs to be adjusted;

$y = y_2 = 0$ is the negative class of the binary classification, i.e., the mailbox size does not need to be adjusted.

The first step: the hypothesis function:

first, the Sigmoid function is defined:

$$g(z) = \frac{1}{1 + e^{-z}}$$

the graph of this function is shown in FIG. 3.

in the linear regression algorithm, the hypothesis function is defined as $h_\theta(x) = \theta^T x$, where $\theta$ is the parameter vector; the range of this hypothesis function is $(-\infty, +\infty)$; in binary classification, the output y can only take the value 0 or 1, so $\theta^T x$ is wrapped in a Sigmoid function to bring its range into (0, 1), which gives the following definition:

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

$$P(y = 1 \mid x; \theta) = h_\theta(x), \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$

wherein P represents the corresponding probability when y outputs 0 or 1; if $h_\theta(x) = 0.8$, there is an 80% probability that $y = y_1$, i.e., that y is 1 when the input is x; accordingly, the probability that $y = y_2$, i.e., that y is 0, is 20%.

The threshold may be adjusted according to practical considerations; if the threshold is set to 0.9, y is deemed to belong to the class only when the confidence exceeds 90%. In this way, the binary classification problem is converted into a probability problem.

The second step:

if $h_\theta(x) \ge 0.5$, i.e. $\theta^T x \ge 0$, then $y = 1$;

if $h_\theta(x) < 0.5$, i.e. $\theta^T x < 0$, then $y = 0$.

Let $h_\theta(x) = 0.5$; then $\theta^T x = 0$ is the decision boundary.

In combination with actual requirements, to make the prediction more accurate, $y = 1$ is assigned here when $h_\theta(x) \ge 0.7$; otherwise the decision result is set to 0.
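The hypothesis function and the 0.7 decision threshold can be illustrated with a short Python sketch. The function names and the sample parameter values are hypothetical; only the Sigmoid form and the 0.7 threshold come from the description.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)          # avoids overflow for large negative z
    return exp_z / (1.0 + exp_z)


def predict(theta, x, threshold=0.7):
    """Return (probability that y = 1, predicted label) for one sample."""
    z = sum(t * xi for t, xi in zip(theta, x))   # theta^T x
    probability = sigmoid(z)
    return probability, int(probability >= threshold)


theta = [0.0, 1.0]                      # illustrative parameters, not trained
p_high, label_high = predict(theta, [1.0, 1.0])   # theta^T x = 1.0
p_mid, label_mid = predict(theta, [1.0, 0.5])     # theta^T x = 0.5
```

The second sample has a probability above 0.5 but below 0.7, so it is labeled 0 under the stricter threshold, although it would be labeled 1 at the default boundary $\theta^T x = 0$. Raising the threshold trades recall for fewer unnecessary mailbox adjustments.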

The third step: calculating the cost function (the optimization objective):

in linear regression, the cost function is defined as:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

since this is a convex function, it can be solved directly by gradient descent, and the local minimum is the global minimum. Here $x^{(i)}$ represents the i-th feature vector, $y^{(i)}$ represents the corresponding value to be predicted, and m represents the number of sample records.

In logistic regression, however, $h_\theta(x)$ is a complex nonlinear function, so this cost function is non-convex, and applying gradient descent directly may get trapped in a local minimum.

According to Maximum Likelihood Estimation, the cost function is modified as follows:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right), \qquad \mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)), & y = 1 \\ -\log(1 - h_\theta(x)), & y = 0 \end{cases}$$

if $y = y_1$, i.e. y is 1, the graph of the $\mathrm{Cost}(h_\theta(x), y)$ function is shown in FIG. 4; it is easy to see that as $h_\theta(x) \to 0$ (i.e., y would be predicted as 0), the cost approaches infinity, which indicates that the hypothesis is wrong, and vice versa;

the cost function can also be written as follows:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

the cost function is now a convex function, and the global optimal solution is found by gradient descent.
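A minimal pure-Python sketch of minimizing the log-loss form of the logistic regression cost by batch gradient descent is given below. The learning rate, the number of steps and the toy data are assumptions chosen for illustration; the first column of each sample is a constant 1 that serves as the bias term.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def cost(theta, X, y):
    """Average log-loss J(theta) over all samples."""
    eps = 1e-12                       # keeps log() away from 0
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        h = min(max(h, eps), 1.0 - eps)
        total -= yi * math.log(h) + (1 - yi) * math.log(1.0 - h)
    return total / len(X)


def gradient_descent(X, y, learning_rate=0.5, steps=500):
    """Batch gradient descent; the gradient is (1/m) * sum((h - y) * x)."""
    theta = [0.0] * len(X[0])
    m = len(X)
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for xi, yi in zip(X, y):
            error = sigmoid(sum(t * v for t, v in zip(theta, xi))) - yi
            for j, v in enumerate(xi):
                grad[j] += error * v
        theta = [t - learning_rate * g / m for t, g in zip(theta, grad)]
    return theta


# Toy data: [bias, normalized memory utilization]; label 1 = needs adjustment.
X = [[1.0, 0.1], [1.0, 0.2], [1.0, 0.8], [1.0, 0.9]]
y = [0, 0, 1, 1]
initial_cost = cost([0.0, 0.0], X, y)
theta = gradient_descent(X, y)
final_cost = cost(theta, X, y)
```

Because the log-loss is convex, the cost decreases from its starting value of log 2 for a suitably small learning rate, and the learned weight on utilization is positive for this toy data.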

After data preprocessing and normalization, the application flow of logistic regression proceeds as follows:

22-2) feature screening: the candidate features are preliminarily determined from the histograms and statistical charts of the candidate feature attributes against the result labels.

22-3) normalization: the normalization processing should be placed inside the classification training module.

22-4) recursive feature elimination (pruning the feature vectors, i.e. filtering the model): an estimator is trained on the initial feature set, and the importance of each feature is obtained through the coef_ attribute or through the feature_importances_ attribute; the least important features are deleted from the current feature set; this process is repeated recursively on the pruned set until the desired number of features is reached.

22-5) executing the model (verifying the prediction accuracy of the current model): the model is called and the prediction accuracy is calculated; the accuracy calculated here is that of the original model, before any treatment for overfitting or underfitting.

22-6) performing 10-fold cross-validation to guard against overfitting, until the average accuracy is close to the prediction accuracy obtained before cross-validation.

22-7) model verification: the verification of the model after the fitting treatment mainly covers the following indexes: prediction precision (the proportion of test samples predicted by the model as needing adjustment that actually need it), recall (the proportion of mailboxes that actually need their size adjusted, i.e. y is 1, that the classifier correctly identifies), and the ROC curve (which shows the performance of the classifier graphically; the dotted line represents the ROC curve of a purely random classifier, and a good classifier should lie as far from that line as possible, toward the upper left corner).
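Precision and recall for the positive class (y = 1, mailbox needs adjustment) can be computed directly from the predicted and true labels. The following Python sketch uses a hypothetical function name and made-up labels.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for the positive class y = 1."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against empty denominators when no positives are predicted/present.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
```

In this example three of the four predicted adjustments are correct, and three of the four mailboxes that truly need adjustment are found, so both values are 0.75.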

23) Regression training and testing: according to the classification result of step 22), quantitative regression prediction is performed on the data whose mailbox size needs to be reallocated, so that mailbox sizes are allocated on a scientific basis.

The candidate regression models for quantitatively predicting the mailbox allocation size include: Decision Trees, Random Forest and Polynomial Regression; through cross training and testing on the training data, the algorithm with the higher accuracy is selected as the algorithm for actual application.
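The selection of a regression algorithm by cross training and testing can be sketched as below. This is only an illustration of the selection procedure: the two candidate models (a mean predictor and a least-squares line) are simple stand-ins, not the decision tree, random forest or polynomial models named above, and all function names and data are hypothetical.

```python
def fit_mean(xs, ys):
    return sum(ys) / len(ys)


def predict_mean(model, x):
    return model


def fit_line(xs, ys):
    """Ordinary least-squares line through one input feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
             if sxx else 0.0)
    return slope, mean_y - slope * mean_x


def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept


def cv_error(xs, ys, fit, predict, k=5):
    """Mean absolute error over k interleaved folds (sample i -> fold i % k)."""
    total = 0.0
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        total += sum(abs(predict(model, xs[i]) - ys[i]) for i in test)
    return total / len(xs)


def select_model(xs, ys, candidates, k=5):
    """Return (best candidate name, per-candidate error)."""
    errors = {name: cv_error(xs, ys, fit, predict, k)
              for name, (fit, predict) in candidates.items()}
    return min(errors, key=errors.get), errors


xs = [float(i) for i in range(1, 11)]      # e.g. a monthly activity score
ys = [2.0 + 0.5 * x for x in xs]           # mailbox size in GB (made up)
candidates = {"mean": (fit_mean, predict_mean), "line": (fit_line, predict_line)}
best, errors = select_model(xs, ys, candidates)
```

The same `select_model` loop applies unchanged if the candidate pairs wrap tree-based or polynomial regressors from a machine learning library.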

24) Application: for the input daily operation record data of an account, classification prediction is completed first, and the mailbox size is then predicted.
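The two-stage application of step 24) can be summarized in a short Python sketch. The function name, the feature keys and the two stand-in models are hypothetical; the 2 GB default follows the example allocation mentioned earlier in the description.

```python
def recommend_mailbox_size(features, classify, regress, current_size_gb=2.0):
    """Classify first; run the regression only if adjustment is needed."""
    if classify(features) == 1:
        return regress(features)
    return current_size_gb          # no adjustment: keep the current size


# Stand-in models for illustration only (not trained on real data).
def needs_adjustment(features):
    return 1 if features["usage_rate"] >= 0.9 else 0


def size_model(features):
    return features["used_gb"] * 1.5


heavy_user = recommend_mailbox_size(
    {"usage_rate": 0.95, "used_gb": 2.0}, needs_adjustment, size_model)
light_user = recommend_mailbox_size(
    {"usage_rate": 0.40, "used_gb": 0.8}, needs_adjustment, size_model)
```

Running the classifier first means the regression model only sees accounts that actually need a new size, which mirrors the order of steps 22) and 23).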

Claims

1. An account clearing and memory allocation prediction method for an office automation system, characterized in that the method comprises two processing modules: a zombie account screening and clearing module and a mailbox size prediction and allocation module;

12) filtering and analyzing activity log records;

wherein the personnel list refers to a list of the persons in the job provided by the human resources department; the communication record table refers to personnel data in an office automation oa management page; the background data file refers to a background mapping file used for managing data in the page;

21) the data preprocessing specifically comprises the following steps:

21-1) log acquisition and filtering: exporting the OA office automation log file (nsf), and writing different regular expressions to filter all daily operation records of receiving, sending and settings;

21-2) feature value extraction and calculation: designing all the feature fields in combination with the actual work content, and continually eliminating feature fields during training;

wherein the feature fields include: total memory amount, used memory, memory utilization rate, monthly memory change rate, mail receiving and sending frequency, attachment size threshold, degree of local mailbox backup, size of the local mailbox backup space, self-cleaning degree, number of recorded mailbox size adjustments, mailbox adjustment change value, and whether adjustment is needed (the classification prediction result);

21-3) normalization: after the feature vectors of the feature fields are calculated, their values differ greatly in magnitude; the proportion of each feature vector value in the sum of all values is calculated and used as that feature vector value, which scales the value range and increases the discrimination;

22) carrying out classification training test;

23) carrying out regression training test: carrying out quantitative regression prediction on the data needing to reallocate the mailbox sizes according to the classification result of the step 22);

24) and for the input of daily operation record data of the account, firstly completing classification prediction and then completing prediction of the size of the mailbox.

2. The account clearing and memory allocation prediction method for the office automation system as set forth in claim 1, wherein: the step 12) specifically comprises the following steps:

12-1) exporting the office automation activity log; 12-2) analyzing the log by sorting out its composition, function classification and attribute values; 12-3) writing regular expressions to further screen the dates, personnel and operation records in activity records containing 'X deliver to Y' and 'Replicate'; 12-4) judging whether the data contains all activity records; if so, calculating the activity degree from the receiving and sending records, the operation records and the backup records; if not, returning to step 12-3) for re-screening.

3. The account clearing and memory allocation prediction method for an office automation system as set forth in claim 2, wherein: the activity calculation in step 12-4) is specifically: using the log filtering results, counting the viewing, sending, deleting and setting frequency of each company mailbox on a monthly basis; if the counted frequencies are 0 for two consecutive months, the activity value is 0; otherwise it is 1.

4. The account clearing and memory allocation prediction method for an office automation system as set forth in claim 1, wherein: the method for calculating the total amount of the existing memory, the existing used memory and the memory utilization rate in the step 21-2) specifically comprises the following steps:

5. The account clearing and memory allocation prediction method for an office automation system as set forth in claim 1, wherein: step 22) the classification training test is realized by the following steps:

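the first step: the hypothesis function:

first, the Sigmoid function is defined:

$$g(z) = \frac{1}{1 + e^{-z}}$$

in the linear regression algorithm, the hypothesis function is defined as $h_\theta(x) = \theta^T x$, where $\theta$ is the parameter vector; the range of this hypothesis function is $(-\infty, +\infty)$; in binary classification, the output y can only take the value 0 or 1, so $\theta^T x$ is wrapped in a Sigmoid function to bring its range into (0, 1), which gives the following definition:

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

$$P(y = 1 \mid x; \theta) = h_\theta(x), \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$

wherein P represents the corresponding probability when y outputs 0 or 1;

the second step:

if $h_\theta(x) \ge 0.5$, i.e. $\theta^T x \ge 0$, then $y = 1$;

if $h_\theta(x) < 0.5$, i.e. $\theta^T x < 0$, then $y = 0$;

in combination with actual demand, to improve the prediction accuracy, $y = 1$ is assigned here when $h_\theta(x) \ge 0.7$; otherwise the decision result is set to 0;

the third step: calculating the cost function:

in linear regression, the cost function is defined as:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

according to maximum likelihood estimation, the cost function is modified as follows:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right), \qquad \mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)), & y = 1 \\ -\log(1 - h_\theta(x)), & y = 0 \end{cases}$$

wherein $x^{(i)}$ represents the i-th feature vector, $y^{(i)}$ represents the corresponding value to be predicted, and m represents the number of sample records; if $y = y_1$, it is easy to see that as $h_\theta(x) \to 0$ the cost approaches infinity, which indicates that the hypothesis is wrong, and vice versa;

the cost function can also be written as follows:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$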

22-3) normalization treatment: the normalization processing is put into a module of classification training;

22-5) executing the model: calling the model and calculating the prediction accuracy; the accuracy calculated here is that of the original model, before any treatment for overfitting or underfitting;

6. The account clearing and memory allocation prediction method for an office automation system as set forth in claim 1, wherein: in step 23), when performing quantitative regression prediction, the candidate regression models for quantitatively predicting the mailbox allocation size include: Decision Trees, Random Forest and Polynomial Regression; and the algorithm with the highest accuracy is selected as the algorithm for actual application through cross training and testing on the training data.