CN112990382A - Base station common-site identification method based on big data - Google Patents

Base station common-site identification method based on big data

Info

Publication number
CN112990382A
CN112990382A (application CN202110509326.7A)
Authority
CN
China
Prior art keywords
data
base station
sample
cell
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110509326.7A
Other languages
Chinese (zh)
Other versions
CN112990382B (en)
Inventor
寇红侠 (Kou Hongxia)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Orange Frame Technology Jiangsu Co ltd
Original Assignee
Orange Frame Technology Jiangsu Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Orange Frame Technology Jiangsu Co ltd filed Critical Orange Frame Technology Jiangsu Co ltd
Priority to CN202110509326.7A priority Critical patent/CN112990382B/en
Publication of CN112990382A publication Critical patent/CN112990382A/en
Application granted granted Critical
Publication of CN112990382B publication Critical patent/CN112990382B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W24/00 Supervisory, monitoring or testing arrangements
    • H04W24/10 Scheduling measurement reports; Arrangements for measurement reports
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention discloses a big-data-based method for identifying base stations that share a site, belonging to the field of base station co-site identification. The method comprises the following steps: S1, data collection: collect several days of wireless measurement report (MR) data and engineering parameter data; the main index variables used are time, base station SiteId, local cell CellId, local cell TA, local cell RSRP, local cell frequency point, neighbor cell NCellId, neighbor cell frequency point, neighbor cell RSRP, user longitude, user latitude, local cell longitude, local cell latitude, and a co-site flag for the cell. The invention cleans the data measured on different network frequency bands in the wireless environment measurement reports (MR) and then classifies co-sited stations with a machine learning method. This overcomes the influence of inaccurate site information in the resource management system, accurately identifies whether a base station is a shared one, provides strong support for operators in implementing site sharing, and constitutes a scientific, effective and low-cost solution.

Description

Base station common-site identification method based on big data
Technical Field
The invention relates to the technical field of base station co-site identification, in particular to a base station co-site identification method based on big data.
Background
The wireless environment measurement report (MR) data in a mobile communication network accurately reflect network coverage and give operators a good tool for understanding their wireless networks; good network coverage is fundamental to an operator's survival. However, as mobile networks evolve further, in particular from 4G to 5G, the wavelengths of the frequency bands used become ever shorter, multiplying the number of sites that must be built. According to incomplete statistics, China already has more than 4 million 4G sites, and the number of 5G sites will be more than 3 times that, directly driving up operators' total investment cost.
Site sharing is a good cost-optimization strategy. The three operators (China Mobile, China Telecom and China Unicom) have jointly established a tower company, China Tower, which builds base station sites and leases them to the three operators, who pay according to usage. For historical reasons, however, each operator also owns a large number of its own sites, and existing sites cannot be reliably separated into self-owned and shared ones, which in turn affects how China Tower's site costs are apportioned. Today sites are distinguished mainly by their longitude and latitude, but because the basic site information in the operators' resource management systems often disagrees badly with the situation in the field (mainly because the systems are not updated in time after a site is relocated), this classification is inaccurate.
Disclosure of Invention
In order to overcome the above defects in the prior art, embodiments of the present invention provide a big-data-based method for identifying co-sited base stations, which cleans the data measured on different network frequency bands in wireless environment measurement reports (MR) and then classifies co-sited stations with a machine learning method, so as to solve the problems described in the background.
In order to achieve this purpose, the invention provides the following technical scheme: a big-data-based base station co-site identification method comprising the following steps:
s1, data collection: collect several days of wireless measurement report (MR) data and engineering parameter data; the main index variables used are: time, base station SiteId, local cell CellId, local cell TA, local cell RSRP, local cell frequency point, neighbor cell NCellId, neighbor cell frequency point, neighbor cell RSRP, user longitude, user latitude, local cell longitude, local cell latitude, and a co-site flag for the cell;
s2, data processing: process the MR data and engineering parameter data to obtain new data; from the new data, select the MR sampling points whose local-cell RSRP lies in a given range, count the number of MR sampling points of each base station by SiteId, and keep only the base stations whose number of MR sampling points exceeds a set value;
s3, feature extraction: for each base station (SiteId dimension), compute over all its MR sampling points the mean, variance and coefficient of variation of the local-cell and neighbor-cell RSRP and the correlation coefficient between local-cell RSRP and local-cell TA, and compute the same statistics separately for each value of local-cell TA; these values are the feature data of each base station. The co-site flag is the label of each base station; together the two form the new data set;
s4, algorithm modeling: split the feature data into a training set and a test set in a given ratio, train classification models (random forest, GBDT and Xgboost) on the training set, and validate the trained models on the test set;
s5, model selection: train models on the training data with the random forest, GBDT and Xgboost algorithms respectively, tune the parameters of each algorithm until its best model is obtained, and then validate the trained models on the test set;
s6, model application: after the final model has been selected as above, save it; collect MR measurement report data and engineering parameter data, process them, classify the base stations with the saved model, and output the identification results for all base stations.
Further, the step S2 includes the following sub-steps:
s21, match the MR data with the engineering parameters by cell CellId to obtain, for each cell, its position coordinates (longitude and latitude) and the base station it belongs to, and delete matched records whose position coordinates are empty;
s22, from the data processed in step S21, compute the distance from each MR sampling point to the base station using the user's position coordinates (longitude and latitude) and the cell's position coordinates (longitude and latitude); delete MR sampling points that are far from the base station and sampling points whose distance is inconsistent with the TA, obtaining new data;
s23, from the data obtained in step S22, select the MR sampling points whose local-cell RSRP lies in a given range, count the number of MR sampling points of each base station by SiteId, and keep only the base stations whose number of MR sampling points exceeds a set value.
Further, the algorithm of the random forest in the step S4 includes the following steps:
s411, use the bootstrap method to draw, with replacement, K new bootstrap sample sets from the training set and build K classification trees from them; the samples left out of each draw form the K out-of-bag data sets;
s412, at each node of each tree, randomly select m variables (m < M, where M is the total number of variables), compute the information content of each, and split the node on the variable with the strongest classification ability among the m;
s413, generating all decision trees completely without pruning;
s414, the class of a terminal node is the majority (mode) class of the samples falling in that node;
s415, each new observation point is classified by all the trees, and its class is decided by majority vote.
Further, the algorithm of GBDT in step S4 includes the following steps:
s421, initialize the estimates of all samples over the K classes; F_k(X) is a matrix that can be initialized to all zeros or set randomly;
s422, repeat the following learning and updating procedure M times;
s423, apply a logistic (softmax) transformation to the function estimates of the samples, converting each sample's estimates into probabilities of belonging to each class via the transformation:

$$P_k(X) = \frac{\exp\big(F_k(X)\big)}{\sum_{l=1}^{K} \exp\big(F_l(X)\big)}$$
initially every class estimate is 0, so every class has equal probability; as the estimates are updated below, the probabilities change accordingly;
s424, loop over the classes, computing the probabilities of the current class for all samples (the traversal is per class, not per sample);
s425, compute the probability gradient of each sample on class k. Above we obtained, via regression trees, the probability that each sample belongs to a given class k, along with whether it truly belongs to class k; learning proceeds by the usual gradient descent route of setting up a cost function and differentiating it. The log-likelihood form of the cost function is:

$$L = -\sum_{k=1}^{K} y_k \log P_k(X)$$

Differentiating the cost function yields the gradient (residual):

$$\tilde{y}_{ik} = y_{ik} - P_{k,m-1}(X_i)$$
s426, learn a regression tree with J leaf nodes along the gradient direction:

$$\{R_{jkm}\}_{j=1}^{J} = J\text{-leaf tree}\big(\{\tilde{y}_{ik}, X_i\}_{i=1}^{N}\big)$$

Taking all input samples $\{X_i\}_{i=1}^{N}$ and the residuals of their probabilities on class k as the update direction, a regression tree with J leaves is learned. The basic procedure is the usual one for regression trees: traverse the feature dimensions of the samples and select a feature as split point under the minimum mean-squared-error criterion; learning stops once J leaf nodes have been produced;
s427, the gain of each leaf node is calculated, and the gain calculation formula of each node is as follows:
$$\gamma_{jkm} = \frac{K-1}{K} \cdot \frac{\sum_{X_i \in R_{jkm}} \tilde{y}_{ik}}{\sum_{X_i \in R_{jkm}} |\tilde{y}_{ik}|\,\big(1 - |\tilde{y}_{ik}|\big)}$$
s428, updating the estimated values of all samples in class K, where the gain obtained in the previous step is calculated based on the gradient, and the estimated values of the samples can be updated by using the gain:
$$F_{km}(X) = F_{k,m-1}(X) + \sum_{j=1}^{J} \gamma_{jkm}\,\mathbf{1}\big(X \in R_{jkm}\big)$$
in class k of iteration m, the estimates F of all samples are obtained from those of the previous iteration m-1 by adding, for each sample, the gain of the leaf node it falls in (the J leaf gains summed against the indicator vector). After M such iterations, the final matrix of estimates of all samples over all classes is obtained, on the basis of which multi-class classification is realized.
Further, the algorithm of Xgboost in step S4 includes the following steps:
s431, defining the complexity of the tree: firstly splitting a tree into a structure part q and a leaf node weight part w, wherein w is a vector and represents an output value in each leaf node;
$$f_t(x) = w_{q(x)}, \qquad w \in \mathbb{R}^{T}, \; q: \mathbb{R}^{d} \to \{1, \dots, T\}$$
a regularization term Ω(f_t) is introduced to control the complexity of the tree and thereby effectively control model overfitting;
$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}$$
s432, the boosting tree model in XGboost: like GBDT, the boosting model of XGboost fits residuals; the difference is that the criterion for choosing split nodes is not necessarily the minimum squared loss. The loss function is given below; compared with GBDT, a regularization term based on the complexity of the tree model is added:
$$\mathcal{L} = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k)$$

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t)$$
s433, rewriting the objective function: XGboost directly applies a second-order Taylor expansion to the loss function (this requires the loss function to have continuous first and second derivatives), and the set of samples falling in leaf j is defined as:

$$I_j = \{\, i \mid q(x_i) = j \,\}$$
our objective function can be converted into:
$$Obj^{(t)} = \sum_{j=1}^{T} \Big[ G_j w_j + \tfrac{1}{2}\,(H_j + \lambda)\, w_j^{2} \Big] + \gamma T, \qquad G_j = \sum_{i \in I_j} g_i, \; H_j = \sum_{i \in I_j} h_i$$
differentiating with respect to w_j and setting the derivative to zero gives:
$$w_j^{*} = -\frac{G_j}{H_j + \lambda}$$

$$Obj = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T$$
s434, scoring function of the tree structure: the Obj value above is the largest reduction of the objective attainable once a tree structure is fixed, so we may call it the structure score; it can be regarded as a more general analogue of the Gini index for scoring tree structures. To find the structure with the smallest Obj we could enumerate all possibilities and compare their structure scores, but that is computationally prohibitive; in practice a greedy method is used instead: at each step we try to split an existing leaf node (the first leaf being the root), and the gain of a split is:
$$Gain = \frac{1}{2}\left[ \frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda} - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda} \right] - \gamma$$
Gain is the criterion for deciding whether to split: if Gain < 0, the leaf node is not split. In principle every candidate split must still be enumerated; in practice all samples are first sorted by their gradient statistics g_i, so that one scan over the samples yields G_L and G_R for every split position, and the split is then made according to the Gain score.
Further, the verification in step S5 consists in computing the precision, recall and F1 score of each model, with the following formulas:
$$Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
wherein, TP is the number of positive classes judged as positive classes, FP is the number of negative classes judged as positive classes, FN is the number of positive classes judged as negative classes;
as the definitions of recall and precision show, improving one of the two tends, to some extent, to lower the other; the F1 score therefore gives a combined view of identification performance. Comparing the F1 values of the three models on the test set, the model with the largest F1 is selected as the final model, and its classification results are output.
The invention has the technical effects and advantages that:
compared with the prior art, the method and the device have the advantages that the data measured by the different network frequency band signals in the wireless environment measurement report MR are cleaned, and then the machine learning method is adopted to realize the classification of the common-site sites. The verification proves that the method successfully overcomes the influence of inaccurate station information in a resource management system, can accurately identify whether the base station is a shared base station, provides powerful support for landing of co-station sharing of operators, and is a scientific, effective and low-cost solution.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in FIG. 1, a big-data-based base station co-site identification method includes the following steps:
s1, data collection: collect several days of wireless measurement report (MR) data and engineering parameter data; the main index variables used are: time, base station SiteId, local cell CellId, local cell TA, local cell RSRP, local cell frequency point, neighbor cell NCellId, neighbor cell frequency point, neighbor cell RSRP, user longitude, user latitude, local cell longitude, local cell latitude, and a co-site flag for the cell;
s2, data processing: process the MR data and engineering parameter data to obtain new data; from the new data, select the MR sampling points whose local-cell RSRP lies in a given range, count the number of MR sampling points of each base station by SiteId, and keep only the base stations whose number of MR sampling points exceeds a set value;
step S2 includes the following substeps:
s21, match the MR data with the engineering parameters by cell CellId to obtain, for each cell, its position coordinates (longitude and latitude) and the base station it belongs to, and delete matched records whose position coordinates are empty;
s22, from the data processed in step S21, compute the distance from each MR sampling point to the base station using the user's position coordinates (longitude and latitude) and the cell's position coordinates (longitude and latitude); delete MR sampling points that are far from the base station and sampling points whose distance is inconsistent with the TA, obtaining new data;
s23, from the data obtained in step S22, select the MR sampling points whose local-cell RSRP lies in a given range, count the number of MR sampling points of each base station by SiteId, and keep only the base stations whose number of MR sampling points exceeds a set value (a cleaning sketch in Python follows these substeps);
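By way of illustration of substeps S21 to S23, the following Python sketch shows one possible cleaning pipeline. It is a minimal sketch, not the patent's implementation: the column names (CellId, SiteId, user_lon, RSRP, TA and so on), the 5 km distance cap, the RSRP range and the minimum sample count are assumptions, since the patent leaves these thresholds open.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6371000.0
TA_STEP_M = 78.12          # one LTE TA unit is roughly 78.12 m of distance
MIN_SAMPLES = 1000         # assumed value for the patent's "set value"

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two (lon, lat) point sets."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

def clean_mr(mr: pd.DataFrame, params: pd.DataFrame) -> pd.DataFrame:
    # S21: join MR records with engineering parameters on CellId,
    # then drop rows whose position coordinates are missing.
    df = mr.merge(params[["CellId", "SiteId", "cell_lon", "cell_lat"]],
                  on="CellId", how="inner")
    df = df.dropna(subset=["cell_lon", "cell_lat", "user_lon", "user_lat"])

    # S22: distance from each sampling point to its serving site; drop
    # far-away points and points whose distance contradicts the TA.
    df["dist_m"] = haversine_m(df["user_lon"], df["user_lat"],
                               df["cell_lon"], df["cell_lat"])
    df = df[df["dist_m"] < 5000]                                  # assumed cap
    df = df[(df["dist_m"] - df["TA"] * TA_STEP_M).abs() < 2 * TA_STEP_M]

    # S23: keep RSRP within a plausible range, then keep only sites
    # with enough sampling points.
    df = df[df["RSRP"].between(-120, -60)]                        # assumed range
    counts = df.groupby("SiteId")["SiteId"].transform("size")
    return df[counts >= MIN_SAMPLES]
```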
s3, feature extraction: for each base station (SiteId dimension), compute over all its MR sampling points the mean, variance and coefficient of variation of the local-cell and neighbor-cell RSRP and the correlation coefficient between local-cell RSRP and local-cell TA, and compute the same statistics separately for each value of local-cell TA; these values are the feature data of each base station. The co-site flag is the label of each base station; together the two form the new data set (see the feature-extraction sketch below);
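A per-SiteId aggregation such as the following pandas sketch matches the statistics named in step s3 (mean, variance, coefficient of variation, RSRP-TA correlation). The column names NCellRSRP and shared_flag and the single illustrative TA bin are assumptions; the patent computes the same statistics per TA value.

```python
def extract_features(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate per-SiteId statistics used as model features (a sketch;
    the per-TA breakdown is abbreviated to one illustrative TA bin)."""
    def stats(g: pd.DataFrame) -> pd.Series:
        out = {}
        for col in ("RSRP", "NCellRSRP"):
            out[f"{col}_mean"] = g[col].mean()
            out[f"{col}_var"] = g[col].var()
            # coefficient of variation = std / |mean| (RSRP means are negative)
            out[f"{col}_cv"] = g[col].std() / abs(g[col].mean())
        out["rsrp_ta_corr"] = g["RSRP"].corr(g["TA"])
        # the same statistic restricted to one TA bin, standing in for the
        # per-TA features that step s3 computes for every TA value
        near = g[g["TA"] <= 1]
        out["near_RSRP_mean"] = near["RSRP"].mean()
        return pd.Series(out)

    feats = df.groupby("SiteId").apply(stats)
    labels = df.groupby("SiteId")["shared_flag"].first()  # co-site label
    return feats.join(labels)
```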
s4, algorithm modeling: split the feature data into a training set and a test set in a given ratio, train classification models (random forest, GBDT and Xgboost) on the training set, and validate the trained models on the test set (a training sketch follows the random forest steps below);
the algorithm of the random forest comprises the following steps:
s411, use the bootstrap method to draw, with replacement, K new bootstrap sample sets from the training set and build K classification trees from them; the samples left out of each draw form the K out-of-bag data sets;
s412, at each node of each tree, randomly select m variables (m < M, where M is the total number of variables), compute the information content of each, and split the node on the variable with the strongest classification ability among the m;
s413, generating all decision trees completely without pruning;
s414, the class of a terminal node is the majority (mode) class of the samples falling in that node;
s415, each new observation point is classified by all the trees, and its class is decided by majority vote;
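The train/test split and the three candidate classifiers of step s4 could be set up as below. This is a sketch under the assumption that scikit-learn and the xgboost package stand in for the patent's unspecified implementations; the split ratio and hyperparameters are placeholders.

```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

def train_candidates(data: pd.DataFrame):
    X = data.drop(columns=["shared_flag"]).fillna(0.0)
    y = data["shared_flag"].astype(int)
    # s4: split the feature table into training and test sets (ratio assumed)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=42, stratify=y)
    models = {
        "random_forest": RandomForestClassifier(n_estimators=300),
        "gbdt": GradientBoostingClassifier(n_estimators=200),
        "xgboost": XGBClassifier(n_estimators=200, eval_metric="logloss"),
    }
    for m in models.values():
        m.fit(X_tr, y_tr)
    return models, (X_te, y_te)
```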
the algorithm of GBDT includes the following steps:
s421, initialize the estimates of all samples over the K classes; F_k(X) is a matrix that can be initialized to all zeros or set randomly;
s422, repeat the following learning and updating procedure M times;
s423, apply a logistic (softmax) transformation to the function estimates of the samples, converting each sample's estimates into probabilities of belonging to each class via the transformation:

$$P_k(X) = \frac{\exp\big(F_k(X)\big)}{\sum_{l=1}^{K} \exp\big(F_l(X)\big)}$$
initially every class estimate is 0, so every class has equal probability; as the estimates are updated below, the probabilities change accordingly;
s424, loop over the classes, computing the probabilities of the current class for all samples (the traversal is per class, not per sample);
s425, compute the probability gradient of each sample on class k. Above we obtained, via regression trees, the probability that each sample belongs to a given class k, along with whether it truly belongs to class k; learning proceeds by the usual gradient descent route of setting up a cost function and differentiating it. The log-likelihood form of the cost function is:

$$L = -\sum_{k=1}^{K} y_k \log P_k(X)$$

Differentiating the cost function yields the gradient (residual):

$$\tilde{y}_{ik} = y_{ik} - P_{k,m-1}(X_i)$$
s426, learn a regression tree with J leaf nodes along the gradient direction:

$$\{R_{jkm}\}_{j=1}^{J} = J\text{-leaf tree}\big(\{\tilde{y}_{ik}, X_i\}_{i=1}^{N}\big)$$

Taking all input samples $\{X_i\}_{i=1}^{N}$ and the residuals of their probabilities on class k as the update direction, a regression tree with J leaves is learned. The basic procedure is the usual one for regression trees: traverse the feature dimensions of the samples and select a feature as split point under the minimum mean-squared-error criterion; learning stops once J leaf nodes have been produced;
s427, the gain of each leaf node is calculated, and the gain calculation formula of each node is as follows:
$$\gamma_{jkm} = \frac{K-1}{K} \cdot \frac{\sum_{X_i \in R_{jkm}} \tilde{y}_{ik}}{\sum_{X_i \in R_{jkm}} |\tilde{y}_{ik}|\,\big(1 - |\tilde{y}_{ik}|\big)}$$
s428, updating the estimated values of all samples in class K, where the gain obtained in the previous step is calculated based on the gradient, and the estimated values of the samples can be updated by using the gain:
$$F_{km}(X) = F_{k,m-1}(X) + \sum_{j=1}^{J} \gamma_{jkm}\,\mathbf{1}\big(X \in R_{jkm}\big)$$
in class k of iteration m, the estimates F of all samples are obtained from those of the previous iteration m-1 by adding, for each sample, the gain of the leaf node it falls in (the J leaf gains summed against the indicator vector). After M such iterations, the final matrix of estimates of all samples over all classes is obtained, on the basis of which multi-class classification is realized;
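The logistic (softmax) transformation of s423 and the residual of s425 can be checked numerically with a few lines of Python; the toy sample labels below are invented for illustration only.

```python
import numpy as np

def softmax_rows(F):
    """Logistic (softmax) transform of s423: per-sample class probabilities."""
    e = np.exp(F - F.max(axis=1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=1, keepdims=True)

# s421: N samples, K classes, estimates initialised to zero
N, K = 4, 3
F = np.zeros((N, K))
P = softmax_rows(F)            # every class starts at probability 1/K
Y = np.eye(K)[[0, 1, 2, 0]]    # one-hot true classes of the 4 toy samples

# s425: gradient (residual) of the log-likelihood with respect to F_k;
# this is the target the next regression tree is fitted to
residual = Y - P
print(residual)
```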
the algorithm of Xgboost comprises the following steps:
s431, defining the complexity of the tree: firstly splitting a tree into a structure part q and a leaf node weight part w, wherein w is a vector and represents an output value in each leaf node;
$$f_t(x) = w_{q(x)}, \qquad w \in \mathbb{R}^{T}, \; q: \mathbb{R}^{d} \to \{1, \dots, T\}$$
a regularization term Ω(f_t) is introduced to control the complexity of the tree and thereby effectively control model overfitting;
$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}$$
s432, the boosting tree model in XGboost: like GBDT, the boosting model of XGboost fits residuals; the difference is that the criterion for choosing split nodes is not necessarily the minimum squared loss. The loss function is given below; compared with GBDT, a regularization term based on the complexity of the tree model is added:
$$\mathcal{L} = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k)$$

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t)$$
s433, rewriting the objective function: XGboost directly applies a second-order Taylor expansion to the loss function (this requires the loss function to have continuous first and second derivatives), and the set of samples falling in leaf j is defined as:

$$I_j = \{\, i \mid q(x_i) = j \,\}$$
our objective function can be converted into:
$$Obj^{(t)} = \sum_{j=1}^{T} \Big[ G_j w_j + \tfrac{1}{2}\,(H_j + \lambda)\, w_j^{2} \Big] + \gamma T, \qquad G_j = \sum_{i \in I_j} g_i, \; H_j = \sum_{i \in I_j} h_i$$
differentiating with respect to w_j and setting the derivative to zero gives:
$$w_j^{*} = -\frac{G_j}{H_j + \lambda}$$

$$Obj = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T$$
s434, scoring function of the tree structure: the Obj value above is the largest reduction of the objective attainable once a tree structure is fixed, so we may call it the structure score; it can be regarded as a more general analogue of the Gini index for scoring tree structures. To find the structure with the smallest Obj we could enumerate all possibilities and compare their structure scores, but that is computationally prohibitive; in practice a greedy method is used instead: at each step we try to split an existing leaf node (the first leaf being the root), and the gain of a split is:
$$Gain = \frac{1}{2}\left[ \frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda} - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda} \right] - \gamma$$
Gain is the criterion for deciding whether to split: if Gain < 0, the leaf node is not split. In principle every candidate split must still be enumerated; in practice all samples are first sorted by their gradient statistics g_i, so that one scan over the samples yields G_L and G_R for every split position, and the split is then made according to the Gain score;
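The structure score and split gain of s433-s434 reduce to a little arithmetic on the per-leaf sums G and H. The sketch below uses invented toy gradients and assumes squared loss (so every h_i = 1) to show the single sorted scan described above.

```python
import numpy as np

def leaf_weight(G, H, lam=1.0):
    """Optimal leaf output w* = -G / (H + lambda) from s433."""
    return -G / (H + lam)

def split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0):
    """Gain of s434 for splitting a node into left/right children."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR)
                  - score(GL + GR, HL + HR)) - gamma

# with squared loss, g_i = prediction - target and h_i = 1 per sample
g = np.array([-0.8, -0.5, 0.6, 0.9])   # toy per-sample gradients, sorted
h = np.ones_like(g)
# one scan over the sorted samples evaluates every split position
for cut in range(1, len(g)):
    print(cut, split_gain(g[:cut].sum(), h[:cut].sum(),
                          g[cut:].sum(), h[cut:].sum()))
```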
s5, model selection: train models on the training data with the random forest, GBDT and Xgboost algorithms respectively, tune the parameters of each algorithm until its best model is obtained, and then validate the trained models on the test set;
the verification in step S5 is to calculate the accuracy, recall, and F1 value of each model, and the calculation formula is as follows:
$$Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
wherein, TP is the number of positive classes judged as positive classes, FP is the number of negative classes judged as positive classes, FN is the number of positive classes judged as negative classes;
as the definitions of recall and precision show, improving one of the two tends, to some extent, to lower the other; the F1 score therefore gives a combined view of identification performance. Comparing the F1 values of the three models on the test set, the model with the largest F1 is selected as the final model, and its classification results are output (see the sketch below);
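Step s5's comparison could look like the following sketch, reusing the models dictionary from the training sketch above and scikit-learn's metric functions (an assumption; the patent only gives the formulas).

```python
from sklearn.metrics import precision_score, recall_score, f1_score

def pick_best(models, X_te, y_te):
    """s5: score every candidate on the test set and keep the best F1."""
    best_name, best_f1 = None, -1.0
    for name, model in models.items():
        pred = model.predict(X_te)
        p = precision_score(y_te, pred)
        r = recall_score(y_te, pred)
        f1 = f1_score(y_te, pred)   # f1 = 2*p*r / (p + r)
        print(f"{name}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
        if f1 > best_f1:
            best_name, best_f1 = name, f1
    return models[best_name]
```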
s6, model application: after the final model has been selected as above, save it; collect MR measurement report data and engineering parameter data, process them, classify the base stations with the saved model, and output the identification results for all base stations.
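A minimal sketch of step s6, assuming joblib for model persistence and reusing the hypothetical clean_mr and extract_features helpers sketched earlier:

```python
import joblib

def apply_model(best_model, new_mr, new_params):
    """s6 sketch: persist the chosen model, then classify fresh sites."""
    joblib.dump(best_model, "cosite_model.joblib")
    model = joblib.load("cosite_model.joblib")
    feats = extract_features(clean_mr(new_mr, new_params))
    X_new = feats.drop(columns=["shared_flag"], errors="ignore").fillna(0.0)
    feats["cosite_pred"] = model.predict(X_new)   # 1 = shared site (assumed)
    return feats[["cosite_pred"]]
```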
Finally, it should be noted that: first, in the description of the present application, unless otherwise specified and limited, the terms "mounted", "connected" and "coupled" should be understood broadly: the connection may be mechanical or electrical, or a communication between two elements, and may be direct; "upper", "lower", "left" and "right" only indicate relative positions, which may change when the absolute position of the described object changes;
secondly: in the drawings of the disclosed embodiments of the invention, only the structures related to the disclosed embodiments are shown; other structures may follow common designs, and in the absence of conflict the same embodiment and different embodiments of the invention may be combined with each other;
and finally: the above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that are within the spirit and principle of the present invention are intended to be included in the scope of the present invention.

Claims (6)

1. A base station common-site identification method based on big data is characterized by comprising the following steps:
s1, data collection: collect several days of wireless measurement report (MR) data and engineering parameter data; the main index variables used are: time, base station SiteId, local cell CellId, local cell TA, local cell RSRP, local cell frequency point, neighbor cell NCellId, neighbor cell frequency point, neighbor cell RSRP, user longitude, user latitude, local cell longitude, local cell latitude, and a co-site flag for the cell;
s2, data processing: process the MR data and engineering parameter data to obtain new data; from the new data, select the MR sampling points whose local-cell RSRP lies in a given range, count the number of MR sampling points of each base station by SiteId, and keep only the base stations whose number of MR sampling points exceeds a set value;
s3, feature extraction: for each base station (SiteId dimension), compute over all its MR sampling points the mean, variance and coefficient of variation of the local-cell and neighbor-cell RSRP and the correlation coefficient between local-cell RSRP and local-cell TA, and compute the same statistics separately for each value of local-cell TA; these values are the feature data of each base station. The co-site flag is the label of each base station; together the two form the new data set;
s4, algorithm modeling: split the feature data into a training set and a test set in a given ratio, train classification models (random forest, GBDT and Xgboost) on the training set, and validate the trained models on the test set;
s5, model selection: train models on the training data with the random forest, GBDT and Xgboost algorithms respectively, tune the parameters of each algorithm until its best model is obtained, and then validate the trained models on the test set;
s6, model application: after the final model has been selected as above, save it; collect MR measurement report data and engineering parameter data, process them, classify the base stations with the saved model, and output the identification results for all base stations.
2. The big-data-based base station co-site identification method according to claim 1, wherein said step S2 includes the following sub-steps:
s21, match the MR data with the engineering parameters by cell CellId to obtain, for each cell, its position coordinates (longitude and latitude) and the base station it belongs to, and delete matched records whose position coordinates are empty;
s22, from the data processed in step S21, compute the distance from each MR sampling point to the base station using the user's position coordinates (longitude and latitude) and the cell's position coordinates (longitude and latitude); delete MR sampling points that are far from the base station and sampling points whose distance is inconsistent with the TA, obtaining new data;
s23, from the data obtained in step S22, select the MR sampling points whose local-cell RSRP lies in a given range, count the number of MR sampling points of each base station by SiteId, and keep only the base stations whose number of MR sampling points exceeds a set value.
3. The big data-based base station co-site identification method according to claim 1, wherein the algorithm of the random forest in the step S4 includes the following steps:
s411, use the bootstrap method to draw, with replacement, K new bootstrap sample sets from the training set and build K classification trees from them; the samples left out of each draw form the K out-of-bag data sets;
s412, at each node of each tree, randomly select m variables (m < M, where M is the total number of variables), compute the information content of each, and split the node on the variable with the strongest classification ability among the m;
s413, generating all decision trees completely without pruning;
s414, the class of a terminal node is the majority (mode) class of the samples falling in that node;
s415, each new observation point is classified by all the trees, and its class is decided by majority vote.
4. The big data based co-sited site identification method of claim 1, wherein the algorithm of GBDT in step S4 includes the following steps:
s421, initialize the estimates of all samples over the K classes; F_k(X) is a matrix that can be initialized to all zeros or set randomly;
s422, repeat the following learning and updating procedure M times;
s423, apply a logistic (softmax) transformation to the function estimates of the samples, converting each sample's estimates into probabilities of belonging to each class via the transformation:

$$P_k(X) = \frac{\exp\big(F_k(X)\big)}{\sum_{l=1}^{K} \exp\big(F_l(X)\big)}$$
where P_k(X) denotes the probability that sample X belongs to class k, F_k(X) the matrix of estimates of sample X on class k, and F_l(X) the estimate of sample X on class l;
initially every class estimate is 0, so every class has equal probability; as the estimates are updated below, the probabilities change accordingly;
s424, loop over the classes, computing the probabilities of the current class for all samples (the traversal is per class, not per sample);
s425, compute the probability gradient of each sample on class k. Above we obtained, via regression trees, the probability that each sample belongs to a given class k, along with whether it truly belongs to class k; learning proceeds by the usual gradient descent route of setting up a cost function and differentiating it. The log-likelihood form of the cost function is:

$$L = -\sum_{k=1}^{K} y_k \log P_k(X)$$
where P_k(X) is as above and y_k denotes the probability that the sample truly belongs to class k;
differentiating the cost function yields the gradient (residual):

$$\tilde{y}_{ik} = y_{ik} - P_{k,m-1}(X_i)$$
where P_{k,m-1}(X_i) denotes the probability that sample X_i belongs to class k at iteration m-1, and P_{k,m-1}(X) the probability that a sample belongs to a given class at iteration m-1, the other symbols being consistent with the above;
s426, learn a regression tree with J leaf nodes along the gradient direction:

$$\{R_{jkm}\}_{j=1}^{J} = J\text{-leaf tree}\big(\{\tilde{y}_{ik}, X_i\}_{i=1}^{N}\big)$$

Taking all input samples $\{X_i\}_{i=1}^{N}$ and the residuals of their probabilities on class k as the update direction, a regression tree with J leaves is learned. The basic procedure is the usual one for regression trees: traverse the feature dimensions of the samples and select a feature as split point under the minimum mean-squared-error criterion; learning stops once J leaf nodes have been produced;
s427, the gain of each leaf node is calculated, and the gain calculation formula of each node is as follows:
$$\gamma_{jkm} = \frac{K-1}{K} \cdot \frac{\sum_{X_i \in R_{jkm}} \tilde{y}_{ik}}{\sum_{X_i \in R_{jkm}} |\tilde{y}_{ik}|\,\big(1 - |\tilde{y}_{ik}|\big)}$$
where y_{ik} denotes the probability of sample i on class k;
s428, updating the estimated values of all samples in class K, where the gain obtained in the previous step is calculated based on the gradient, and the estimated values of the samples can be updated by using the gain:
$$F_{km}(X) = F_{k,m-1}(X) + \sum_{j=1}^{J} \gamma_{jkm}\,\mathbf{1}\big(X \in R_{jkm}\big)$$
where F_{km}(X) denotes the estimate of the sample on class k at iteration m, F_{k,m-1}(X) the estimate on class k at iteration m-1, J the number of leaf nodes, and γ_{jkm} the gain of leaf node j on class k at iteration m, consistent with the expressions above;
in class k of iteration m, the estimates F of all samples are obtained from those of the previous iteration m-1 by adding, for each sample, the gain of the leaf node it falls in (the J leaf gains summed against the indicator vector). After M such iterations, the final matrix of estimates of all samples over all classes is obtained, on the basis of which multi-class classification is realized.
5. The big data based base station co-site identification method according to claim 1, wherein the algorithm of Xgboost in step S4 comprises the following steps:
s431, defining the complexity of the tree: firstly splitting a tree into a structure part q and a leaf node weight part w, wherein w is a vector and represents an output value in each leaf node;
$$f_t(x) = w_{q(x)}, \qquad w \in \mathbb{R}^{T}, \; q: \mathbb{R}^{d} \to \{1, \dots, T\}$$
where w is the vector of leaf weights, q the tree structure, x the input, f_t(x) the tree model applied to input x, and T the number of leaves;
a regularization term Ω(f_t) is introduced to control the complexity of the tree and thereby effectively control model overfitting;
$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}$$
where γ is a hyperparameter weighting the number of leaf nodes T, and λ is a hyperparameter weighting w_j^2, the squared output score of leaf node j;
s432, the boosting tree model in XGboost: like GBDT, the boosting model of XGboost fits residuals; the difference is that the criterion for choosing split nodes is not necessarily the minimum squared loss. The loss function is given below; compared with GBDT, a regularization term based on the complexity of the tree model is added:
$$\mathcal{L} = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k)$$

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t)$$
where $\mathcal{L}$ denotes the loss function of the model, i the i-th sample, $\hat{y}_i$ the estimate for the i-th sample, $y_i$ the true value of the i-th sample, $l(\hat{y}_i, y_i)$ the per-sample loss, which is 0 when $\hat{y}_i = y_i$ and 1 otherwise, K the number of trees, and $\Omega(f_k)$ the complexity of the k-th tree as in the formula above;
s433, rewriting the objective function: XGboost directly applies a second-order Taylor expansion to the loss function (this requires the loss function to have continuous first and second derivatives), and the set of samples falling in leaf j is defined as:

$$I_j = \{\, i \mid q(x_i) = j \,\}$$
where I_j is defined as the set of samples assigned to leaf j, i is the i-th sample, x_i the input of the i-th sample, j the j-th leaf node, and q(x_i) the structure function mapping x_i to its leaf;
our objective function can be converted into:
$$Obj^{(t)} = \sum_{j=1}^{T} \Big[ G_j w_j + \tfrac{1}{2}\,(H_j + \lambda)\, w_j^{2} \Big] + \gamma T, \qquad G_j = \sum_{i \in I_j} g_i, \; H_j = \sum_{i \in I_j} h_i$$
where t denotes the t-th tree, T the number of leaf nodes, j the j-th leaf node, I_j as defined above, i the i-th sample, w_j the output score of leaf node j, w_j^2 its square, and λ and γ the weight coefficients;
differentiating with respect to w_j and setting the derivative to zero gives:
$$w_j^{*} = -\frac{G_j}{H_j + \lambda}$$

$$Obj = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T$$

$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i$$

with all symbols consistent with the statements above;
s434, scoring function of the tree structure: the Obj value above is the largest reduction of the objective attainable once a tree structure is fixed, so we may call it the structure score; it can be regarded as a more general analogue of the Gini index for scoring tree structures. To find the structure with the smallest Obj we could enumerate all possibilities and compare their structure scores, but that is computationally prohibitive; in practice a greedy method is used instead: at each step we try to split an existing leaf node (the first leaf being the root), and the gain of a split is:
$$Gain = \frac{1}{2}\left[ \frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda} - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda} \right] - \gamma$$
Gain is the criterion for deciding whether to split: if Gain < 0, the leaf node is not split. In principle every candidate split must still be enumerated; in practice all samples are first sorted by their gradient statistics g_i, so that one scan over the samples yields G_L and G_R for every split position, and the split is then made according to the Gain score.
6. The big-data-based base station co-site identification method according to claim 1, wherein: the verification in step S5 consists in computing the precision, recall and F1 score of each model, with the following formulas:
$$Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
wherein, TP is the number of positive classes judged as positive classes, FP is the number of negative classes judged as positive classes, FN is the number of positive classes judged as negative classes;
from the definitions of recall and precision it can be seen that, to some extent, improving one of the two tends to lower the other; the F1 score therefore gives a combined view of identification performance. Comparing the F1 values of the three models on the test set, the model with the largest F1 is selected as the final model, and its classification results are output.
CN202110509326.7A 2021-05-11 2021-05-11 Base station co-site identification method based on big data Active CN112990382B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110509326.7A CN112990382B (en) 2021-05-11 2021-05-11 Base station co-site identification method based on big data


Publications (2)

Publication Number Publication Date
CN112990382A (en) 2021-06-18
CN112990382B CN112990382B (en) 2023-11-21

Family

ID=76337493

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110509326.7A Active CN112990382B (en) 2021-05-11 2021-05-11 Base station co-site identification method based on big data

Country Status (1)

Country Link
CN (1) CN112990382B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120120887A1 (en) * 2010-11-12 2012-05-17 Battelle Energy Alliance, Llc Systems, apparatuses, and methods to support dynamic spectrum access in wireless networks
CN103907368A (en) * 2011-12-27 2014-07-02 松下电器产业株式会社 Server device, base station device, and identification number establishment method
CN106131953A (en) * 2016-07-07 2016-11-16 上海奕行信息科技有限公司 A kind of method realizing mobile subscriber location based on frequency weighting in community in the period
CN109302714A (en) * 2018-12-07 2019-02-01 南京华苏科技有限公司 Realize that base station location is studied and judged and area covered knows method for distinguishing based on user data
CN112418445A (en) * 2020-11-09 2021-02-26 深圳市洪堡智慧餐饮科技有限公司 Intelligent site selection fusion method based on machine learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
T. Bandh et al., "Automatic Site Identification and Hardware-to-Site Mapping for Base Station Self-configuration", 2011 IEEE 73rd Vehicular Technology Conference (VTC Spring), 18 July 2011.
王旺 (Wang Wang), "基于机器学习的基站覆盖范围仿真" [Simulation of Base Station Coverage Based on Machine Learning], 《电脑与电信》 [Computer & Telecommunication], no. 11, 2018 (published 31 May 2019).

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114401527A (en) * 2021-12-21 2022-04-26 中国电信股份有限公司 Load identification method and device of wireless network and storage medium

Also Published As

Publication number Publication date
CN112990382B (en) 2023-11-21

Similar Documents

Publication Publication Date Title
CN104331635B (en) The method of power optical fiber Communication ray power prediction
CN109617888B (en) Abnormal flow detection method and system based on neural network
CN111148118A (en) Flow prediction and carrier turn-off method and system based on time sequence
CN112712209B (en) Reservoir warehousing flow prediction method and device, computer equipment and storage medium
CN112243249B (en) LTE new access anchor point cell parameter configuration method and device under 5G NSA networking
CN110009614A (en) Method and apparatus for output information
CN109978870A (en) Method and apparatus for output information
CN111523778A (en) Power grid operation safety assessment method based on particle swarm algorithm and gradient lifting tree
CN109787821B (en) Intelligent prediction method for large-scale mobile client traffic consumption
CN111586728B (en) Small sample characteristic-oriented heterogeneous wireless network fault detection and diagnosis method
CN113780345A (en) Small sample classification method and system facing small and medium-sized enterprises and based on tensor attention
CN112990382A (en) Base station common-site identification method based on big data
CN106570657A (en) Power grid evaluation index weight determining method
CN110290466A (en) Floor method of discrimination, device, equipment and computer storage medium
CN114169502A (en) Rainfall prediction method and device based on neural network and computer equipment
Qin et al. A wireless sensor network location algorithm based on insufficient fingerprint information
CN112541634B (en) Method and device for predicting top-layer oil temperature and discriminating false alarm and storage medium
CN113536944A (en) Distribution line inspection data identification and analysis method based on image identification
CN111343664B (en) User positioning method, device, equipment and medium
CN111738878A (en) Bridge stress detection system
CN116958806A (en) Pest identification model updating, pest identification method and device and electronic equipment
CN114066250A (en) Method, device, equipment and storage medium for measuring and calculating repair cost of power transmission project
CN115512174A (en) Anchor-frame-free target detection method applying secondary IoU loss function
CN113807462A (en) AI-based network equipment fault reason positioning method and system
CN112163613A (en) Rapid identification method for power quality disturbance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant