CN112990382A - Base station common-site identification method based on big data - Google Patents

Base station common-site identification method based on big data

Info

Publication number
CN112990382A
CN112990382A (application CN202110509326.7A)
Authority
CN
China
Prior art keywords
data
base station
sample
cell
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110509326.7A
Other languages
Chinese (zh)
Other versions
CN112990382B (en)
Inventor
寇红侠 (Kou Hongxia)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Orange Frame Technology Jiangsu Co ltd
Original Assignee
Orange Frame Technology Jiangsu Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Orange Frame Technology Jiangsu Co ltd filed Critical Orange Frame Technology Jiangsu Co ltd
Priority to CN202110509326.7A priority Critical patent/CN112990382B/en
Publication of CN112990382A publication Critical patent/CN112990382A/en
Application granted granted Critical
Publication of CN112990382B publication Critical patent/CN112990382B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W24/00 Supervisory, monitoring or testing arrangements
    • H04W24/10 Scheduling measurement reports; Arrangements for measurement reports
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention discloses a big-data-based method for identifying base stations that share a site, belonging to the field of base station co-site identification. The method comprises the following steps: S1, data collection: collect several days of wireless measurement report (MR) data and engineering parameter data; the main index variables used are time, base station SiteId, local cell CellId, local cell TA, local cell RSRP, local cell frequency point, neighbor cell NCellId, neighbor cell frequency point, neighbor cell RSRP, user longitude, user latitude, local cell longitude, local cell latitude, and a co-site flag for the cell. The invention cleans the data measured on different network frequency bands in the wireless environment measurement reports (MR) and then classifies co-sited stations with a machine learning method. This overcomes the influence of inaccurate site information in the resource management system, accurately identifies whether a base station is a shared one, provides strong support for operators in implementing site sharing, and constitutes a scientific, effective and low-cost solution.

Description

Base station common-site identification method based on big data
Technical Field
The invention relates to the technical field of base station co-site identification, in particular to a base station co-site identification method based on big data.
Background
The wireless environment measurement report (MR) data in a mobile communication network accurately reflect network coverage and give operators a good tool for understanding their wireless networks; good network coverage is fundamental to an operator's survival. However, as mobile networks evolve further, in particular from 4G to 5G, the wavelengths of the frequency bands used become ever shorter, multiplying the number of sites that must be built. According to incomplete statistics, China already has more than 4 million 4G sites, and the number of 5G sites will be more than 3 times that, directly driving up operators' total investment cost.
Site sharing is a good cost-optimization strategy. The three operators (China Mobile, China Telecom and China Unicom) have jointly established a tower company, China Tower, which builds base station sites and leases them to the three operators, who pay according to usage. For historical reasons, however, each operator also owns a large number of its own sites, and existing sites cannot be reliably separated into self-owned and shared ones, which in turn affects how China Tower's site costs are apportioned. Today sites are distinguished mainly by their longitude and latitude, but because the basic site information in the operators' resource management systems often disagrees badly with the situation in the field (mainly because the systems are not updated in time after a site is relocated), this classification is inaccurate.
Disclosure of Invention
In order to overcome the above defects in the prior art, embodiments of the present invention provide a big-data-based method for identifying co-sited base stations, which cleans the data measured on different network frequency bands in wireless environment measurement reports (MR) and then classifies co-sited stations with a machine learning method, so as to solve the problems described in the background.
In order to achieve this purpose, the invention provides the following technical scheme: a big-data-based base station co-site identification method comprising the following steps:
s1, data collection: collect several days of wireless measurement report (MR) data and engineering parameter data; the main index variables used are: time, base station SiteId, local cell CellId, local cell TA, local cell RSRP, local cell frequency point, neighbor cell NCellId, neighbor cell frequency point, neighbor cell RSRP, user longitude, user latitude, local cell longitude, local cell latitude, and a co-site flag for the cell;
s2, data processing: process the MR data and engineering parameter data to obtain new data; from the new data, select the MR sampling points whose local-cell RSRP lies in a given range, count the number of MR sampling points of each base station by SiteId, and keep only the base stations whose number of MR sampling points exceeds a set value;
s3, feature extraction: for each base station (SiteId dimension), compute over all its MR sampling points the mean, variance and coefficient of variation of the local-cell and neighbor-cell RSRP and the correlation coefficient between local-cell RSRP and local-cell TA, and compute the same statistics separately for each value of local-cell TA; these values are the feature data of each base station. The co-site flag is the label of each base station; together the two form the new data set;
s4, algorithm modeling: split the feature data into a training set and a test set in a given ratio, train classification models (random forest, GBDT and Xgboost) on the training set, and validate the trained models on the test set;
s5, model selection: train models on the training data with the random forest, GBDT and Xgboost algorithms respectively, tune the parameters of each algorithm until its best model is obtained, and then validate the trained models on the test set;
s6, model application: after the final model has been selected as above, save it; collect MR measurement report data and engineering parameter data, process them, classify the base stations with the saved model, and output the identification results for all base stations.
Further, the step S2 includes the following sub-steps:
s21, match the MR data with the engineering parameters by cell CellId to obtain, for each cell, its position coordinates (longitude and latitude) and the base station it belongs to, and delete matched records whose position coordinates are empty;
s22, from the data processed in step S21, compute the distance from each MR sampling point to the base station using the user's position coordinates (longitude and latitude) and the cell's position coordinates (longitude and latitude); delete MR sampling points that are far from the base station and sampling points whose distance is inconsistent with the TA, obtaining new data;
s23, from the data obtained in step S22, select the MR sampling points whose local-cell RSRP lies in a given range, count the number of MR sampling points of each base station by SiteId, and keep only the base stations whose number of MR sampling points exceeds a set value.
Further, the algorithm of the random forest in the step S4 includes the following steps:
s411, use the bootstrap method to draw, with replacement, K new bootstrap sample sets from the training set and build K classification trees from them; the samples left out of each draw form the K out-of-bag data sets;
s412, at each node of each tree, randomly select m variables (m < M, where M is the total number of variables), compute the information content of each, and split the node on the variable with the strongest classification ability among the m;
s413, generating all decision trees completely without pruning;
s414, the class of a terminal node is the majority (mode) class of the samples falling in that node;
s415, each new observation point is classified by all the trees, and its class is decided by majority vote.
Further, the algorithm of GBDT in step S4 includes the following steps:
s421, initialize the estimates of all samples over the K classes; F_k(X) is a matrix that can be initialized to all zeros or set randomly;
s422, repeat the following learning and updating procedure M times;
s423, apply a logistic (softmax) transformation to the function estimates of the samples, converting each sample's estimates into probabilities of belonging to each class via the transformation:

$$P_k(X) = \frac{\exp\big(F_k(X)\big)}{\sum_{l=1}^{K} \exp\big(F_l(X)\big)}$$
initially every class estimate is 0, so every class has equal probability; as the estimates are updated below, the probabilities change accordingly;
s424, loop over the classes, computing the probabilities of the current class for all samples (the traversal is per class, not per sample);
s425, compute the probability gradient of each sample on class k. Above we obtained, via regression trees, the probability that each sample belongs to a given class k, along with whether it truly belongs to class k; learning proceeds by the usual gradient descent route of setting up a cost function and differentiating it. The log-likelihood form of the cost function is:

$$L = -\sum_{k=1}^{K} y_k \log P_k(X)$$

Differentiating the cost function yields the gradient (residual):

$$\tilde{y}_{ik} = y_{ik} - P_{k,m-1}(X_i)$$
s426, learn a regression tree with J leaf nodes along the gradient direction:

$$\{R_{jkm}\}_{j=1}^{J} = J\text{-leaf tree}\big(\{\tilde{y}_{ik}, X_i\}_{i=1}^{N}\big)$$

Taking all input samples $\{X_i\}_{i=1}^{N}$ and the residuals of their probabilities on class k as the update direction, a regression tree with J leaves is learned. The basic procedure is the usual one for regression trees: traverse the feature dimensions of the samples and select a feature as split point under the minimum mean-squared-error criterion; learning stops once J leaf nodes have been produced;
s427, the gain of each leaf node is calculated, and the gain calculation formula of each node is as follows:
$$\gamma_{jkm} = \frac{K-1}{K} \cdot \frac{\sum_{X_i \in R_{jkm}} \tilde{y}_{ik}}{\sum_{X_i \in R_{jkm}} |\tilde{y}_{ik}|\,\big(1 - |\tilde{y}_{ik}|\big)}$$
s428, updating the estimated values of all samples in class K, where the gain obtained in the previous step is calculated based on the gradient, and the estimated values of the samples can be updated by using the gain:
$$F_{km}(X) = F_{k,m-1}(X) + \sum_{j=1}^{J} \gamma_{jkm}\,\mathbf{1}\big(X \in R_{jkm}\big)$$
in class k of iteration m, the estimates F of all samples are obtained from those of the previous iteration m-1 by adding, for each sample, the gain of the leaf node it falls in (the J leaf gains summed against the indicator vector). After M such iterations, the final matrix of estimates of all samples over all classes is obtained, on the basis of which multi-class classification is realized.
Further, the algorithm of Xgboost in step S4 includes the following steps:
s431, defining the complexity of the tree: firstly splitting a tree into a structure part q and a leaf node weight part w, wherein w is a vector and represents an output value in each leaf node;
$$f_t(x) = w_{q(x)}, \qquad w \in \mathbb{R}^{T}, \; q: \mathbb{R}^{d} \to \{1, \dots, T\}$$
a regularization term Ω(f_t) is introduced to control the complexity of the tree and thereby effectively control model overfitting;
$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}$$
s432, the boosting tree model in XGboost: like GBDT, the boosting model of XGboost fits residuals; the difference is that the criterion for choosing split nodes is not necessarily the minimum squared loss. The loss function is given below; compared with GBDT, a regularization term based on the complexity of the tree model is added:
$$\mathcal{L} = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k)$$

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t)$$
s433, rewriting the objective function: XGboost directly applies a second-order Taylor expansion to the loss function (this requires the loss function to have continuous first and second derivatives), and the set of samples falling in leaf j is defined as:

$$I_j = \{\, i \mid q(x_i) = j \,\}$$
our objective function can be converted into:
$$Obj^{(t)} = \sum_{j=1}^{T} \Big[ G_j w_j + \tfrac{1}{2}\,(H_j + \lambda)\, w_j^{2} \Big] + \gamma T, \qquad G_j = \sum_{i \in I_j} g_i, \; H_j = \sum_{i \in I_j} h_i$$
differentiating with respect to w_j and setting the derivative to zero gives:
$$w_j^{*} = -\frac{G_j}{H_j + \lambda}$$

$$Obj = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T$$
s434, scoring function of the tree structure: the Obj value above is the largest reduction of the objective attainable once a tree structure is fixed, so we may call it the structure score; it can be regarded as a more general analogue of the Gini index for scoring tree structures. To find the structure with the smallest Obj we could enumerate all possibilities and compare their structure scores, but that is computationally prohibitive; in practice a greedy method is used instead: at each step we try to split an existing leaf node (the first leaf being the root), and the gain of a split is:
$$Gain = \frac{1}{2}\left[ \frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda} - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda} \right] - \gamma$$
Gain is the criterion for deciding whether to split: if Gain < 0, the leaf node is not split. In principle every candidate split must still be enumerated; in practice all samples are first sorted by their gradient statistics g_i, so that one scan over the samples yields G_L and G_R for every split position, and the split is then made according to the Gain score.
Further, the verification in step S5 consists in computing the precision, recall and F1 score of each model, with the following formulas:
$$Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
wherein, TP is the number of positive classes judged as positive classes, FP is the number of negative classes judged as positive classes, FN is the number of positive classes judged as negative classes;
as the definitions of recall and precision show, improving one of the two tends, to some extent, to lower the other; the F1 score therefore gives a combined view of identification performance. Comparing the F1 values of the three models on the test set, the model with the largest F1 is selected as the final model, and its classification results are output.
The invention has the technical effects and advantages that:
compared with the prior art, the method and the device have the advantages that the data measured by the different network frequency band signals in the wireless environment measurement report MR are cleaned, and then the machine learning method is adopted to realize the classification of the common-site sites. The verification proves that the method successfully overcomes the influence of inaccurate station information in a resource management system, can accurately identify whether the base station is a shared base station, provides powerful support for landing of co-station sharing of operators, and is a scientific, effective and low-cost solution.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in FIG. 1, a big-data-based base station co-site identification method includes the following steps:
s1, data collection: collect several days of wireless measurement report (MR) data and engineering parameter data; the main index variables used are: time, base station SiteId, local cell CellId, local cell TA, local cell RSRP, local cell frequency point, neighbor cell NCellId, neighbor cell frequency point, neighbor cell RSRP, user longitude, user latitude, local cell longitude, local cell latitude, and a co-site flag for the cell;
s2, data processing: process the MR data and engineering parameter data to obtain new data; from the new data, select the MR sampling points whose local-cell RSRP lies in a given range, count the number of MR sampling points of each base station by SiteId, and keep only the base stations whose number of MR sampling points exceeds a set value;
step S2 includes the following substeps:
s21, match the MR data with the engineering parameters by cell CellId to obtain, for each cell, its position coordinates (longitude and latitude) and the base station it belongs to, and delete matched records whose position coordinates are empty;
s22, from the data processed in step S21, compute the distance from each MR sampling point to the base station using the user's position coordinates (longitude and latitude) and the cell's position coordinates (longitude and latitude); delete MR sampling points that are far from the base station and sampling points whose distance is inconsistent with the TA, obtaining new data;
s23, from the data obtained in step S22, select the MR sampling points whose local-cell RSRP lies in a given range, count the number of MR sampling points of each base station by SiteId, and keep only the base stations whose number of MR sampling points exceeds a set value (a cleaning sketch in Python follows these substeps);
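By way of illustration of substeps S21 to S23, the following Python sketch shows one possible cleaning pipeline. It is a minimal sketch, not the patent's implementation: the column names (CellId, SiteId, user_lon, RSRP, TA and so on), the 5 km distance cap, the RSRP range and the minimum sample count are assumptions, since the patent leaves these thresholds open.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6371000.0
TA_STEP_M = 78.12          # one LTE TA unit is roughly 78.12 m of distance
MIN_SAMPLES = 1000         # assumed value for the patent's "set value"

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two (lon, lat) point sets."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

def clean_mr(mr: pd.DataFrame, params: pd.DataFrame) -> pd.DataFrame:
    # S21: join MR records with engineering parameters on CellId,
    # then drop rows whose position coordinates are missing.
    df = mr.merge(params[["CellId", "SiteId", "cell_lon", "cell_lat"]],
                  on="CellId", how="inner")
    df = df.dropna(subset=["cell_lon", "cell_lat", "user_lon", "user_lat"])

    # S22: distance from each sampling point to its serving site; drop
    # far-away points and points whose distance contradicts the TA.
    df["dist_m"] = haversine_m(df["user_lon"], df["user_lat"],
                               df["cell_lon"], df["cell_lat"])
    df = df[df["dist_m"] < 5000]                                  # assumed cap
    df = df[(df["dist_m"] - df["TA"] * TA_STEP_M).abs() < 2 * TA_STEP_M]

    # S23: keep RSRP within a plausible range, then keep only sites
    # with enough sampling points.
    df = df[df["RSRP"].between(-120, -60)]                        # assumed range
    counts = df.groupby("SiteId")["SiteId"].transform("size")
    return df[counts >= MIN_SAMPLES]
```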
s3, feature extraction: for each base station (SiteId dimension), compute over all its MR sampling points the mean, variance and coefficient of variation of the local-cell and neighbor-cell RSRP and the correlation coefficient between local-cell RSRP and local-cell TA, and compute the same statistics separately for each value of local-cell TA; these values are the feature data of each base station. The co-site flag is the label of each base station; together the two form the new data set (see the feature-extraction sketch below);
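A per-SiteId aggregation such as the following pandas sketch matches the statistics named in step s3 (mean, variance, coefficient of variation, RSRP-TA correlation). The column names NCellRSRP and shared_flag and the single illustrative TA bin are assumptions; the patent computes the same statistics per TA value.

```python
def extract_features(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate per-SiteId statistics used as model features (a sketch;
    the per-TA breakdown is abbreviated to one illustrative TA bin)."""
    def stats(g: pd.DataFrame) -> pd.Series:
        out = {}
        for col in ("RSRP", "NCellRSRP"):
            out[f"{col}_mean"] = g[col].mean()
            out[f"{col}_var"] = g[col].var()
            # coefficient of variation = std / |mean| (RSRP means are negative)
            out[f"{col}_cv"] = g[col].std() / abs(g[col].mean())
        out["rsrp_ta_corr"] = g["RSRP"].corr(g["TA"])
        # the same statistic restricted to one TA bin, standing in for the
        # per-TA features that step s3 computes for every TA value
        near = g[g["TA"] <= 1]
        out["near_RSRP_mean"] = near["RSRP"].mean()
        return pd.Series(out)

    feats = df.groupby("SiteId").apply(stats)
    labels = df.groupby("SiteId")["shared_flag"].first()  # co-site label
    return feats.join(labels)
```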
s4, algorithm modeling: split the feature data into a training set and a test set in a given ratio, train classification models (random forest, GBDT and Xgboost) on the training set, and validate the trained models on the test set (a training sketch follows the random forest steps below);
the algorithm of the random forest comprises the following steps:
s411, use the bootstrap method to draw, with replacement, K new bootstrap sample sets from the training set and build K classification trees from them; the samples left out of each draw form the K out-of-bag data sets;
s412, at each node of each tree, randomly select m variables (m < M, where M is the total number of variables), compute the information content of each, and split the node on the variable with the strongest classification ability among the m;
s413, generating all decision trees completely without pruning;
s414, the class of a terminal node is the majority (mode) class of the samples falling in that node;
s415, each new observation point is classified by all the trees, and its class is decided by majority vote;
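The train/test split and the three candidate classifiers of step s4 could be set up as below. This is a sketch under the assumption that scikit-learn and the xgboost package stand in for the patent's unspecified implementations; the split ratio and hyperparameters are placeholders.

```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

def train_candidates(data: pd.DataFrame):
    X = data.drop(columns=["shared_flag"]).fillna(0.0)
    y = data["shared_flag"].astype(int)
    # s4: split the feature table into training and test sets (ratio assumed)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=42, stratify=y)
    models = {
        "random_forest": RandomForestClassifier(n_estimators=300),
        "gbdt": GradientBoostingClassifier(n_estimators=200),
        "xgboost": XGBClassifier(n_estimators=200, eval_metric="logloss"),
    }
    for m in models.values():
        m.fit(X_tr, y_tr)
    return models, (X_te, y_te)
```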
the algorithm of GBDT includes the following steps:
s421, initialize the estimates of all samples over the K classes; F_k(X) is a matrix that can be initialized to all zeros or set randomly;
s422, repeat the following learning and updating procedure M times;
s423, apply a logistic (softmax) transformation to the function estimates of the samples, converting each sample's estimates into probabilities of belonging to each class via the transformation:

$$P_k(X) = \frac{\exp\big(F_k(X)\big)}{\sum_{l=1}^{K} \exp\big(F_l(X)\big)}$$
initially every class estimate is 0, so every class has equal probability; as the estimates are updated below, the probabilities change accordingly;
s424, loop over the classes, computing the probabilities of the current class for all samples (the traversal is per class, not per sample);
s425, compute the probability gradient of each sample on class k. Above we obtained, via regression trees, the probability that each sample belongs to a given class k, along with whether it truly belongs to class k; learning proceeds by the usual gradient descent route of setting up a cost function and differentiating it. The log-likelihood form of the cost function is:

$$L = -\sum_{k=1}^{K} y_k \log P_k(X)$$

Differentiating the cost function yields the gradient (residual):

$$\tilde{y}_{ik} = y_{ik} - P_{k,m-1}(X_i)$$
s426, learn a regression tree with J leaf nodes along the gradient direction:

$$\{R_{jkm}\}_{j=1}^{J} = J\text{-leaf tree}\big(\{\tilde{y}_{ik}, X_i\}_{i=1}^{N}\big)$$

Taking all input samples $\{X_i\}_{i=1}^{N}$ and the residuals of their probabilities on class k as the update direction, a regression tree with J leaves is learned. The basic procedure is the usual one for regression trees: traverse the feature dimensions of the samples and select a feature as split point under the minimum mean-squared-error criterion; learning stops once J leaf nodes have been produced;
s427, the gain of each leaf node is calculated, and the gain calculation formula of each node is as follows:
$$\gamma_{jkm} = \frac{K-1}{K} \cdot \frac{\sum_{X_i \in R_{jkm}} \tilde{y}_{ik}}{\sum_{X_i \in R_{jkm}} |\tilde{y}_{ik}|\,\big(1 - |\tilde{y}_{ik}|\big)}$$
s428, updating the estimated values of all samples in class K, where the gain obtained in the previous step is calculated based on the gradient, and the estimated values of the samples can be updated by using the gain:
$$F_{km}(X) = F_{k,m-1}(X) + \sum_{j=1}^{J} \gamma_{jkm}\,\mathbf{1}\big(X \in R_{jkm}\big)$$
in class k of iteration m, the estimates F of all samples are obtained from those of the previous iteration m-1 by adding, for each sample, the gain of the leaf node it falls in (the J leaf gains summed against the indicator vector). After M such iterations, the final matrix of estimates of all samples over all classes is obtained, on the basis of which multi-class classification is realized;
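The logistic (softmax) transformation of s423 and the residual of s425 can be checked numerically with a few lines of Python; the toy sample labels below are invented for illustration only.

```python
import numpy as np

def softmax_rows(F):
    """Logistic (softmax) transform of s423: per-sample class probabilities."""
    e = np.exp(F - F.max(axis=1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=1, keepdims=True)

# s421: N samples, K classes, estimates initialised to zero
N, K = 4, 3
F = np.zeros((N, K))
P = softmax_rows(F)            # every class starts at probability 1/K
Y = np.eye(K)[[0, 1, 2, 0]]    # one-hot true classes of the 4 toy samples

# s425: gradient (residual) of the log-likelihood with respect to F_k;
# this is the target the next regression tree is fitted to
residual = Y - P
print(residual)
```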
the algorithm of Xgboost comprises the following steps:
s431, defining the complexity of the tree: firstly splitting a tree into a structure part q and a leaf node weight part w, wherein w is a vector and represents an output value in each leaf node;
$$f_t(x) = w_{q(x)}, \qquad w \in \mathbb{R}^{T}, \; q: \mathbb{R}^{d} \to \{1, \dots, T\}$$
a regularization term Ω(f_t) is introduced to control the complexity of the tree and thereby effectively control model overfitting;
$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}$$
s432, the boosting tree model in XGboost: like GBDT, the boosting model of XGboost fits residuals; the difference is that the criterion for choosing split nodes is not necessarily the minimum squared loss. The loss function is given below; compared with GBDT, a regularization term based on the complexity of the tree model is added:
$$\mathcal{L} = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k)$$

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t)$$
s433, rewriting the objective function: XGboost directly applies a second-order Taylor expansion to the loss function (this requires the loss function to have continuous first and second derivatives), and the set of samples falling in leaf j is defined as:

$$I_j = \{\, i \mid q(x_i) = j \,\}$$
our objective function can be converted into:
$$Obj^{(t)} = \sum_{j=1}^{T} \Big[ G_j w_j + \tfrac{1}{2}\,(H_j + \lambda)\, w_j^{2} \Big] + \gamma T, \qquad G_j = \sum_{i \in I_j} g_i, \; H_j = \sum_{i \in I_j} h_i$$
differentiating with respect to w_j and setting the derivative to zero gives:
$$w_j^{*} = -\frac{G_j}{H_j + \lambda}$$

$$Obj = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T$$
s434, scoring function of the tree structure: the Obj value above is the largest reduction of the objective attainable once a tree structure is fixed, so we may call it the structure score; it can be regarded as a more general analogue of the Gini index for scoring tree structures. To find the structure with the smallest Obj we could enumerate all possibilities and compare their structure scores, but that is computationally prohibitive; in practice a greedy method is used instead: at each step we try to split an existing leaf node (the first leaf being the root), and the gain of a split is:
$$Gain = \frac{1}{2}\left[ \frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda} - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda} \right] - \gamma$$
Gain is the criterion for deciding whether to split: if Gain < 0, the leaf node is not split. In principle every candidate split must still be enumerated; in practice all samples are first sorted by their gradient statistics g_i, so that one scan over the samples yields G_L and G_R for every split position, and the split is then made according to the Gain score;
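The structure score and split gain of s433-s434 reduce to a little arithmetic on the per-leaf sums G and H. The sketch below uses invented toy gradients and assumes squared loss (so every h_i = 1) to show the single sorted scan described above.

```python
import numpy as np

def leaf_weight(G, H, lam=1.0):
    """Optimal leaf output w* = -G / (H + lambda) from s433."""
    return -G / (H + lam)

def split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0):
    """Gain of s434 for splitting a node into left/right children."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR)
                  - score(GL + GR, HL + HR)) - gamma

# with squared loss, g_i = prediction - target and h_i = 1 per sample
g = np.array([-0.8, -0.5, 0.6, 0.9])   # toy per-sample gradients, sorted
h = np.ones_like(g)
# one scan over the sorted samples evaluates every split position
for cut in range(1, len(g)):
    print(cut, split_gain(g[:cut].sum(), h[:cut].sum(),
                          g[cut:].sum(), h[cut:].sum()))
```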
s5, model selection: train models on the training data with the random forest, GBDT and Xgboost algorithms respectively, tune the parameters of each algorithm until its best model is obtained, and then validate the trained models on the test set;
the verification in step S5 is to calculate the accuracy, recall, and F1 value of each model, and the calculation formula is as follows:
$$Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
wherein, TP is the number of positive classes judged as positive classes, FP is the number of negative classes judged as positive classes, FN is the number of positive classes judged as negative classes;
as the definitions of recall and precision show, improving one of the two tends, to some extent, to lower the other; the F1 score therefore gives a combined view of identification performance. Comparing the F1 values of the three models on the test set, the model with the largest F1 is selected as the final model, and its classification results are output (see the sketch below);
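Step s5's comparison could look like the following sketch, reusing the models dictionary from the training sketch above and scikit-learn's metric functions (an assumption; the patent only gives the formulas).

```python
from sklearn.metrics import precision_score, recall_score, f1_score

def pick_best(models, X_te, y_te):
    """s5: score every candidate on the test set and keep the best F1."""
    best_name, best_f1 = None, -1.0
    for name, model in models.items():
        pred = model.predict(X_te)
        p = precision_score(y_te, pred)
        r = recall_score(y_te, pred)
        f1 = f1_score(y_te, pred)   # f1 = 2*p*r / (p + r)
        print(f"{name}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
        if f1 > best_f1:
            best_name, best_f1 = name, f1
    return models[best_name]
```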
s6, model application: after the final model has been selected as above, save it; collect MR measurement report data and engineering parameter data, process them, classify the base stations with the saved model, and output the identification results for all base stations.
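A minimal sketch of step s6, assuming joblib for model persistence and reusing the hypothetical clean_mr and extract_features helpers sketched earlier:

```python
import joblib

def apply_model(best_model, new_mr, new_params):
    """s6 sketch: persist the chosen model, then classify fresh sites."""
    joblib.dump(best_model, "cosite_model.joblib")
    model = joblib.load("cosite_model.joblib")
    feats = extract_features(clean_mr(new_mr, new_params))
    X_new = feats.drop(columns=["shared_flag"], errors="ignore").fillna(0.0)
    feats["cosite_pred"] = model.predict(X_new)   # 1 = shared site (assumed)
    return feats[["cosite_pred"]]
```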
Finally, it should be noted that: first, in the description of the present application, unless otherwise specified and limited, the terms "mounted", "connected" and "coupled" should be understood broadly: the connection may be mechanical or electrical, or a communication between two elements, and may be direct; "upper", "lower", "left" and "right" only indicate relative positions, which may change when the absolute position of the described object changes;
secondly: in the drawings of the disclosed embodiments of the invention, only the structures related to the disclosed embodiments are shown; other structures may follow common designs, and in the absence of conflict the same embodiment and different embodiments of the invention may be combined with each other;
and finally: the above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that are within the spirit and principle of the present invention are intended to be included in the scope of the present invention.

Claims (6)

1. A base station common-site identification method based on big data is characterized by comprising the following steps:
s1, data collection: collect several days of wireless measurement report (MR) data and engineering parameter data; the main index variables used are: time, base station SiteId, local cell CellId, local cell TA, local cell RSRP, local cell frequency point, neighbor cell NCellId, neighbor cell frequency point, neighbor cell RSRP, user longitude, user latitude, local cell longitude, local cell latitude, and a co-site flag for the cell;
s2, data processing: process the MR data and engineering parameter data to obtain new data; from the new data, select the MR sampling points whose local-cell RSRP lies in a given range, count the number of MR sampling points of each base station by SiteId, and keep only the base stations whose number of MR sampling points exceeds a set value;
s3, feature extraction: for each base station (SiteId dimension), compute over all its MR sampling points the mean, variance and coefficient of variation of the local-cell and neighbor-cell RSRP and the correlation coefficient between local-cell RSRP and local-cell TA, and compute the same statistics separately for each value of local-cell TA; these values are the feature data of each base station. The co-site flag is the label of each base station; together the two form the new data set;
s4, algorithm modeling: split the feature data into a training set and a test set in a given ratio, train classification models (random forest, GBDT and Xgboost) on the training set, and validate the trained models on the test set;
s5, model selection: train models on the training data with the random forest, GBDT and Xgboost algorithms respectively, tune the parameters of each algorithm until its best model is obtained, and then validate the trained models on the test set;
s6, model application: after the final model has been selected as above, save it; collect MR measurement report data and engineering parameter data, process them, classify the base stations with the saved model, and output the identification results for all base stations.
2. The big-data-based base station co-site identification method according to claim 1, wherein said step S2 includes the following sub-steps:
s21, match the MR data with the engineering parameters by cell CellId to obtain, for each cell, its position coordinates (longitude and latitude) and the base station it belongs to, and delete matched records whose position coordinates are empty;
s22, from the data processed in step S21, compute the distance from each MR sampling point to the base station using the user's position coordinates (longitude and latitude) and the cell's position coordinates (longitude and latitude); delete MR sampling points that are far from the base station and sampling points whose distance is inconsistent with the TA, obtaining new data;
s23, from the data obtained in step S22, select the MR sampling points whose local-cell RSRP lies in a given range, count the number of MR sampling points of each base station by SiteId, and keep only the base stations whose number of MR sampling points exceeds a set value.
3. The big data-based base station co-site identification method according to claim 1, wherein the algorithm of the random forest in the step S4 includes the following steps:
s411, use the bootstrap method to draw, with replacement, K new bootstrap sample sets from the training set and build K classification trees from them; the samples left out of each draw form the K out-of-bag data sets;
s412, at each node of each tree, randomly select m variables (m < M, where M is the total number of variables), compute the information content of each, and split the node on the variable with the strongest classification ability among the m;
s413, generating all decision trees completely without pruning;
s414, the class of a terminal node is the majority (mode) class of the samples falling in that node;
s415, each new observation point is classified by all the trees, and its class is decided by majority vote.
4. The big data based co-sited site identification method of claim 1, wherein the algorithm of GBDT in step S4 includes the following steps:
s421, initialize the estimates of all samples over the K classes; F_k(X) is a matrix that can be initialized to all zeros or set randomly;
s422, repeat the following learning and updating procedure M times;
s423, apply a logistic (softmax) transformation to the function estimates of the samples, converting each sample's estimates into probabilities of belonging to each class via the transformation:

$$P_k(X) = \frac{\exp\big(F_k(X)\big)}{\sum_{l=1}^{K} \exp\big(F_l(X)\big)}$$
where P_k(X) denotes the probability that sample X belongs to class k, F_k(X) the matrix of estimates of sample X on class k, and F_l(X) the estimate of sample X on class l;
initially every class estimate is 0, so every class has equal probability; as the estimates are updated below, the probabilities change accordingly;
s424, loop over the classes, computing the probabilities of the current class for all samples (the traversal is per class, not per sample);
s425, compute the probability gradient of each sample on class k. Above we obtained, via regression trees, the probability that each sample belongs to a given class k, along with whether it truly belongs to class k; learning proceeds by the usual gradient descent route of setting up a cost function and differentiating it. The log-likelihood form of the cost function is:

$$L = -\sum_{k=1}^{K} y_k \log P_k(X)$$
where P_k(X) is as above and y_k denotes the probability that the sample truly belongs to class k;
differentiating the cost function yields the gradient (residual):

$$\tilde{y}_{ik} = y_{ik} - P_{k,m-1}(X_i)$$
where P_{k,m-1}(X_i) denotes the probability that sample X_i belongs to class k at iteration m-1, and P_{k,m-1}(X) the probability that a sample belongs to a given class at iteration m-1, the other symbols being consistent with the above;
s426, learn a regression tree with J leaf nodes along the gradient direction:

$$\{R_{jkm}\}_{j=1}^{J} = J\text{-leaf tree}\big(\{\tilde{y}_{ik}, X_i\}_{i=1}^{N}\big)$$

Taking all input samples $\{X_i\}_{i=1}^{N}$ and the residuals of their probabilities on class k as the update direction, a regression tree with J leaves is learned. The basic procedure is the usual one for regression trees: traverse the feature dimensions of the samples and select a feature as split point under the minimum mean-squared-error criterion; learning stops once J leaf nodes have been produced;
s427, the gain of each leaf node is calculated, and the gain calculation formula of each node is as follows:
$$\gamma_{jkm} = \frac{K-1}{K} \cdot \frac{\sum_{X_i \in R_{jkm}} \tilde{y}_{ik}}{\sum_{X_i \in R_{jkm}} |\tilde{y}_{ik}|\,\big(1 - |\tilde{y}_{ik}|\big)}$$
where y_{ik} denotes the probability of sample i on class k;
s428, updating the estimated values of all samples in class K, where the gain obtained in the previous step is calculated based on the gradient, and the estimated values of the samples can be updated by using the gain:
$$F_{km}(X) = F_{k,m-1}(X) + \sum_{j=1}^{J} \gamma_{jkm}\,\mathbf{1}\big(X \in R_{jkm}\big)$$
where F_{km}(X) denotes the estimate of the sample on class k at iteration m, F_{k,m-1}(X) the estimate on class k at iteration m-1, J the number of leaf nodes, and γ_{jkm} the gain of leaf node j on class k at iteration m, consistent with the expressions above;
in class k of iteration m, the estimates F of all samples are obtained from those of the previous iteration m-1 by adding, for each sample, the gain of the leaf node it falls in (the J leaf gains summed against the indicator vector). After M such iterations, the final matrix of estimates of all samples over all classes is obtained, on the basis of which multi-class classification is realized.
5. The big data based base station co-site identification method according to claim 1, wherein the algorithm of Xgboost in step S4 comprises the following steps:
s431, defining the complexity of the tree: firstly splitting a tree into a structure part q and a leaf node weight part w, wherein w is a vector and represents an output value in each leaf node;
$$f_t(x) = w_{q(x)}, \qquad w \in \mathbb{R}^{T}, \; q: \mathbb{R}^{d} \to \{1, \dots, T\}$$
where w is the vector of leaf weights, q the tree structure, x the input, f_t(x) the tree model applied to input x, and T the number of leaves;
a regularization term Ω(f_t) is introduced to control the complexity of the tree and thereby effectively control model overfitting;
$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}$$
where γ is a hyperparameter weighting the number of leaf nodes T, and λ is a hyperparameter weighting w_j^2, the squared output score of leaf node j;
s432, the boosting tree model in XGboost: like GBDT, the boosting model of XGboost fits residuals; the difference is that the criterion for choosing split nodes is not necessarily the minimum squared loss. The loss function is given below; compared with GBDT, a regularization term based on the complexity of the tree model is added:
$$\mathcal{L} = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k)$$

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t)$$
where $\mathcal{L}$ denotes the loss function of the model, i the i-th sample, $\hat{y}_i$ the estimate for the i-th sample, $y_i$ the true value of the i-th sample, $l(\hat{y}_i, y_i)$ the per-sample loss, which is 0 when $\hat{y}_i = y_i$ and 1 otherwise, K the number of trees, and $\Omega(f_k)$ the complexity of the k-th tree as in the formula above;
s433, rewriting the objective function: XGboost directly applies a second-order Taylor expansion to the loss function (this requires the loss function to have continuous first and second derivatives), and the set of samples falling in leaf j is defined as:

$$I_j = \{\, i \mid q(x_i) = j \,\}$$
where I_j is defined as the set of samples assigned to leaf j, i is the i-th sample, x_i the input of the i-th sample, j the j-th leaf node, and q(x_i) the structure function mapping x_i to its leaf;
our objective function can be converted into:
$$Obj^{(t)} = \sum_{j=1}^{T} \Big[ G_j w_j + \tfrac{1}{2}\,(H_j + \lambda)\, w_j^{2} \Big] + \gamma T, \qquad G_j = \sum_{i \in I_j} g_i, \; H_j = \sum_{i \in I_j} h_i$$
where t denotes the t-th tree, T the number of leaf nodes, j the j-th leaf node, I_j as defined above, i the i-th sample, w_j the output score of leaf node j, w_j^2 its square, and λ and γ the weight coefficients;
differentiating with respect to w_j and setting the derivative to zero gives:
$$w_j^{*} = -\frac{G_j}{H_j + \lambda}$$

$$Obj = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T$$

$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i$$

with all symbols consistent with the statements above;
s434, scoring function of the tree structure: the Obj value above is the largest reduction of the objective attainable once a tree structure is fixed, so we may call it the structure score; it can be regarded as a more general analogue of the Gini index for scoring tree structures. To find the structure with the smallest Obj we could enumerate all possibilities and compare their structure scores, but that is computationally prohibitive; in practice a greedy method is used instead: at each step we try to split an existing leaf node (the first leaf being the root), and the gain of a split is:
$$Gain = \frac{1}{2}\left[ \frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda} - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda} \right] - \gamma$$
Gain is the criterion for deciding whether to split: if Gain < 0, the leaf node is not split. In principle every candidate split must still be enumerated; in practice all samples are first sorted by their gradient statistics g_i, so that one scan over the samples yields G_L and G_R for every split position, and the split is then made according to the Gain score.
6. The big-data-based base station co-site identification method according to claim 1, wherein: the verification in step S5 consists in computing the precision, recall and F1 score of each model, with the following formulas:
$$Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
wherein, TP is the number of positive classes judged as positive classes, FP is the number of negative classes judged as positive classes, FN is the number of positive classes judged as negative classes;
from the definitions of recall and precision it can be seen that, to some extent, improving one of the two tends to lower the other; the F1 score therefore gives a combined view of identification performance. Comparing the F1 values of the three models on the test set, the model with the largest F1 is selected as the final model, and its classification results are output.
CN202110509326.7A 2021-05-11 2021-05-11 Base station co-site identification method based on big data Active CN112990382B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110509326.7A CN112990382B (en) 2021-05-11 2021-05-11 Base station co-site identification method based on big data


Publications (2)

Publication Number Publication Date
CN112990382A (en) 2021-06-18
CN112990382B CN112990382B (en) 2023-11-21

Family

ID=76337493

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110509326.7A Active CN112990382B (en) 2021-05-11 2021-05-11 Base station co-site identification method based on big data

Country Status (1)

Country Link
CN (1) CN112990382B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120120887A1 (en) * 2010-11-12 2012-05-17 Battelle Energy Alliance, Llc Systems, apparatuses, and methods to support dynamic spectrum access in wireless networks
CN103907368A (en) * 2011-12-27 2014-07-02 松下电器产业株式会社 Server device, base station device, and identification number establishment method
CN106131953A (en) * 2016-07-07 2016-11-16 上海奕行信息科技有限公司 A kind of method realizing mobile subscriber location based on frequency weighting in community in the period
CN109302714A (en) * 2018-12-07 2019-02-01 南京华苏科技有限公司 Realize that base station location is studied and judged and area covered knows method for distinguishing based on user data
CN112418445A (en) * 2020-11-09 2021-02-26 深圳市洪堡智慧餐饮科技有限公司 Intelligent site selection fusion method based on machine learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
T. Bandh et al., "Automatic Site Identification and Hardware-to-Site Mapping for Base Station Self-configuration", 2011 IEEE 73rd Vehicular Technology Conference (VTC Spring), 18 July 2011.
王旺 (Wang Wang), "基于机器学习的基站覆盖范围仿真" [Simulation of Base Station Coverage Based on Machine Learning], 《电脑与电信》 [Computer & Telecommunication], no. 11, 2018 (published 31 May 2019).

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114401527A (en) * 2021-12-21 2022-04-26 中国电信股份有限公司 Load identification method and device of wireless network and storage medium

Also Published As

Publication number Publication date
CN112990382B (en) 2023-11-21

Similar Documents

Publication Publication Date Title
CN104331635B (en) The method of power optical fiber Communication ray power prediction
CN109617888B (en) Abnormal flow detection method and system based on neural network
CN111148118A (en) Flow prediction and carrier turn-off method and system based on time sequence
CN112712209B (en) Reservoir warehousing flow prediction method and device, computer equipment and storage medium
CN112243249B (en) LTE new access anchor point cell parameter configuration method and device under 5G NSA networking
CN110009614A (en) Method and apparatus for output information
CN109978870A (en) Method and apparatus for output information
CN111523778A (en) Power grid operation safety assessment method based on particle swarm algorithm and gradient lifting tree
CN109787821B (en) Intelligent prediction method for large-scale mobile client traffic consumption
CN111586728B (en) Small sample characteristic-oriented heterogeneous wireless network fault detection and diagnosis method
CN113780345A (en) Small sample classification method and system facing small and medium-sized enterprises and based on tensor attention
CN112990382A (en) Base station common-site identification method based on big data
CN106570657A (en) Power grid evaluation index weight determining method
CN110290466A (en) Floor method of discrimination, device, equipment and computer storage medium
CN114169502A (en) Rainfall prediction method and device based on neural network and computer equipment
Qin et al. A wireless sensor network location algorithm based on insufficient fingerprint information
CN112541634B (en) Method and device for predicting top-layer oil temperature and discriminating false alarm and storage medium
CN113536944A (en) Distribution line inspection data identification and analysis method based on image identification
CN111343664B (en) User positioning method, device, equipment and medium
CN111738878A (en) Bridge stress detection system
CN116958806A (en) Pest identification model updating, pest identification method and device and electronic equipment
CN114066250A (en) Method, device, equipment and storage medium for measuring and calculating repair cost of power transmission project
CN115512174A (en) Anchor-frame-free target detection method applying secondary IoU loss function
CN113807462A (en) AI-based network equipment fault reason positioning method and system
CN112163613A (en) Rapid identification method for power quality disturbance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant