CN111859299A - Big data index construction method, device, equipment and storage medium - Google Patents

Big data index construction method, device, equipment and storage medium

Info

Publication number
CN111859299A
Authority
CN
China
Prior art keywords
index
dimension
indexes
data
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010714909.9A
Other languages
Chinese (zh)
Inventor
陈志兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010714909.9A priority Critical patent/CN111859299A/en
Publication of CN111859299A publication Critical patent/CN111859299A/en
Priority to PCT/CN2020/131753 priority patent/WO2021139427A1/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/18Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393Score-carding, benchmarking or key performance indicator [KPI] analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/04Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08Insurance

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Data Mining & Analysis (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • Marketing (AREA)
  • Finance (AREA)
  • General Engineering & Computer Science (AREA)
  • General Business, Economics & Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Operations Research (AREA)
  • Tourism & Hospitality (AREA)
  • Educational Administration (AREA)
  • Quality & Reliability (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Technology Law (AREA)
  • Pure & Applied Mathematics (AREA)
  • Game Theory and Decision Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Algebra (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the field of big data and discloses a big data index construction method, device, equipment and storage medium. The method comprises the following steps: acquiring and analyzing data to be predicted to construct a plurality of indexes carrying different dimension attribute information; calculating the access frequency of each index according to a linear regression algorithm, and judging whether other dimension tables need to be associated when the index is calculated, so as to determine the type of the index; querying, according to the correspondence table between index types and storage calculation engines and the correspondence table between index types and dimension modeling modes, the storage calculation engine corresponding to the index and the preset dimension table that must be associated to calculate the index; and calling the storage calculation engine through a routing decision engine to execute the preset dimension table and calculate the value corresponding to the index. The method resolves the conflict between calculation time and timeliness for the fixed-dimension big data index I, and also solves the technical problem that only a single data engine and dimension modeling can be used.

Description

Big data index construction method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of big data, in particular to a big data index construction method, a big data index construction device, big data index construction equipment and a big data index storage medium.
Background
With the progress of society and the growth of big data, the development of the fixed-dimension index I faces challenges. The fixed-dimension index I extracts production indexes from production data, a process that requires a large amount of calculation on that data. At the same time, the calculation must be very flexible to accommodate the division into different index levels. With the explosive growth of production data and increasingly flexible application scenarios, a fixed-dimension index I service can no longer provide effective service.
Previous solutions addressed the problem by providing more computing resources or a more powerful computing engine, but these also consume substantial resources when the data volume surges. On the calculation model side, flexibility of index calculation has been achieved, but at the cost of longer index calculation time; and, to keep the calculation model consistent with the computing resources, fixed-dimension big data index I calculation has been limited to a single calculation engine.
Disclosure of Invention
The invention mainly aims to solve the technical problem that only a single data engine and dimension modeling can be used.
The invention provides a big data index construction method in a first aspect, which comprises the following steps:
acquiring data to be predicted;
analyzing the data to be predicted, and constructing a plurality of indexes carrying different dimension attribute information;
calculating the access frequency of the index according to a linear regression algorithm, and judging whether the index is associated with a preset dimension table or not;
determining an index type of the index based on the access frequency, wherein the index type comprises an index of multi-dimensional aggregation and an index of a fixed dimension;
based on the index type, determining a storage calculation engine and a dimensional modeling mode corresponding to the index according to a corresponding relation table between a preset index type and the storage calculation engine and a corresponding relation table between the index type and the dimensional modeling mode of the index;
determining a preset dimension table associated with the index according to the dimension modeling mode, wherein the preset dimension table comprises a dimension table constructed based on the dimension modeling mode corresponding to the index type or a dimension table constructed based on all the dimension modeling modes;
and calling the storage calculation engine by using a routing decision engine to execute the preset dimension table, and calculating an index value corresponding to the index.
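The last three steps above (table lookup, then routing to an engine) can be illustrated with a minimal Python sketch. All identifiers, the contents of the two correspondence tables, and the stand-in engines are assumptions made for illustration; the source names the tables and the routing decision engine but does not give their contents or interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical labels for the two index types named in the text.
MULTI_DIM = "multi_dimensional_aggregation"
FIXED_DIM = "fixed_dimension"

# Assumed correspondence tables: index type -> engine, index type -> modeling mode.
ENGINE_BY_TYPE: Dict[str, str] = {
    MULTI_DIM: "semi_aggregation_engine",
    FIXED_DIM: "aggregation_engine",
}
MODELING_BY_TYPE: Dict[str, str] = {
    MULTI_DIM: "dimensional_modeling",
    FIXED_DIM: "wide_table_modeling",
}


@dataclass(frozen=True)
class Route:
    engine: str
    modeling_mode: str


def decide_route(index_type: str) -> Route:
    """Look up the engine and modeling mode for an index type."""
    if index_type not in ENGINE_BY_TYPE or index_type not in MODELING_BY_TYPE:
        raise ValueError(f"unknown index type: {index_type!r}")
    return Route(ENGINE_BY_TYPE[index_type], MODELING_BY_TYPE[index_type])


def compute_index_value(
    index_type: str,
    rows: List[float],
    engines: Dict[str, Callable[[List[float]], float]],
) -> float:
    """Route the calculation to the engine registered for this index type."""
    route = decide_route(index_type)
    return engines[route.engine](rows)


# Stand-in engines: both simply sum the rows they are given.
demo_engines = {
    "semi_aggregation_engine": lambda rows: float(sum(rows)),
    "aggregation_engine": lambda rows: float(sum(rows)),
}
```

Keeping the two tables as plain dictionaries means a new engine or modeling mode can be added without touching the routing function, which is one plausible reading of why the text separates the lookup from the call.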
Optionally, in a first implementation manner of the first aspect of the present invention, the analyzing the data to be predicted to construct multiple indexes carrying different dimensional attribute information includes:
analyzing the data to be predicted, and defining a plurality of indexes;
grading the indexes by using a preset model, and adding dimension attributes;
and combining the indexes with the dimension attributes to obtain a plurality of indexes having different dimension attributes.
Optionally, in a second implementation manner of the first aspect of the present invention, before the calculating, according to a linear regression algorithm, an access frequency of the index, and determining whether the index is associated with a preset dimension table, the method further includes:
acquiring historical data containing the indexes, wherein the historical data comprises the indexes in a specific period, the access times of the indexes in the specific period, and index factors influencing the access times of the indexes in the specific period;
taking the historical data as sample data, performing partial correlation analysis on the sample data, extracting indexes, and respectively establishing a mapping relation equation of the indexes and corresponding index factors;
and performing a t-test on each of the mapping relation equations to determine the main index factors influencing the index access frequency.
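As a rough illustration of the t-test step, the following Python sketch fits a one-factor regression of access counts on each candidate factor and keeps the factors whose slope t-statistic is large in absolute value. The factor names, the sample numbers and the cut-off of 2.0 are all assumptions; in practice the critical value would be read from the t distribution with n - 2 degrees of freedom, and the partial correlation analysis mentioned above is not shown.

```python
import math
from typing import Dict, List, Sequence


def slope_t_statistic(xs: Sequence[float], ys: Sequence[float]) -> float:
    """t-statistic of the slope in the simple regression y = a + b*x."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("factor has no variation")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares and standard error of the slope.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    std_err = math.sqrt(sse / (n - 2) / sxx)
    if std_err == 0:
        return math.copysign(math.inf, slope) if slope != 0 else 0.0
    return slope / std_err


def main_factors(
    factors: Dict[str, Sequence[float]],
    access_counts: Sequence[float],
    critical_t: float = 2.0,  # assumed cut-off, not taken from the source
) -> List[str]:
    """Names of factors whose slope is significant under the assumed cut-off."""
    return [
        name
        for name, values in factors.items()
        if abs(slope_t_statistic(values, access_counts)) > critical_t
    ]


# Hypothetical history: access counts over 8 periods and two candidate factors.
visits = [10, 12, 15, 19, 22, 24, 30, 33]
history = {
    "promo_days": [1, 2, 3, 4, 5, 6, 7, 8],
    "unrelated": [3, 1, 4, 1, 5, 9, 2, 6],
}
```

With this made-up history, only `promo_days` would be retained as a main index factor.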
Optionally, in a third implementation manner of the first aspect of the present invention, the calculating, according to a linear regression algorithm, an access frequency of the index, and determining whether the index is associated with a preset dimension table includes:
determining main index factors influencing the index access frequency based on a linear regression algorithm;
establishing a mapping relation equation of the index and the main index factor, and predicting a parameter value of the main index factor by adopting an elasticity coefficient method;
and substituting the parameter value of the index factor into the mapping relation equation, and calculating the access frequency of the index.
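The two calculations in this implementation can be sketched as follows. The source does not define the elasticity coefficient method, so the formula used here (next value = base value × (1 + elasticity × growth rate of a driver variable)) is one common formulation and is an assumption, as are the function names, factor names and numbers.

```python
from typing import Dict


def forecast_by_elasticity(
    base_value: float, elasticity: float, driver_growth_rate: float
) -> float:
    """Forecast a factor value from an assumed elasticity to a driver variable.

    elasticity = (% change in factor) / (% change in driver), so the factor
    is expected to grow by elasticity * driver_growth_rate.
    """
    return base_value * (1.0 + elasticity * driver_growth_rate)


def predict_access_frequency(
    weights: Dict[str, float], factor_values: Dict[str, float], bias: float
) -> float:
    """Substitute factor values into the linear mapping relation equation."""
    missing = set(weights) - set(factor_values)
    if missing:
        raise KeyError(f"missing factor values: {sorted(missing)}")
    return sum(weights[name] * factor_values[name] for name in weights) + bias


# Hypothetical numbers: 10 promotion days this period, expected to grow
# with an elasticity of 1.5 against a driver that grows by 20%.
next_promo_days = forecast_by_elasticity(10.0, 1.5, 0.2)
predicted = predict_access_frequency(
    {"promo_days": 3.0}, {"promo_days": next_promo_days}, bias=5.0
)
```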
Optionally, in a fourth implementation manner of the first aspect of the present invention, the determining, based on the access frequency, an index type of the index includes:
if the access frequency of the index is greater than a preset threshold value and other dimension tables need to be associated when the index is calculated, the index type is the index type requiring multi-dimensional aggregation;
and if the access frequency of the index is greater than the preset threshold value and no other dimension tables need to be associated when the index is calculated, the index type is the fixed-dimension index type.
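The two-branch rule above can be written as a small Python function. The enum and function names are hypothetical. Note that both branches in the text require the access frequency to exceed the threshold; the source does not say how an index at or below the threshold is typed, so this sketch returns `None` in that case rather than guessing.

```python
from enum import Enum
from typing import Optional


class IndexType(Enum):
    MULTI_DIMENSIONAL_AGGREGATION = "multi_dimensional_aggregation"
    FIXED_DIMENSION = "fixed_dimension"


def classify_index(
    access_frequency: float,
    threshold: float,
    needs_other_dimension_tables: bool,
) -> Optional[IndexType]:
    """Apply the two-branch rule; return None when the text gives no rule."""
    if access_frequency <= threshold:
        # Not covered by the text: left undecided on purpose.
        return None
    if needs_other_dimension_tables:
        return IndexType.MULTI_DIMENSIONAL_AGGREGATION
    return IndexType.FIXED_DIMENSION
```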
Optionally, in a fifth implementation manner of the first aspect of the present invention, after the determining, based on the access frequency, an index type of the index, the method further includes:
inquiring a model construction method corresponding to the index from a corresponding relation table between a preset index type and the model construction method based on the type of the index;
if the index is of the index type requiring multi-dimensional aggregation, constructing a random report and/or a semi-aggregated report by using dimensional modeling, and storing the random report and/or the semi-aggregated report to a non-aggregation engine and/or a semi-aggregation engine;
and if the index is of the fixed-dimension index type, modeling by using a wide table, constructing an aggregated report, and storing the aggregated report to an aggregation engine.
Optionally, in a sixth implementation manner of the first aspect of the present invention, after the querying, based on the type of the index, the storage computation engine corresponding to the index and the preset dimension table required to be associated to compute the index according to a preset correspondence table between the index type and the storage computation engine and a correspondence table between the index type and a dimension modeling manner of the index, where the preset dimension table includes a dimension table constructed based on the dimension modeling manner corresponding to the index type or a dimension table constructed based on all the dimension modeling manners, the method further includes:
if the index is an index requiring multi-dimensional aggregation, degrading the index and storing it into a random report or a semi-aggregated report;
querying the storage calculation engine corresponding to the index requiring multi-dimensional aggregation and the preset dimension table to be associated, and determining the dimension table that needs to be associated when the index requiring multi-dimensional aggregation is calculated;
if the index is a fixed-dimension index, storing all fields in the dimension to an aggregated report by using wide table modeling;
and querying the storage calculation engine corresponding to the fixed-dimension index, and storing the aggregated report to an aggregation engine.
The second aspect of the present invention provides a big data index constructing apparatus, including:
the first acquisition module is used for acquiring data to be predicted;
the first construction module is used for analyzing the data to be predicted so as to construct a plurality of indexes carrying different dimension attribute information;
the judging module is used for calculating the access frequency of the index according to a linear regression algorithm and judging whether the index is associated with a preset dimension table or not;
a first determination module, configured to determine an index type of the index based on the access frequency, where the index type includes an index of multidimensional aggregation and an index of fixed dimension;
the second determination module is used for determining a storage calculation engine and a dimensional modeling mode corresponding to the index according to a corresponding relation table between a preset index type and a storage calculation engine and a corresponding relation table between the index type and the dimensional modeling mode of the index based on the index type;
the third determining module is used for determining a preset dimension table associated with the index according to the dimension modeling mode, wherein the preset dimension table comprises a dimension table constructed based on the dimension modeling mode corresponding to the index type or a dimension table constructed based on all the dimension modeling modes;
and the calculation module is used for calling the storage calculation engine to execute the preset dimension table by utilizing a routing decision engine, and calculating the index value corresponding to the index.
Optionally, in a first implementation manner of the second aspect of the present invention, the first building module is specifically configured to:
analyzing the data to be predicted, and defining a plurality of indexes;
grading the indexes by using a preset model, and adding dimension attributes;
and combining the indexes with the dimension attributes to obtain a plurality of indexes having different dimension attributes.
Optionally, in a second implementation manner of the second aspect of the present invention, the big data index constructing apparatus further includes:
the second acquisition module is used for acquiring historical data containing the indexes, wherein the historical data comprises the indexes in a specific period, the access times of the indexes in the specific period, and index factors influencing the access times of the indexes in the specific period;
the analysis module is used for taking the historical data as sample data, performing partial correlation analysis on the sample data, extracting indexes, and respectively establishing a mapping relation equation of the indexes and corresponding index factors;
and the checking module is used for performing a t-test on each of the mapping relation equations and determining the main index factors influencing the index access frequency.
Optionally, in a third implementation manner of the second aspect of the present invention, the determining module is specifically configured to:
determining main index factors influencing the index access frequency based on a linear regression algorithm;
establishing a mapping relation equation of the index and the main index factor, and predicting a parameter value of the main index factor by adopting an elasticity coefficient method;
and substituting the parameter value of the index factor into the mapping relation equation, and calculating the access frequency of the index.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the first determining module is specifically configured to:
if the access frequency of the index is greater than a preset threshold value and other dimension tables need to be associated when the index is calculated, the index type is the index type requiring multi-dimensional aggregation;
and if the access frequency of the index is greater than the preset threshold value and no other dimension tables need to be associated when the index is calculated, the index type is the fixed-dimension index type.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the big data index constructing apparatus further includes:
the second query module is used for querying the model construction method corresponding to the index from a corresponding relation table between the preset index type and the model construction method based on the type of the index;
the second construction module is used for constructing a random report and/or a semi-aggregated report by using dimensional modeling when the index is an index type index needing multi-dimensional aggregation, and storing the random report and/or the semi-aggregated report to a non-aggregation engine and/or a semi-aggregation engine;
and the first storage module is used for modeling by using a wide table if the index is an index type index with fixed dimensionality, constructing an aggregated report and storing the aggregated report to an aggregation engine.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the big data index constructing apparatus further includes:
the index degradation module is used for degrading the index when the index is an index requiring multi-dimensional aggregation and storing it into a random report or a semi-aggregated report;
a fourth determining module, configured to query a storage calculation engine corresponding to the index requiring multidimensional aggregation and a preset dimension table required to be associated, and determine a dimension table required to be associated when the index type index requiring multidimensional aggregation is calculated;
the second storage module is used for storing all fields in the dimension to the aggregated report by utilizing wide table modeling when the index is the index of the fixed dimension;
and the third storage module is used for inquiring a storage calculation engine corresponding to the index type index of the fixed dimension and storing the aggregated report to an aggregation engine.
The third aspect of the present invention provides a big data index constructing apparatus, including: a memory having instructions stored therein and at least one processor, the memory and the at least one processor interconnected by a line; the at least one processor invokes the instructions in the memory to cause the big data index construction apparatus to perform the big data index construction method described above.
A fourth aspect of the present invention provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to execute the above-mentioned big data index construction method.
According to the technical scheme provided by the invention, the data to be predicted are mainly obtained and analyzed to construct indexes with multiple dimensional attributes, and the calculation requirements of the indexes are determined according to the access frequency of the indexes predicted by a linear regression algorithm. And selecting a proper mode to store the indexes to the corresponding storage calculation engines according to the calculation requirements of the indexes, calculating the index values of the indexes, solving the contradiction between the calculation time consumption and the timeliness of the index I of the big data with fixed dimensionality, and simultaneously solving the technical problem that only a single data engine and dimensionality modeling can be used.
Drawings
FIG. 1 is a schematic diagram of a first embodiment of a big data index construction method according to an embodiment of the present invention;
FIG. 2 is a diagram of a second embodiment of a big data index construction method according to an embodiment of the present invention;
FIG. 3 is a diagram of a third embodiment of a big data index construction method according to an embodiment of the present invention;
FIG. 4 is a diagram of a first embodiment of a big data index building apparatus according to an embodiment of the present invention;
FIG. 5 is a diagram of a second embodiment of a big data index building apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an embodiment of a big data index building device in the embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a big data index construction method, a big data index construction device, big data index construction equipment and a big data index storage medium. According to the calculation requirement of the index, an appropriate mode is selected to store the index to the corresponding storage calculation engine, the index value of the index is calculated, the contradiction between the calculation time consumption and the timeliness of the index I of the big data fixed dimension is solved, and the technical problem that only a single data engine and dimension modeling can be used is solved.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For convenience of understanding, a specific flow of the embodiment of the present invention is described below, and with reference to fig. 1, a first embodiment of a big data index constructing method according to the embodiment of the present invention includes:
101. acquiring data to be predicted;
In this embodiment, all data to be predicted, which contains a plurality of index tags, is obtained, for example the premium of a certain insurance type under a certain activity, or the premium of all insurance types under a certain activity.
In this embodiment, the data to be predicted refers to data containing an index to be calculated. The data is analyzed to determine the index information it contains, and common attributes are then added to construct tags of multiple (different) basic dimensions under those attributes. For example, taking the index "premium" and adding common attributes to it, a plurality of different indexes can be constructed, such as "premium of car insurance", "premium under the Double 11 activity", or "premium of car insurance under the Double 11 activity".
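The "premium" example can be sketched in Python by enumerating every non-empty combination of the added attributes. The function name, the dictionary layout and the attribute keys are assumptions for illustration; the source only gives the resulting index names.

```python
from itertools import combinations
from typing import Dict, List


def build_indexes(base_index: str, attributes: Dict[str, str]) -> List[dict]:
    """Combine one base index with every non-empty subset of its attributes."""
    names = list(attributes)
    indexes = []
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            indexes.append(
                {
                    "index": base_index,
                    "dimensions": {name: attributes[name] for name in combo},
                }
            )
    return indexes


# Hypothetical attribute keys; the values come from the example in the text.
premium_indexes = build_indexes(
    "premium",
    {"insurance_type": "car insurance", "activity": "Double 11"},
)
```

For the two attributes above this yields three indexes, matching the three named in the example: car insurance only, Double 11 only, and car insurance under Double 11.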
102. Analyzing the data to be predicted to construct a plurality of indexes carrying different dimension attribute information;
In this embodiment, the index is an indicator tag that the enterprise has obtained through data analysis.
In this embodiment, a basic dimension is the dimension of an index calculation to which a basic attribute of the company has been added, and a common attribute is an attribute shared across companies.
103. Calculating the access frequency of the index according to a linear regression algorithm, and judging whether the index is associated with a preset dimension table or not;
In this embodiment, to judge whether the index needs to be associated with data of other dimensions during calculation, the access frequency of the index must first be predicted; that is, it is determined whether the index needs to be counted (accessed) frequently or is counted according to a certain usage rule. The calculation requirement of the index is then determined from its access frequency and from whether other data needs to be associated during its calculation, and a suitable storage calculation engine is selected accordingly.
The dimension table in this embodiment can be understood, to a certain extent, as a data table containing information on a plurality of indexes (tags). For example, a dimension table for the statistic "total amount of premium in 2019" of insurance company xx contains the tags "time: 2019.01, 2019.02, ..., 2019.12" and "insurance type: car insurance, life insurance, critical illness insurance, children's insurance", and includes indexes such as the 2019.01 car insurance premium, the 2019.01 critical illness insurance premium, the 2019.03 life insurance premium and the 2019.03 children's insurance premium.
In this embodiment, whether an index needs to be associated with another dimension table during calculation is determined by whether an index in another dimension table (data table) must be introduced to calculate it. For example, suppose the index "total amount of vehicle insurance premium from 2017 to 2019" is to be calculated and the local dimension table contains only the index information for "total amount of vehicle insurance premium in 2018". In that case, the index information in the dimension tables for "total amount of vehicle insurance premium in 2017" and "total amount of vehicle insurance premium in 2019" must both be associated for the calculation. As another example, when the index "total amount of vehicle insurance premium from April to June 2019" is calculated, the local dimension table of monthly vehicle insurance premiums for 2019 already contains the premium for each of months 1 to 12, so no index information in other dimension tables needs to be associated.
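The two examples above reduce to a coverage check: does the local dimension table already hold every period the index needs? The Python sketch below encodes that reading. Representing periods as strings and the function names are assumptions; the source does not describe how the check is implemented.

```python
from typing import Iterable, List


def missing_periods(required: Iterable[str], local: Iterable[str]) -> List[str]:
    """Periods the index needs that the local dimension table lacks."""
    return sorted(set(required) - set(local))


def needs_other_dimension_tables(
    required: Iterable[str], local: Iterable[str]
) -> bool:
    """True when at least one required period must come from another table."""
    return bool(missing_periods(required, local))


# Example 1: 2017 to 2019 total, local table only holds 2018.
yearly_required = ["2017", "2018", "2019"]
yearly_local = ["2018"]

# Example 2: April to June 2019, local table holds all 12 months of 2019.
monthly_required = ["2019-04", "2019-05", "2019-06"]
monthly_local = [f"2019-{month:02d}" for month in range(1, 13)]
```

In the first case the check reports that 2017 and 2019 are missing, so other dimension tables must be associated; in the second case nothing is missing.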
Linear regression in this embodiment refers to a regression analysis that models the relationship between one or more independent variables and a dependent variable using a least squares function called the linear regression equation. Such a function is a linear combination of one or more model parameters called regression coefficients. The case of only one independent variable is called simple regression, and the case of more than one independent variable is called multiple regression. In linear regression, data is modeled using linear prediction functions, and the unknown model parameters are estimated from the data; these models are called linear models.
In this embodiment, linear regression is mainly used for prediction or mapping: it can be used to fit a prediction model to an observed data set of y and X values. Once such a model has been fitted, then for a new value of X given without its paired y, the fitted model can be used to predict the value of y.
In this embodiment, the access frequency of the index is predicted by using a linear regression algorithm. For example, in the course of using the indexes, it may be found that some indexes are used frequently, or that their access frequency is affected by certain other data. Such indexes share the same characteristics and the relationship is linear, so these characteristics can be used to estimate which indexes have a higher access frequency or are counted according to a rule. For example, for the same kind of activity, the indexes viewed for Double 11 also need to be counted for Double 12, so the Double 12 indexes can be calculated and aggregated in advance based on the characteristics of Double 11.
In this embodiment, regression means predicting new data, such as stock trends, based on existing data. Linear regression describes the relationship between data as accurately as possible with a straight line, so that when new data appears, a value can be predicted simply.
The linear regression model is as follows:
$$h(x) = w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + w_n x_n + b$$
The model resulting from linear regression is not necessarily a straight line:
(1) when there is only one variable, the model is a straight line in the plane;
(2) when there are two variables, the model is a plane in space;
(3) with more variables, the model will be higher dimensional.
The linear regression model has good interpretability: the degree of influence of each feature on the result can be read directly from the weight W (here, a feature is an index factor that influences the access frequency of an index, as described below). Linear regression applies to data sets in which a linear relationship exists between X and y; a scatter plot can be drawn with computer assistance to check whether such a relationship exists. The data are then fitted with a straight line so that the sum of the distances from all points to the line is minimized.
In practice, linear regression usually uses the sum of the squares of the residuals, where a residual is the distance from a point to the line measured parallel to the y-axis rather than the perpendicular distance. The sum of the squares of the residuals divided by the sample size n is the mean square error, which is used as the loss function (cost function) of the linear regression model. Minimizing the mean square error in place of the sum of the distances from all points to the line is called the least squares method.
Loss function formula:
Figure BDA0002597812570000071
where $h(x) = w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + w_n x_n + b$.
Finally, solving to obtain the calculation formulas of w and b as follows:
Figure BDA0002597812570000072
In this embodiment, when predicting the access frequency of index A in given sample data, assume that the input data set D has n samples and d features:
$$D = \{(x^{(1)}, y_1), (x^{(2)}, y_2), \dots, (x^{(n)}, y_n)\}$$
where the i-th sample is represented as:
$$(x^{(i)}, y_i) = (x_1^{(i)}, x_2^{(i)}, \dots, x_d^{(i)}, y_i)$$
The linear model makes predictions by forming a linear combination of the features. The hypothesis function (1) is:
$$h_\theta(x_1, x_2, \dots, x_d) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_d x_d$$
where $\theta_0, \theta_1, \dots, \theta_d$ are the model parameters. Let $x_0 = 1$ and let each sample $x^{(i)} = (x_0^{(i)}, x_1^{(i)}, \dots, x_d^{(i)})$ be a row vector, so that X is an $n \times (d+1)$ matrix and $\theta$ is a $(d+1) \times 1$ vector. The hypothesis function (1) can then be expressed as $h_\theta(X) = X\theta$:
Figure BDA0002597812570000081
The loss function is the mean square error, i.e.
Figure BDA0002597812570000082
The parameters are solved by the least squares method. Differentiating the loss function $J(\theta)$ with respect to $\theta$ gives:
Figure BDA0002597812570000083
Setting
Figure BDA0002597812570000084
gives
$$\theta = (X^{T}X)^{-1}X^{T}Y$$
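The closed-form solution above can be illustrated with a short NumPy sketch. It is not the embodiment's implementation: the function names `fit_least_squares` and `predict` and the sample values are hypothetical, and the system $(X^{T}X)\theta = X^{T}Y$ is solved directly rather than by forming the inverse, which is the usual numerically safer equivalent.

```python
import numpy as np


def fit_least_squares(features, y):
    """Estimate theta from (X^T X) theta = X^T y, with x_0 = 1 prepended."""
    features = np.asarray(features, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend the constant column x_0 = 1 so that theta_0 acts as the intercept.
    X = np.column_stack([np.ones(len(features)), features])
    # Solving the normal equations avoids computing the explicit inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)


def predict(theta, features):
    """Evaluate h_theta(X) = X theta for new feature rows."""
    features = np.asarray(features, dtype=float)
    X = np.column_stack([np.ones(len(features)), features])
    return X @ theta


# Made-up samples generated from y = 1 + 2*x1 + 3*x2 (illustrative only).
samples = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
targets = [1, 3, 4, 6, 8]
theta = fit_least_squares(samples, targets)
new_prediction = predict(theta, [[3, 2]])
```

Because the made-up targets are exactly linear in the features, the recovered parameters are approximately 1, 2 and 3; with real access data the fit would only be approximate.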
In this embodiment, a linear regression algorithm is used to determine the important indexes in the sample data, and a mapping relation equation is established between each index and the index factors that affect its access frequency. The main independent variables (i.e., the main index factors) $a_1, a_2, a_3, \dots, a_n$ are determined according to the weights W of all index factors that influence the access frequency of index M. The mapping relation equation between the main index factors and the index is then established: $y = \beta_0 + \beta_1 a_1 + \beta_2 a_2 + \dots + \beta_n a_n$
Here y is the access frequency of index M (in a certain time period), and $a_1, a_2, a_3, \dots, a_n$ are the index factors influencing the access frequency of index M (in that time period). Taking index M as an example, the access frequency of index M in the promotion activities of 2017 to 2019 and the values of the index factors $a_1, a_2, a_3, \dots, a_n$ influencing it are collected. The data are entered into the SPSS tool, which gives the equation $y = \beta_0 + \beta_1 a_1 + \beta_2 a_2 + \beta_3 a_3 + \dots + \beta_n a_n$. Because the correlation coefficient of the index factors and the adjusted multiple determination coefficient are both very close to 1, the model has a good goodness of fit and its linear relationship is pronounced. The F test shows that $a_1, a_2, a_3, \dots, a_n$ are the main index factors. Plotting each predicted value against the actual value of index M in Python shows that the model is reasonable and can be used to predict the access frequency of index M (in a specific time period). The operation is repeated to establish a unary linear regression model between each main index factor and the access frequency of index M (in a specific time period); the change in the access frequency of index M in a specific time period is estimated and input into the prediction model as input data, finally yielding the predicted access frequency of index M (in that time period). In this way, the linear regression prediction method can predict which data will be accessed with high frequency: data accessed with high frequency need pre-aggregation, while data that do not need high-frequency access can use other storage engines.
104. Determining an index type of the index based on the access frequency, wherein the index type comprises an index of multi-dimensional aggregation and an index of a fixed dimension;
In this embodiment, the type of the index is determined according to the access frequency of the index and whether other dimension tables need to be associated when the index is calculated. For example, some indexes can be calculated only by associating a plurality of dimension tables, whereas other indexes can be calculated without associating any other dimension table.
In this embodiment, there are two index types. An index of the type requiring multi-dimensional aggregation must be calculated in association with other dimension tables. An index of the fixed-dimension type is calculated using only the data in the wide table to which it belongs, without associating data from other dimension tables.
105. Based on the index type, determining a storage calculation engine and a dimensional modeling mode corresponding to the index according to a corresponding relation table between a preset index type and the storage calculation engine and a corresponding relation table between the index type and the dimensional modeling mode of the index;
In this embodiment, according to the type of the index, the storage calculation engine corresponding to the index and the information on the preset dimension tables that must be associated to calculate the index are queried from the preset correspondence table between index types and storage calculation engines.
In this embodiment, different types of indexes are stored in different storage calculation engines. For example, some indexes are stored in random reports or semi-aggregated reports; they need indexes from other dimension tables during calculation, so when they are queried, their values can be calculated only after the table containing them has been associated with the other dimension tables. Indexes with fixed dimensions do not need to be associated with other dimension tables during calculation, so the aggregated reports built from them can be stored in the aggregation engine and calculated in advance.
In this embodiment, whether a plurality of dimension tables must be associated (for calculation) when the index (value) is queried is determined according to the type of the index, and if so, the corresponding dimension tables are queried. For example, to calculate the value of the fixed index "premium of car insurance in 2018 Double 11 activities", the data of the three tables of different dimensions, namely the table "premium of car insurance in 2018", the table "premium of car insurance in 2018" and the table "premium of Double 11 activities in 2018", only need to be stored in one table, namely a wide table, and no other data report needs to be associated during calculation. By contrast, to calculate the index "premium of 2018", the table "premium of car insurance in 2018", the table "premium of property insurance in 2018", the table "premium of life insurance in 2018", ..., the table "premium of XX insurance in 2018", that is, the tables of all premium types, must be associated together to obtain the index value.
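The difference between the two calculations can be sketched in a few lines. This is only an illustration under stated assumptions: the table layouts, field names and premium amounts are invented, and the helper names `fixed_dimension_value` and `associated_total` are hypothetical.

```python
# A wide table holding every field needed by the fixed-dimension index.
wide_table = [
    {"year": 2018, "activity": "Double 11", "insurance_type": "car", "premium": 120.0},
    {"year": 2018, "activity": "Double 11", "insurance_type": "life", "premium": 80.0},
    {"year": 2018, "activity": "Double 12", "insurance_type": "car", "premium": 100.0},
]

# Separate per-type tables that must be associated for the aggregated index.
premium_tables_2018 = {
    "car": [120.0, 100.0],
    "property": [60.0],
    "life": [80.0],
}


def fixed_dimension_value(rows, **filters):
    """Sum premiums from one wide table; no other table is consulted."""
    return sum(
        row["premium"]
        for row in rows
        if all(row[field] == value for field, value in filters.items())
    )


def associated_total(tables):
    """Sum premiums across all per-type tables (multi-table association)."""
    return sum(sum(premiums) for premiums in tables.values())


car_double11_2018 = fixed_dimension_value(
    wide_table, year=2018, activity="Double 11", insurance_type="car"
)
premium_2018 = associated_total(premium_tables_2018)
```

The first value is read from a single table, while the second can only be produced after every premium-type table has been brought together.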
106. Determining a preset dimension table associated with the index according to the dimension modeling mode, wherein the preset dimension table comprises a dimension table constructed based on the dimension modeling mode corresponding to the index type or a dimension table constructed based on all the dimension modeling modes:
In this embodiment, indexes of different types correspond to different modeling modes and therefore generate different types of reports, and the generated reports are stored in different data storage calculation engines according to their report types.
107. And calling the storage calculation engine by using a routing decision engine to execute the preset dimension table, and calculating an index value corresponding to the index.
In this embodiment, the routing decision engine directs each request to the corresponding storage calculation engine according to the correspondence between the report to which the index belongs and the calculation engine in which that report is stored. That is, depending on the index being queried, the routing decision engine selects the storage calculation engine for the current calculation request and distributes the request to it to calculate the value of the index. For example, if the viewed index is a basic (fixed) index, the query (calculation) request is forwarded to a Hive-based database (a non-aggregated database that supports multi-table association calculation); if a pre-calculated index is to be viewed, the query (calculation) request is forwarded to a database based on dry.
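The dispatch performed by the routing decision engine can be sketched as a table lookup followed by a call. The class below is a hypothetical illustration, not the patented engine: the type labels, engine names and stub query functions are all assumptions, and the particular mapping (fixed-dimension indexes to an aggregation engine, multi-dimensional indexes to a non-aggregation engine) is supplied by the caller rather than fixed in the code.

```python
from typing import Callable, Dict, Tuple

MULTI_DIM = "multi_dimensional_aggregation"  # hypothetical type label
FIXED_DIM = "fixed_dimension"                # hypothetical type label


class RoutingDecisionEngine:
    """Forward an index query to the engine registered for its index type."""

    def __init__(self, engine_by_type: Dict[str, str],
                 engines: Dict[str, Callable[[str], float]]) -> None:
        self._engine_by_type = engine_by_type  # correspondence table
        self._engines = engines                # engine name -> query function

    def route(self, index_name: str, index_type: str) -> Tuple[str, float]:
        if index_type not in self._engine_by_type:
            raise ValueError(f"unknown index type: {index_type!r}")
        engine_name = self._engine_by_type[index_type]
        return engine_name, self._engines[engine_name](index_name)


def query_aggregation_engine(index_name: str) -> float:
    return 120.0  # stub: a pre-aggregated value would be looked up here


def query_non_aggregation_engine(index_name: str) -> float:
    return 360.0  # stub: tables would be associated and summed here


router = RoutingDecisionEngine(
    engine_by_type={FIXED_DIM: "aggregation_engine",
                    MULTI_DIM: "non_aggregation_engine"},
    engines={"aggregation_engine": query_aggregation_engine,
             "non_aggregation_engine": query_non_aggregation_engine},
)
routed = router.route("premium in 2018", MULTI_DIM)
```

Keeping the correspondence table as data means a different assignment of index types to engines only requires changing the dictionary, not the routing logic.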
In this embodiment, the calculation requirement of an index may be understood simply as whether association and calculation of dimension tables are required.
According to the technical scheme provided by the invention, the data to be predicted are obtained and analyzed to construct indexes with multiple dimension attributes, and the calculation requirement of each index is determined according to its access frequency as predicted by a linear regression algorithm. According to that calculation requirement, an appropriate mode is selected to store the index in the corresponding storage calculation engine and to calculate its index value. This resolves the conflict between calculation time and timeliness for big data indexes of fixed dimensions, and solves the technical problem that only a single data engine and a single dimension modeling mode can be used.
Referring to fig. 2, a second embodiment of the big data index constructing method according to the embodiment of the present invention includes:
201. acquiring data to be predicted;
202. analyzing the data to be predicted, and defining a plurality of indexes;
In this embodiment, the acquired data, which contain a plurality of index tags, are parsed, and a plurality of definable tags are obtained from them, such as "premium", "premium for life insurance", "premium for property insurance under Double 12 activities" and "premium for car insurance under Double 11 activities".
In this embodiment, an index is a unit or method for measuring the degree of development of things; in IT it is commonly called a measure. Examples include population, GDP, revenue, number of users, profit margin, retention and coverage. Many companies have their own KPI system, which measures the state of the company's business operations through several key indexes. Indexes are obtained by summary calculations such as summing and averaging, and the summary calculation must be carried out under certain preconditions, such as time, place and cost, that is, what is often called the statistical caliber and range.
203. Grading the indexes by using a preset model, and adding dimension attributes;
In this embodiment, the extracted indexes are graded by using a preset model, and dimension attribute information is added to each index. Taking "premium" as an example, dimension attribute information is added to the index step by step, so that "premium" can be further refined into "enterprise planning premium" and then into "enterprise planning premium of a second-level organization". In addition to the basic dimension attribute information of the index, common attribute dimension information is added, for example whether the index is the enterprise planning premium of a second-level organization participating in an insuring activity.
In this embodiment, a dimension attribute of an index is a certain feature of an object or phenomenon, such as gender, region or time. Time is a common and special dimension: by comparing values before and after a point in time, one can tell whether things are developing well or badly. A comparison over time is called a vertical ratio; for example, the premium of car insurance under the Double 11 activity in 2019 is 10% higher than that under the Double 11 activity in 2018, and the premium of life insurance under the Double 12 activity in 2019 is 20% higher than the premium of life insurance under the Double 11 activity in 2019. The other kind of comparison is the cross ratio, which compares units at the same level, such as the different insurance types in "premium for car insurance in the 2018 Double 11 activities" and "premium for life insurance in the 2018 Double 11 activities".
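Both kinds of comparison reduce to the same relative-difference arithmetic, applied along different dimensions. The sketch below assumes invented premium amounts and a hypothetical helper named `relative_difference`; only the 10% figure mirrors the example in the text.

```python
def relative_difference(value, baseline):
    """Return (value - baseline) / baseline, e.g. 0.10 for a 10% increase."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / baseline


# Vertical ratio: same index, different periods (made-up premiums).
car_double11_2018 = 100.0
car_double11_2019 = 110.0
vertical_ratio = relative_difference(car_double11_2019, car_double11_2018)

# Cross ratio: same period, different insurance types (made-up premiums).
life_double11_2018 = 80.0
cross_ratio = relative_difference(car_double11_2018, life_double11_2018)
```

Here the vertical ratio is about 0.10 (10% higher than the previous year) and the cross ratio is 0.25 (car insurance premium 25% above life insurance premium in the same activity).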
In this embodiment, dimensions may be divided by data type into qualitative dimensions and quantitative dimensions. A dimension whose data type is character (text) data is a qualitative dimension, such as region and gender; a dimension whose data type is numerical data is a quantitative dimension, such as income, age and consumption.
204. Combining the index and the dimension attribute based on the index and the dimension attribute to obtain indexes of a plurality of different dimension attributes;
In this embodiment, the indexes and the dimension attributes are combined according to the indexes and the dimension attribute information to obtain a plurality of indexes carrying different dimension attribute information, such as "premium of car insurance in the Double 11 activities in 2019", "premium of car insurance in the Double 12 activities in 2019", "premium of property insurance in the Double 11 activities in 2019" and "premium of property insurance in the Double 12 activities in 2019".
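One way to picture this combination step is as a Cartesian product of a base index with the values of each dimension attribute. The sketch below is an assumption about how such a combination could be enumerated; the function name `combine_indexes` and the record layout are hypothetical.

```python
from itertools import product


def combine_indexes(base_indexes, dimensions):
    """Pair every base index with every combination of dimension values."""
    names = list(dimensions)
    combined = []
    for base in base_indexes:
        for values in product(*(dimensions[name] for name in names)):
            record = {"index": base}
            record.update(zip(names, values))
            combined.append(record)
    return combined


dimensions = {
    "year": [2019],
    "activity": ["Double 11", "Double 12"],
    "insurance_type": ["car insurance", "property insurance"],
}
indexes = combine_indexes(["premium"], dimensions)
```

With one year, two activities and two insurance types, the base index "premium" expands into the four dimensioned indexes listed in the text.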
205. Determining main index factors influencing the index access frequency based on a linear regression algorithm;
In this embodiment, the indexes of different dimension attributes in the data to be predicted are determined according to a linear regression algorithm, and the index factors that affect the access frequency of those indexes are determined at the same time.
206. Establishing a mapping relation equation of the index and the main index factor, and predicting a parameter value of the main index factor by adopting an elastic coefficient method;
in this embodiment, a mapping equation between an index obtained from data to be predicted and an index factor corresponding to the index is established.
In this embodiment, the elastic coefficient method is used to predict the parameter values of the index factors of the data to be predicted under a specific activity, for example the number of people who purchased car insurance in the month of the 2019 Double 11 activities. The elastic coefficient ET is calculated from the data of the most recent year and the earliest year in the collected historical data, and is used to calculate the access frequency of the index corresponding to the data to be predicted under the specific activity.
The access frequency in this embodiment may also be expressed as a probability value.
207. Substituting the parameter value of the index factor into the mapping relation equation, and calculating the access frequency of the index;
In this embodiment, a mapping relation equation is established between the index obtained from the data to be predicted and the index factors corresponding to the index; substituting the parameter values of the index factors into this equation yields the calculated (predicted) access frequency (probability value) of the index.
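Substituting factor values into the mapping relation equation is a plain weighted sum. The following sketch assumes an equation of the form y = b0 + b1*a1 + ... + bn*an; the function name `predict_access_frequency`, the coefficients and the factor values are all invented for illustration.

```python
def predict_access_frequency(intercept, coefficients, factor_values):
    """Evaluate y = intercept + sum(coefficient_i * factor_value_i)."""
    if len(coefficients) != len(factor_values):
        raise ValueError("one coefficient is required per index factor")
    return intercept + sum(
        coefficient * value
        for coefficient, value in zip(coefficients, factor_values)
    )


# Made-up coefficients and predicted factor values (illustrative only).
frequency = predict_access_frequency(
    intercept=0.05,
    coefficients=[0.002, 0.01],
    factor_values=[100, 30],
)
```

With these made-up numbers the predicted access frequency is about 0.55, which would then be compared with the preset threshold.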
208. If the access frequency of the index is greater than a preset threshold value and other dimension tables need to be associated when the access frequency of the index is calculated, the index is an index type needing multi-dimensional aggregation;
In this embodiment, if the access probability of the index is greater than the preset threshold and other dimension tables must be associated when the index is queried (calculated), the index is determined to be of the index type requiring multi-dimensional aggregation. For example, the index "premium in 2018" must be calculated by associating the table "premium for car insurance in 2018", the table "premium for property insurance in 2018" and the table "premium for life insurance in 2018", so it is an index requiring multi-dimensional aggregation.
209. If the access frequency of the index is greater than a preset threshold value and other dimension tables do not need to be associated when the access frequency of the index is calculated, the index type is an index type with fixed dimensions;
In this embodiment, if the access probability of the index is greater than the preset threshold and no other dimension table needs to be associated when the index is queried (calculated), so that only the data in the table to which the index belongs are used, the index is determined to be of the fixed-dimension index type, that is, a fixed index. For example, the index "premium for car insurance under Double 11 activities in 2018" has three fixed dimensions: 2018, Double 11 activities and car insurance. To calculate it, the three tables of different dimensions, "premium for car insurance in 2018", "premium for car insurance in 2018" and "premium under Double 11 activities in 2018", are modeled as a wide table and stored in one table. During calculation only the data in this (wide) table are queried and no data from other tables need to be associated, so the index "premium for car insurance under Double 11 activities in 2018" is of the fixed-dimension index type, that is, a fixed index.
In this embodiment, a wide table is a single table in which all fields are built, so no other table needs to be associated when the statistics are computed (that is, when the index value is calculated).
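The two classification rules of steps 208 and 209 can be condensed into a small function. This is a sketch under assumptions: the enum and function names are hypothetical, and because the steps do not say what happens when the access frequency does not exceed the threshold, that case simply returns `None` here.

```python
from enum import Enum
from typing import Optional


class IndexType(Enum):
    MULTI_DIM_AGGREGATION = "multi_dimensional_aggregation"
    FIXED_DIMENSION = "fixed_dimension"


def classify_index(access_frequency: float,
                   needs_other_tables: bool,
                   threshold: float) -> Optional[IndexType]:
    """Apply the rules of steps 208 and 209 to one index."""
    if access_frequency <= threshold:
        # Not covered by steps 208/209; left undecided in this sketch.
        return None
    if needs_other_tables:
        return IndexType.MULTI_DIM_AGGREGATION
    return IndexType.FIXED_DIMENSION


premium_2018_type = classify_index(0.8, needs_other_tables=True, threshold=0.5)
car_double11_type = classify_index(0.8, needs_other_tables=False, threshold=0.5)
```

The threshold value 0.5 is arbitrary; in the method it is a preset value.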
210. Based on the index type, determining a storage calculation engine and a dimensional modeling mode corresponding to the index according to a corresponding relation table between a preset index type and the storage calculation engine and a corresponding relation table between the index type and the dimensional modeling mode of the index;
211. determining a preset dimension table associated with the index according to the dimension modeling mode, wherein the preset dimension table comprises a dimension table constructed based on the dimension modeling mode corresponding to the index type or a dimension table constructed based on all the dimension modeling modes;
212. and calling the storage calculation engine by using a routing decision engine to execute the preset dimension table, and calculating an index value corresponding to the index.
According to the technical scheme provided by the invention, the data to be predicted are obtained and analyzed to construct indexes with multiple dimension attributes, and the calculation requirement of each index is determined according to its access frequency as predicted by a linear regression algorithm. According to that calculation requirement, an appropriate mode is selected to store the index in the corresponding storage calculation engine and to calculate its index value. This resolves the conflict between calculation time and timeliness for big data indexes of fixed dimensions, and solves the technical problem that only a single data engine and a single dimension modeling mode can be used.
Referring to fig. 3, a third embodiment of the big data index constructing method according to the embodiment of the present invention includes:
301. acquiring data to be predicted;
302. analyzing the data to be predicted to construct a plurality of indexes carrying different dimension attribute information;
303. acquiring historical data containing the indexes, wherein the historical data comprises the indexes in a specific period, the access times of the indexes in the specific period, and index factors influencing the access times of the indexes in the specific period;
In this embodiment, historical data containing the index to be predicted are obtained. For example, to gain a rough understanding of the basic pattern of the index "car insurance premium for Double 11 activities in 2019", data information containing the index "car insurance premium for Double 11 activities in 2018" must be obtained and analyzed in order to predict the 2019 index. The historical data in this embodiment therefore include the index in a specific period, the number of accesses (access frequency) of the index in that period, and the index factors that may influence the number of accesses of the index in that period. Because the index factors are related to the number of accesses of the index in the specific period, a mapping between the index factors and the index access frequency is established, and the index access frequency is calculated (predicted) from the historical data.
304. Taking the historical data as sample data, performing partial correlation analysis on the sample data, extracting indexes, and respectively establishing a mapping relation equation of the indexes and corresponding index factors;
In this embodiment, the historical data are used as sample data; for example, the data information of "the insurance premium for the Double 11 activities in 2018" is used as the sample data.
In this embodiment, the indexes (tags) extracted from the sample data may be referred to as dependent variables. In a multiple linear regression equation with multiple dependent variables, the relationships among the variables are complex (any two of them are correlated to some degree, which is why they are called correlated variables in partial correlation analysis). Any two correlated variables often show a simple correlation of some degree, but that correlation also contains the influence of the other correlated variables. Simple correlation analysis (i.e., linear correlation analysis) therefore ignores the influence of the other correlated variables on the two under study and does not truly reflect the correlation between them. Only after the influence of the other correlated variables has been eliminated can the nature and closeness of the correlation between two correlated variables be truly reflected. Partial correlation analysis is a statistical analysis method that studies the correlation between two correlated variables while holding the other correlated variables fixed.
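The contrast between simple and partial correlation can be shown with the standard first-order formula r_xy.z = (r_xy - r_xz r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)). The sketch below is not the SPSS procedure used in the embodiment; the function names and the tiny data set are hypothetical, and the data are constructed so that x and y are related only through z.

```python
import math


def pearson(u, v):
    """Simple (linear) correlation coefficient between two sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def partial_correlation(x, y, z):
    """Correlation between x and y with the influence of z held fixed."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Made-up data: x and y each equal z plus an unrelated component.
z = [1, 1, -1, -1]
x = [2, 0, 0, -2]
y = [2, 0, -2, 0]

simple_r = pearson(x, y)
partial_r = partial_correlation(x, y, z)
```

For this constructed data the simple correlation between x and y is 0.5, yet the partial correlation is essentially zero once z is controlled for, which is the effect the paragraph above describes.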
In this embodiment, in the mapping equation between the index and the corresponding index factor, the index factor is an independent variable, and the index is a dependent variable.
In this embodiment, a stepwise regression method is used to establish the mapping relation equation between an index and its corresponding index factors: the index factor values corresponding to each index in the collected sample data (historical data) are input, and the SPSS modeling tool is used to establish the equation. Establishing the equation requires only the index factor parameter values from the collected historical data, so the requirements on the sample data are low, which overcomes the high sample data requirements of the topological model prediction method in gray prediction models. Moreover, the coefficients of the mapping relation equation for each insurance type (corresponding to the index) can be obtained from the index factor parameter values of that insurance type, so the method accommodates the variability of the index factors of different insurance types in different periods and is highly adaptable.
305. Respectively carrying out T test on the mapping relation equations to determine main index factors influencing the index access frequency;
In this embodiment, the t test is one of the significance tests in the multiple linear regression algorithm; under the ordinary least squares method, the F test may be equivalent to the t test.
In this embodiment, the mapping relation equation between each index and its index factors is further analyzed by partial correlation analysis to determine the main independent variables in the mapping relation, that is, the main index factors: many index factors affect the number of times an index is accessed in a specific period, and the main index factors are the principal ones. All main index factors are then retained in the mapping relation equation between the index and the index factors. An index factor whose partial correlation coefficient lies within a preset value interval and whose regression coefficient in the mapping relation equation is greater than the F test parameter or the t test parameter is taken as a main index factor.
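As a rough illustration of the significance test, the t statistic of the slope in a unary (single-factor) regression can be computed by hand. This sketch is an assumption about one way to screen factors, not the procedure of the embodiment: the function name, the data and the critical value are hypothetical (2.776 is the two-sided 5% critical value of the t distribution with 4 degrees of freedom, matching the 6 made-up samples).

```python
import math


def slope_t_statistic(factor, frequency):
    """Fit frequency = intercept + slope * factor and return (slope, t)."""
    n = len(factor)
    mean_a = sum(factor) / n
    mean_y = sum(frequency) / n
    s_aa = sum((a - mean_a) ** 2 for a in factor)
    s_ay = sum((a - mean_a) * (y - mean_y) for a, y in zip(factor, frequency))
    slope = s_ay / s_aa
    intercept = mean_y - slope * mean_a
    # Residual sum of squares with n - 2 degrees of freedom.
    rss = sum((y - (intercept + slope * a)) ** 2
              for a, y in zip(factor, frequency))
    std_err = math.sqrt(rss / (n - 2) / s_aa)
    return slope, slope / std_err


T_CRITICAL = 2.776  # assumed threshold: two-sided 5% level, 4 degrees of freedom

factor_values = [1, 2, 3, 4, 5, 6]
strong_frequency = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]  # closely follows the factor
weak_frequency = [5, 3, 6, 4, 5, 4]                  # barely related to the factor

_, t_strong = slope_t_statistic(factor_values, strong_frequency)
_, t_weak = slope_t_statistic(factor_values, weak_frequency)
is_main_factor = abs(t_strong) > T_CRITICAL
is_minor_factor = abs(t_weak) <= T_CRITICAL
```

A factor whose t statistic exceeds the critical value in absolute terms would be kept as a main index factor, while one that falls below it would be dropped from the equation.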
306. Calculating the access frequency of the index according to a linear regression algorithm, and judging whether the index is associated with a preset dimension table or not;
307. determining an index type of the index based on the access frequency, wherein the index type comprises an index of multi-dimensional aggregation and an index of a fixed dimension;
308. inquiring a model construction method corresponding to the index from a corresponding relation table between a preset index type and the model construction method based on the type of the index;
in this embodiment, according to the type of the index, a model construction method corresponding to the index type is queried from a preset correspondence table between the index type and the model construction method.
In this embodiment, dimensional modeling is used for indexes that must be associated with other dimension tables for calculation, while wide table modeling is used for indexes with fixed dimensions, that is, all fields are built in one table and no other table needs to be associated when the data are counted.
If the index is of the index type requiring multi-dimensional aggregation, constructing a random report and/or a semi-aggregated report by using dimensional modeling, and storing the random report and/or the semi-aggregated report to a non-aggregation engine and/or a semi-aggregation engine;
In this embodiment, if the index to be calculated is of the index type requiring multi-dimensional aggregation, that is, an index that can be calculated only by associating a plurality of dimension tables, dimensional modeling is used to construct a random report and/or a semi-aggregated report, which is stored in a non-aggregation engine and/or a semi-aggregation engine.
If the index is of the fixed-dimension index type, modeling by using a wide table, constructing an aggregated report, and storing the aggregated report to an aggregation engine;
In this embodiment, if the index to be calculated is of the fixed-dimension index type, that is, an index that can be calculated without associating a plurality of dimension tables, wide table modeling is used to construct an aggregated report, which is stored in an aggregation engine.
In this embodiment, wide table modeling stores indexes and dimensions in one large table. The data are divided into a fact table and dimension tables: the fact table records specific events, and all fields are built in the fact table, so no other table needs to be associated when the data are counted. The dimensions describe the events; separating the facts from the dimension tables improves flexibility and solves the corresponding problems. For example, for the index "premium of car insurance under Double 11 activities in 2018", all fields (indexes and dimensions) are stored in one large table, that is, a wide table, which is the aggregated report. The aggregated report is then stored in the corresponding aggregation engine.
309. Based on the index type, determining a storage calculation engine and a dimensional modeling mode corresponding to the index according to a corresponding relation table between a preset index type and the storage calculation engine and a corresponding relation table between the index type and the dimensional modeling mode of the index;
310. determining a preset dimension table associated with the index according to the dimension modeling mode, wherein the preset dimension table comprises a dimension table constructed based on the dimension modeling mode corresponding to the index type or a dimension table constructed based on all the dimension modeling modes;
If the index is an index requiring multidimensional aggregation, degrading the index and storing it in a random report or a semi-aggregated report;
In this embodiment, if the index is of the type requiring multidimensional aggregation, the index does not need to be calculated in advance. The index is degraded, that is, the index and its data are stored in a common calculation engine, which saves computing resources. Common calculation engines include non-aggregation engines and semi-aggregation engines.
Querying the storage calculation engine corresponding to the index requiring multidimensional aggregation and the preset dimension tables to be associated, and determining the dimension tables that need to be associated when the index requiring multidimensional aggregation is calculated;
In this embodiment, the storage calculation engine corresponding to the index is queried, and the dimension tables that need to be associated when the index is calculated are determined. If the index is of the type requiring multidimensional aggregation, that is, other dimension tables must be associated when it is calculated, the data tables to be associated are determined. For example, if the index is "premium in 2018", calculating it requires associating other dimension tables, such as a table of car insurance premiums in 2018, a table of property insurance premiums in 2018, and a table of life insurance premiums in 2018; these dimension tables are stored in the corresponding non-aggregation or semi-aggregation storage calculation engines.
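The "premium in 2018" example can be sketched as a lookup of the associated dimension tables followed by an aggregation at query time. This is a minimal sketch under stated assumptions: the dictionaries `ASSOCIATED_TABLES` and `ENGINE_TABLES`, the table names and the amounts are hypothetical stand-ins for the preset dimension tables and the engine contents.

```python
# Hypothetical mapping from an index to the dimension tables it must be associated with.
ASSOCIATED_TABLES = {
    "premium_2018": ["car_premium_2018", "property_premium_2018", "life_premium_2018"],
}

# Hypothetical contents of a non-aggregation or semi-aggregation engine.
ENGINE_TABLES = {
    "car_premium_2018": [300.0, 450.0],
    "property_premium_2018": [120.0],
    "life_premium_2018": [900.0, 100.0, 50.0],
}


def tables_to_associate(index_name):
    """Return the dimension tables that must be associated to calculate the index."""
    return ASSOCIATED_TABLES.get(index_name, [])


def compute_at_query_time(index_name):
    """Aggregate across every associated table only when the index is requested."""
    return sum(sum(ENGINE_TABLES[table]) for table in tables_to_associate(index_name))


total_premium_2018 = compute_at_query_time("premium_2018")
```

Nothing is calculated in advance here, which mirrors the degraded handling described above: the cost is paid only when the index is actually queried.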
If the index is a fixed-dimension index, storing all fields of the dimension in an aggregated report by using wide-table modeling;
In this embodiment, if the index is a fixed-dimension index, that is, no other dimension table is needed for aggregate calculation when the index is calculated, an index of this type may be stored in the aggregated report and calculated in advance. This saves index query (calculation) time and improves data processing efficiency.
Querying the storage calculation engine corresponding to the fixed-dimension index, and storing the aggregated report in an aggregation engine;
In this embodiment, if the index is of the fixed-dimension type, that is, other dimension tables do not need to be associated when the index is calculated, the index is stored in the aggregated report through wide-table modeling and the report is stored in the aggregation engine, so that the index is calculated in advance.
311. Calling the storage calculation engine through a routing decision engine to execute the preset dimension table, and calculating the index value corresponding to the index.
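Steps 309 to 311 can be pictured as two lookup tables followed by a dispatch. The sketch below is only one possible reading: the class `RoutingDecisionEngine`, the engine names, the stub engine functions and their values are all hypothetical, and choosing the semi-aggregation engine for multidimensional indexes is an assumption (the text also allows a non-aggregation engine).

```python
from dataclasses import dataclass
from typing import Callable, Dict

MULTI_DIM = "multi_dimensional_aggregation"
FIXED_DIM = "fixed_dimension"

# Correspondence table: index type -> storage calculation engine (names are assumptions).
TYPE_TO_ENGINE: Dict[str, str] = {
    MULTI_DIM: "semi_aggregation_engine",
    FIXED_DIM: "aggregation_engine",
}
# Correspondence table: index type -> modeling mode.
TYPE_TO_MODELING: Dict[str, str] = {
    MULTI_DIM: "dimensional_modeling",
    FIXED_DIM: "wide_table_modeling",
}


@dataclass(frozen=True)
class RoutingDecision:
    engine: str
    modeling: str


class RoutingDecisionEngine:
    """Hypothetical router that picks an engine by index type and calls it."""

    def __init__(self, engines: Dict[str, Callable[[str], float]]):
        self._engines = engines

    def decide(self, index_type: str) -> RoutingDecision:
        return RoutingDecision(TYPE_TO_ENGINE[index_type], TYPE_TO_MODELING[index_type])

    def calculate(self, index_name: str, index_type: str) -> float:
        decision = self.decide(index_type)
        return self._engines[decision.engine](index_name)


def aggregation_engine(index_name: str) -> float:
    # Stub: returns a value that was calculated in advance.
    return {"car_premium_double11_2018": 2000.0}[index_name]


def semi_aggregation_engine(index_name: str) -> float:
    # Stub: aggregates partial results at query time.
    return sum({"premium_2018": [750.0, 120.0, 1050.0]}[index_name])


router = RoutingDecisionEngine(
    {"aggregation_engine": aggregation_engine, "semi_aggregation_engine": semi_aggregation_engine}
)
fixed_value = router.calculate("car_premium_double11_2018", FIXED_DIM)
multi_value = router.calculate("premium_2018", MULTI_DIM)
```

Keeping the two correspondence tables as plain data means a new index type or engine can be added without changing the routing code.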
According to the technical scheme provided by the invention, data to be predicted is obtained and analyzed to construct indexes with multiple dimension attributes, and the calculation requirement of each index is determined from its access frequency as predicted by a linear regression algorithm. According to the calculation requirement of the index, an appropriate mode is selected to store the index in the corresponding storage calculation engine, and the index value of the index is calculated. This resolves the conflict between computation time and timeliness for fixed-dimension big data indexes, and solves the technical problem that only a single data engine and a single dimensional modeling mode can be used.
The big data index construction method in the embodiment of the present invention is described above. The big data index construction apparatus in the embodiment of the present invention is described below with reference to fig. 4. An embodiment of the big data index construction apparatus includes:
a first obtaining module 401, configured to obtain data to be predicted;
a first constructing module 402, configured to analyze the data to be predicted to construct a plurality of indexes carrying different dimensional attribute information;
the judging module 403 is configured to calculate an access frequency of the index according to a linear regression algorithm, and judge whether the index is associated with a preset dimension table;
a first determining module 404, configured to determine an index type of the index based on the access frequency, where the index type includes an index of multidimensional aggregation and an index of fixed dimension;
a second determining module 405, configured to determine, based on the index type, a storage calculation engine and a dimensional modeling manner corresponding to the index according to a correspondence table between a preset index type and the storage calculation engine and a correspondence table between the index type and the dimensional modeling manner of the index;
a third determining module 406, configured to determine a preset dimension table associated with the indicator according to the dimension modeling manner, where the preset dimension table includes a dimension table constructed based on a dimension modeling manner corresponding to the indicator type or a dimension table constructed based on all dimension modeling manners;
and the calculating module 407 is configured to invoke the storage calculation engine to execute the preset dimension table by using a routing decision engine, and calculate an index value corresponding to the index.
In this embodiment, data to be predicted is obtained and analyzed to construct indexes with multiple dimension attributes, and the calculation requirement of each index is determined from its access frequency as predicted by a linear regression algorithm. According to the calculation requirement of the index, an appropriate mode is selected to store the index in the corresponding storage calculation engine, and the index value of the index is calculated. This resolves the conflict between computation time and timeliness for fixed-dimension big data indexes, and solves the technical problem that only a single data engine and a single dimensional modeling mode can be used.
Referring to fig. 5, a second embodiment of the big data index construction apparatus according to the embodiment of the present invention includes:
a first obtaining module 501, configured to obtain data to be predicted;
a first construction module 502, configured to analyze the data to be predicted to construct a plurality of indexes carrying different dimensional attribute information;
the judging module 503 is configured to calculate an access frequency of the index according to a linear regression algorithm, and judge whether the index is associated with a preset dimension table;
a first determining module 504, configured to determine an index type of the index based on the access frequency, where the index type includes an index of multidimensional aggregation and an index of fixed dimension;
a second determining module 505, configured to determine, based on the index type, a storage calculation engine and a dimensional modeling manner corresponding to the index according to a correspondence table between a preset index type and the storage calculation engine and a correspondence table between the index type and the dimensional modeling manner of the index;
a third determining module 506, configured to determine a preset dimension table associated with the indicator according to the dimension modeling manner, where the preset dimension table includes a dimension table constructed based on a dimension modeling manner corresponding to the indicator type or a dimension table constructed based on all dimension modeling manners;
a calculation module 507, configured to invoke the storage calculation engine to execute the preset dimension table by using a routing decision engine, and calculate an index value corresponding to the index;
a second obtaining module 508, configured to obtain historical data including the index, where the historical data includes the index in a specific period, the number of times the index has been accessed in the specific period, and an index factor that affects the number of times the index has been accessed in the specific period;
an analysis module 509, configured to use the historical data as sample data, perform partial correlation analysis on the sample data, extract an index, and respectively establish a mapping equation between the index and a corresponding index factor;
a checking module 510, configured to perform a t-test on each of the mapping relation equations, and determine main index factors that affect the index access frequency;
a second query module 511, configured to query, based on the type of the index, a model construction method corresponding to the index from a correspondence table between preset index types and model construction methods;
a second constructing module 512, configured to, when the index is an index type index that needs multidimensional aggregation, use dimensional modeling to construct a stochastic report and/or a semi-aggregated report, and store the stochastic report and/or the semi-aggregated report to a non-aggregation engine and/or a semi-aggregation engine;
the first storage module 513 is configured to use a wide table for modeling when the index is an index type index with a fixed dimension, construct an aggregated report, and store the aggregated report to an aggregation engine;
the index degradation module 514 is used for degrading the index when the index is the index needing multi-dimensional aggregation and storing the index into a random report or a semi-aggregated report;
a fourth determining module 515, configured to query a storage calculation engine corresponding to the index requiring multidimensional aggregation and a preset dimension table required to be associated, and determine a dimension table required to be associated when the index type index requiring multidimensional aggregation is calculated;
the second storage module 516 is configured to, when the index is an index with a fixed dimension, store all fields in the dimension to the aggregated report by using wide table modeling;
and the third storage module 517 is configured to query a storage calculation engine corresponding to the fixed-dimension index type index, and store the aggregated report to an aggregation engine.
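The check performed by the checking module 510 can be illustrated with a t statistic on the slope of a one-factor linear fit. This is a simplified sketch, not the method of the embodiment: it replaces the partial correlation analysis with one simple regression per factor, and the factor names, the sample values and the choice of a two-sided 5% significance level are assumptions.

```python
import math


def slope_t_statistic(factor_values, access_counts):
    """Fit access_count = a + b * factor by least squares; return (b, t statistic of b)."""
    n = len(factor_values)
    mean_x = sum(factor_values) / n
    mean_y = sum(access_counts) / n
    sxx = sum((x - mean_x) ** 2 for x in factor_values)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(factor_values, access_counts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum(
        (y - (intercept + slope * x)) ** 2 for x, y in zip(factor_values, access_counts)
    )
    # Standard error of the slope with n - 2 degrees of freedom.
    standard_error = math.sqrt(residual_ss / (n - 2) / sxx)
    return slope, slope / standard_error


def main_index_factors(history, access_counts, critical_value):
    """Keep the factors whose slope is significant, i.e. |t| exceeds the critical value."""
    return [
        name
        for name, values in history.items()
        if abs(slope_t_statistic(values, access_counts)[1]) > critical_value
    ]


# Hypothetical history over five periods.
access_counts = [21, 39, 62, 78, 101]
history = {
    "factor_a": [1, 2, 3, 4, 5],
    "factor_b": [5, 3, 6, 4, 5],
}
# 3.182 is the two-sided 5% critical value of Student's t with 3 degrees of freedom.
main_factors = main_index_factors(history, access_counts, critical_value=3.182)
```

With these sample values only `factor_a` passes the test, so it would be treated as a main index factor.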
The first construction module 502 is specifically configured to:
analyzing the data to be predicted, and defining a plurality of indexes;
grading the indexes by using a preset model, and adding dimension attributes;
and combining the indexes and the dimension attributes based on the indexes and the dimension attributes to obtain indexes of a plurality of different dimension attributes.
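The combination step of the first construction module can be sketched as a Cartesian product of base indexes and dimension attribute values. The index names and dimension values below are hypothetical, and the grading of indexes by a preset model is omitted because the text does not describe that model.

```python
from itertools import product

# Hypothetical base indexes and dimension attributes.
base_indexes = ["premium", "claim_count"]
dimension_attributes = {
    "year": [2017, 2018],
    "product": ["car insurance", "life insurance"],
}


def build_indexes(indexes, dimensions):
    """Combine every base index with every combination of dimension attribute values."""
    names = sorted(dimensions)  # fixed order so the output is reproducible
    built = []
    for index in indexes:
        for values in product(*(dimensions[name] for name in names)):
            built.append({"index": index, "dimensions": dict(zip(names, values))})
    return built


constructed = build_indexes(base_indexes, dimension_attributes)
```

Two base indexes combined with two years and two products give eight indexes, each carrying its own dimension attribute information.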
The judging module 503 is specifically configured to:
determining main index factors influencing the index access frequency based on a linear regression algorithm;
establishing a mapping relation equation of the index and the main index factor, and predicting a parameter value of the main index factor by adopting an elastic coefficient method;
and substituting the parameter value of the index factor into the mapping relation equation, and calculating the access frequency of the index.
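The three operations of the judging module can be sketched numerically. The text does not give the formulas, so the sketch assumes one common formulation of the elastic coefficient method (elasticity as a ratio of growth rates, applied to an expected growth of a reference variable) and a single-factor linear mapping equation; all function names, coefficients and sample values are hypothetical.

```python
def elasticity_coefficient(factor_growth_rate, reference_growth_rate):
    """Elasticity: growth rate of the factor relative to that of a reference variable."""
    return factor_growth_rate / reference_growth_rate


def forecast_factor(current_value, elasticity, expected_reference_growth, periods=1):
    """Project the factor forward, assuming its growth is elasticity times the reference growth."""
    return current_value * (1.0 + elasticity * expected_reference_growth) ** periods


def predict_access_frequency(intercept, slope, factor_value):
    """Substitute the forecast factor into the mapping equation y = intercept + slope * x."""
    return intercept + slope * factor_value


# Hypothetical inputs: the factor grew 10% while the reference variable grew 5%.
elasticity = elasticity_coefficient(0.10, 0.05)
next_factor_value = forecast_factor(1000.0, elasticity, expected_reference_growth=0.05)
# Hypothetical regression coefficients for the mapping relation equation.
predicted_frequency = predict_access_frequency(5.0, 0.02, next_factor_value)
```

The predicted frequency is what the first determining module would then compare against the preset threshold.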
The first determining module 504 is specifically configured to:
if the access frequency of the index is greater than a preset threshold value and other dimension tables need to be associated when the index is calculated, the index is of the type requiring multidimensional aggregation;
and if the access frequency of the index is greater than the preset threshold value and other dimension tables do not need to be associated when the index is calculated, the index is of the fixed-dimension type.
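The two conditions above can be sketched as a small classification function. The type labels and the function name are hypothetical, and returning `None` for an index at or below the threshold is an assumption, since the text does not state how such indexes are typed.

```python
MULTI_DIM = "multi_dimensional_aggregation"
FIXED_DIM = "fixed_dimension"


def determine_index_type(access_frequency, needs_other_dimension_tables, threshold):
    """Classify an index from its predicted access frequency and its table dependencies."""
    if access_frequency <= threshold:
        # Assumption: the text does not cover indexes at or below the threshold.
        return None
    return MULTI_DIM if needs_other_dimension_tables else FIXED_DIM


index_type_a = determine_index_type(27.0, needs_other_dimension_tables=True, threshold=10.0)
index_type_b = determine_index_type(27.0, needs_other_dimension_tables=False, threshold=10.0)
```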
In this embodiment, data to be predicted is obtained and analyzed to construct indexes with multiple dimension attributes, and the calculation requirement of each index is determined from its access frequency as predicted by a linear regression algorithm. According to the calculation requirement of the index, an appropriate mode is selected to store the index in the corresponding storage calculation engine, and the index value of the index is calculated. This resolves the conflict between computation time and timeliness for fixed-dimension big data indexes, and solves the technical problem that only a single data engine and a single dimensional modeling mode can be used.
Fig. 4 and fig. 5 describe the big data index construction apparatus in the embodiment of the present invention in detail from the perspective of modular functional entities. The big data index construction device in the embodiment of the present invention is described in detail below from the perspective of hardware processing.
Fig. 6 is a schematic structural diagram of a big data index construction device according to an embodiment of the present invention. The big data index construction device 600 may vary considerably depending on its configuration or performance, and may include one or more central processing units (CPUs) 610 (e.g., one or more processors), a memory 620, and one or more storage media 630 (e.g., one or more mass storage devices) storing applications 633 or data 632. The memory 620 and the storage medium 630 may be transient or persistent storage. The program stored on the storage medium 630 may include one or more modules (not shown), each of which may include a series of instruction operations on the big data index construction device 600. Further, the processor 610 may be configured to communicate with the storage medium 630 and to execute, on the big data index construction device 600, the series of instruction operations in the storage medium 630, so as to implement the steps of the big data index construction method in the embodiments described above.
The big data index construction device 600 may also include one or more power supplies 640, one or more wired or wireless network interfaces 650, one or more input-output interfaces 660, and/or one or more operating systems 631, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. Those skilled in the art will appreciate that the configuration shown in fig. 6 does not limit the big data index construction device, which may include more or fewer components than shown, combine some components, or arrange the components differently.
The present invention also provides a big data index construction device, including: a memory having instructions stored therein and at least one processor, the memory and the at least one processor interconnected by a line; the at least one processor invoking the instructions in the memory to cause the big data index construction apparatus to perform the steps of the big data index construction method, comprising:
acquiring data to be predicted;
analyzing the data to be predicted, and constructing a plurality of indexes carrying different dimension attribute information;
calculating the access frequency of the index according to a linear regression algorithm, and judging whether the index is associated with a preset dimension table or not;
determining an index type of the index based on the access frequency, wherein the index type comprises an index of multi-dimensional aggregation and an index of a fixed dimension;
based on the index type, determining a storage calculation engine and a dimensional modeling mode corresponding to the index according to a corresponding relation table between a preset index type and the storage calculation engine and a corresponding relation table between the index type and the dimensional modeling mode of the index;
determining a preset dimension table associated with the index according to the dimension modeling mode, wherein the preset dimension table comprises a dimension table constructed based on the dimension modeling mode corresponding to the index type or a dimension table constructed based on all the dimension modeling modes;
and calling the storage calculation engine by using a routing decision engine to execute the preset dimension table, and calculating an index value corresponding to the index.
The present invention also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium, and may also be a volatile computer-readable storage medium, having stored therein instructions, which, when executed on a computer, cause the computer to perform the steps of the big data index construction method.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A big data index construction method is characterized by comprising the following steps:
acquiring data to be predicted;
analyzing the data to be predicted, and constructing a plurality of indexes carrying different dimension attribute information;
calculating the access frequency of the index according to a linear regression algorithm, and judging whether the index is associated with a preset dimension table or not;
determining an index type of the index based on the access frequency, wherein the index type comprises an index of multi-dimensional aggregation and an index of a fixed dimension;
based on the index type, determining a storage calculation engine and a dimensional modeling mode corresponding to the index according to a corresponding relation table between a preset index type and the storage calculation engine and a corresponding relation table between the index type and the dimensional modeling mode of the index;
determining a preset dimension table associated with the index according to the dimension modeling mode, wherein the preset dimension table comprises a dimension table constructed based on the dimension modeling mode corresponding to the index type or a dimension table constructed based on all the dimension modeling modes;
and calling the storage calculation engine by using a routing decision engine to execute the preset dimension table, and calculating an index value corresponding to the index.
2. The big data index construction method according to claim 1, wherein the analyzing the data to be predicted to construct a plurality of indexes carrying different dimensional attribute information comprises:
analyzing the data to be predicted, and defining a plurality of indexes;
grading the indexes by using a preset model, and adding dimension attributes;
and combining the indexes and the dimension attributes based on the indexes and the dimension attributes to obtain indexes of a plurality of different dimension attributes.
3. The big data index construction method according to claim 1, wherein before the calculating the access frequency of the index according to the linear regression algorithm and determining whether the index is associated with a preset dimension table, the method further comprises:
acquiring historical data containing the indexes, wherein the historical data comprises the indexes in a specific period, the access times of the indexes in the specific period, and index factors influencing the access times of the indexes in the specific period;
taking the historical data as sample data, performing partial correlation analysis on the sample data, extracting indexes, and respectively establishing a mapping relation equation of the indexes and corresponding index factors;
and respectively carrying out T test on the mapping relation equations to determine main index factors influencing the index access frequency.
4. The big data index construction method according to claim 1, wherein the calculating the access frequency of the index according to a linear regression algorithm and determining whether the index is associated with a preset dimension table comprises:
determining main index factors influencing the index access frequency based on a linear regression algorithm;
establishing a mapping relation equation of the index and the main index factor, and predicting a parameter value of the main index factor by adopting an elastic coefficient method;
and substituting the parameter value of the index factor into the mapping relation equation, and calculating the access frequency of the index.
5. The big data index construction method according to claim 1, wherein the determining the index type of the index based on the access frequency comprises:
if the access frequency of the index is greater than a preset threshold value and other dimension tables need to be associated when the index is calculated, the index is of the type requiring multidimensional aggregation;
and if the access frequency of the index is greater than the preset threshold value and other dimension tables do not need to be associated when the index is calculated, the index is of the fixed-dimension type.
6. The big data index construction method according to claim 1, further comprising, after the determining the index type of the index based on the access frequency:
inquiring a model construction method corresponding to the index from a corresponding relation table between a preset index type and a model construction method based on the index type;
if the index is of the type requiring multidimensional aggregation, constructing a random report and/or a semi-aggregated report by using dimensional modeling, and storing the random report and/or the semi-aggregated report in a non-aggregation engine and/or a semi-aggregation engine;
and if the index is of the fixed-dimension type, constructing an aggregated report by using wide-table modeling, and storing the aggregated report in an aggregation engine.
7. The big data index construction method according to claim 1, wherein after determining the storage calculation engine and the dimensional modeling mode corresponding to the index according to a preset correspondence table between the index type and the storage calculation engine and a correspondence table between the index type and the dimensional modeling mode of the index based on the index type, the method further comprises:
if the index is an index requiring multidimensional aggregation, degrading the index and storing it in a random report or a semi-aggregated report;
querying the storage calculation engine corresponding to the index requiring multidimensional aggregation and the preset dimension tables to be associated, and determining the dimension tables that need to be associated when the index requiring multidimensional aggregation is calculated;
if the index is the index of the fixed dimension, storing all fields in the dimension to an aggregated report by utilizing wide table modeling;
and querying a storage calculation engine corresponding to the index type index of the fixed dimension, and storing the aggregated report to an aggregation engine.
8. A big data index construction device, characterized in that the big data index construction device comprises:
the first acquisition module is used for acquiring data to be predicted;
the first construction module is used for analyzing the data to be predicted so as to construct a plurality of indexes carrying different dimension attribute information;
the judging module is used for calculating the access frequency of the index according to a linear regression algorithm and judging whether the index is associated with a preset dimension table or not;
a first determination module that determines an index type of the index based on the access frequency, wherein the index type includes an index of multi-dimensional aggregation and an index of fixed dimension;
the second determination module is used for determining a storage calculation engine and a dimensional modeling mode corresponding to the index according to a corresponding relation table between a preset index type and a storage calculation engine and a corresponding relation table between the index type and the dimensional modeling mode of the index based on the index type;
the third determining module is used for determining a preset dimension table associated with the index according to the dimension modeling mode, wherein the preset dimension table comprises a dimension table constructed based on the dimension modeling mode corresponding to the index type or a dimension table constructed based on all the dimension modeling modes;
and the calculation module is used for calling the storage calculation engine to execute the preset dimension table by utilizing a routing decision engine, and calculating the index value corresponding to the index.
9. A big data index construction device, comprising: a memory having instructions stored therein and at least one processor, the memory and the at least one processor interconnected by a line; the at least one processor invokes the instructions in the memory to cause the big data index construction apparatus to perform the big data index construction method of any of claims 1-7.
10. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the big data index construction method according to any of claims 1-7.
CN202010714909.9A 2020-07-23 2020-07-23 Big data index construction method, device, equipment and storage medium Withdrawn CN111859299A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010714909.9A CN111859299A (en) 2020-07-23 2020-07-23 Big data index construction method, device, equipment and storage medium
PCT/CN2020/131753 WO2021139427A1 (en) 2020-07-23 2020-11-26 Big data index construction method, apparatus and device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010714909.9A CN111859299A (en) 2020-07-23 2020-07-23 Big data index construction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111859299A true CN111859299A (en) 2020-10-30

Family

ID=72950832

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010714909.9A Withdrawn CN111859299A (en) 2020-07-23 2020-07-23 Big data index construction method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN111859299A (en)
WO (1) WO2021139427A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990669A (en) * 2021-02-24 2021-06-18 平安健康保险股份有限公司 Product data analysis method and device, computer equipment and storage medium
WO2021139427A1 (en) * 2020-07-23 2021-07-15 平安科技(深圳)有限公司 Big data index construction method, apparatus and device, and storage medium
CN113420096A (en) * 2021-06-22 2021-09-21 平安科技(深圳)有限公司 Index system construction method, device, equipment and storage medium
CN117520624A (en) * 2024-01-05 2024-02-06 青岛海信信息科技股份有限公司 Configuration and calculation method and device for big data index

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104408179B (en) * 2014-12-15 2018-11-06 北京国双科技有限公司 Data processing method and device in tables of data
US10824607B2 (en) * 2016-07-21 2020-11-03 Ayasdi Ai Llc Topological data analysis of data from a fact table and related dimension tables
CN107918600B (en) * 2017-11-15 2021-11-23 泰康保险集团股份有限公司 Report development system and method, storage medium and electronic equipment
CN109325648A (en) * 2018-06-29 2019-02-12 深圳市彬讯科技有限公司 Multi-dimensional data stream statistics method, server and storage medium based on index
CN111859299A (en) * 2020-07-23 2020-10-30 平安科技(深圳)有限公司 Big data index construction method, device, equipment and storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021139427A1 (en) * 2020-07-23 2021-07-15 平安科技(深圳)有限公司 Big data index construction method, apparatus and device, and storage medium
CN112990669A (en) * 2021-02-24 2021-06-18 平安健康保险股份有限公司 Product data analysis method and device, computer equipment and storage medium
CN113420096A (en) * 2021-06-22 2021-09-21 平安科技(深圳)有限公司 Index system construction method, device, equipment and storage medium
CN113420096B (en) * 2021-06-22 2024-05-10 平安科技(深圳)有限公司 Index system construction method, device, equipment and storage medium
CN117520624A (en) * 2024-01-05 2024-02-06 青岛海信信息科技股份有限公司 Configuration and calculation method and device for big data index
CN117520624B (en) * 2024-01-05 2024-04-12 青岛海信信息科技股份有限公司 Configuration and calculation method and device for big data index

Also Published As

Publication number Publication date
WO2021139427A1 (en) 2021-07-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20201030)