CN112966808A - Data analysis method, device, server and readable storage medium - Google Patents

Data analysis method, device, server and readable storage medium

Info

Publication number
CN112966808A
CN112966808A (application CN202110098658.0A)
Authority
CN
China
Prior art keywords
data
calling
sequence
value
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110098658.0A
Other languages
Chinese (zh)
Inventor
唐玏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
MIGU Music Co Ltd
MIGU Culture Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
MIGU Music Co Ltd
MIGU Culture Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, MIGU Music Co Ltd, MIGU Culture Technology Co Ltd
Priority to CN202110098658.0A
Publication of CN112966808A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention relate to the field of data analysis and disclose a data analysis method, apparatus, server and readable storage medium. In the invention, data table call flow sequences are obtained; new call sequences of consistent length are generated from the call flow sequences; the new call sequences are input into a preset neural network for training to obtain a relationship matrix; the relationship matrix is processed to obtain relationship values between the data tables; and a cold-hot relationship heat map is constructed from the relationship values between the data tables for data analysis. The call relationships between data tables are thereby determined more accurately, improving the accuracy of the analysis conclusions.

Description

Data analysis method, device, server and readable storage medium
Technical Field
Embodiments of the invention relate to the field of data analysis, and in particular to a data analysis method, apparatus, server and readable storage medium.
Background
In a data warehouse, raw data is generally synchronized from the databases of the individual business systems. This raw data forms the source-layer data tables of the warehouse and is the foundation on which the upper-layer data models are subsequently built. To manage data assets well, the hot-cold degree of the source-layer data tables must be assessed, and the hottest (i.e., most important) tables given priority monitoring.
In the related art, how hot or cold a data table is is generally judged from how often the table is called during data processing. A simple call-count statistic, however, is too crude to reflect the table's true hotness accurately, and may lead to errors when judging the table's importance.
Disclosure of Invention
Embodiments of the invention aim to provide a data analysis method, apparatus, server and readable storage medium that determine the call relationships between data tables more accurately and improve the accuracy of the analysis conclusions.
In order to solve the above technical problem, an embodiment of the present invention provides a data analysis method, including the following steps:
obtaining data table call flow sequences;
generating, from the call flow sequences, new call sequences of consistent length;
inputting the new call sequences into a preset neural network for training to obtain a relationship matrix;
processing the relationship matrix to obtain relationship values between data tables;
and constructing a cold-hot relationship heat map from the relationship values between the data tables for data analysis.
An embodiment of the present invention also provides a data analysis apparatus, including:
an obtaining module, configured to obtain data table call flow sequences;
a sequence processing module, configured to generate new call sequences of consistent length from the call flow sequences;
a network training module, configured to input the new call sequences into a preset neural network for training to obtain a relationship matrix;
a relationship calculation module, configured to process the relationship matrix to obtain relationship values between data tables;
and an analysis module, configured to construct a cold-hot relationship heat map from the relationship values between the data tables for data analysis.
An embodiment of the present invention further provides a server, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any of the data analysis methods above.
Embodiments of the present invention also provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements any of the data analysis methods described above.
Compared with the related art, embodiments of the invention do not simply take the call count as the basis for judging the relationship between data tables; instead the relationship is obtained through a preset neural network, which is more accurate than the traditional closeness judgment and avoids the case where a table is called often yet is not actually critical. A heat map is formed from the relationship values between the data tables, and the closeness of the relationships between the tables can be read directly from it, making the data analysis clearer and more intuitive. The closeness of the call relationships between data tables is quantified mathematically, and the relationship values between the tables are obtained by computation. A relationship value reflects not only how often the tables are called but, more importantly, how closely they co-occur; the heat map therefore reflects the hotness of the data tables more accurately and provides a basis for subsequent priority monitoring and safeguarding.
In addition, in the data analysis method according to an embodiment of the invention, after inputting the new call sequences into the preset neural network for training, the method further includes: reducing the loss value of the neural network by stochastic gradient descent.
In addition, in the data analysis method according to an embodiment of the invention, inputting the new call sequences into the preset neural network for training to obtain the relationship matrix includes: when the loss value is smaller than a preset loss threshold, taking the matrix generated by the neural network as the relationship matrix.
In addition, in the data analysis method according to an embodiment of the invention, processing the relationship matrix to obtain the relationship values between the data tables includes: performing dot-product operations on the vectors in the relationship matrix and taking the results as the relationship values, wherein each vector in the relationship matrix corresponds to one data table.
In addition, in the data analysis method according to an embodiment of the invention, obtaining the loss value includes: processing the data tables other than the intermediate data table of a new call sequence to generate a predicted value for the intermediate data table; and inputting the predicted value and the encoded value of the intermediate data table into a loss function to generate the loss value.
In addition, in the data analysis method according to an embodiment of the invention, the relationship value represents how close the call relationship between the data tables is.
In addition, in the data analysis method according to an embodiment of the invention, generating the new call sequences of consistent length from the data table call flow sequences includes: applying a sliding-window algorithm to the call flow sequences to generate the new call sequences, wherein the length of each new call sequence is a preset sliding-window value.
Drawings
One or more embodiments are illustrated by the figures in the accompanying drawings, in which like reference numerals denote similar elements; unless otherwise specified, the figures are not to scale.
FIG. 1 is a flow chart of the data analysis method provided by the first embodiment of the present invention;
FIG. 2 is a heat map in the data analysis method provided by the first embodiment of the present invention;
FIG. 3 is the matrix C in the data analysis method provided by the second embodiment of the present invention;
FIG. 4 is the matrix C2 in the data analysis method provided by the second embodiment of the present invention;
FIG. 5 is a matrix in the data analysis method provided by the second embodiment of the present invention;
FIG. 6 is a schematic structural diagram of the data analysis apparatus provided by the third embodiment of the present invention;
FIG. 7 is a schematic structural diagram of the server provided by the fourth embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that many technical details are given in the embodiments to help the reader understand the present application; the claimed technical solution can, however, be implemented without these details, and with various changes and modifications based on the following embodiments.
The following division into embodiments is for convenience of description only and does not limit the specific implementations of the invention; the embodiments may be combined and cross-referenced with one another where not contradictory.
A first embodiment of the present invention relates to a data analysis method. The specific flow is shown in FIG. 1.
Step 101: obtain data table call flow sequences.
A data warehouse stores a large number of data tables that call one another during the warehouse's processing flows; the call relationships between the tables are extracted from each data processing flow. For example, if a data processing flow needs three source-layer data tables A, B and C, the sequence A-B-C indicates that in this flow data table A is called first, then data table B, and finally data table C. The number of call flow sequences equals the number of data call flows.
Step 102: generate, from the data table call flow sequences, new call sequences of consistent length.
A sliding-window algorithm is applied to the call flow sequences to generate the new call sequences; the length of each new call sequence is a preset sliding-window value.
Specifically, the sliding-window value is set to an odd number so that the intermediate data table of each new call sequence can be determined.
The sliding-window computation is applied to each call flow sequence using this window value, yielding the new call sequences.
For example, given the data table call flow sequences A-B-C, A-D-E-G-J and B-C-F-H, where A, B, C, D, E, F, G, H and J are source-layer data tables, setting the sliding-window value to 3 yields the new call sequences A-B-C, A-D-E, D-E-G, E-G-J, and so on.
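The sliding-window step above can be sketched as follows. This is a minimal illustration: the function name is invented, and keeping flows shorter than the window whole is an assumption, since the text does not specify how such flows are handled.

```python
def sliding_windows(seq, window=3):
    """Split one call flow sequence into new call sequences of fixed length."""
    if len(seq) <= window:
        return [seq[:]]          # assumption: short flows are kept whole
    return [seq[i:i + window] for i in range(len(seq) - window + 1)]

# the three example call flow sequences from the text
flows = [["A", "B", "C"],
         ["A", "D", "E", "G", "J"],
         ["B", "C", "F", "H"]]
new_sequences = [w for flow in flows for w in sliding_windows(flow, 3)]
# new_sequences: [['A','B','C'], ['A','D','E'], ['D','E','G'], ['E','G','J'],
#                 ['B','C','F'], ['C','F','H']]
```

With an odd window value, each resulting sub-sequence has a well-defined middle table, which step 103 uses as the prediction target.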
Step 103: input the new call sequences into a preset neural network for training to obtain a relationship matrix.
The data tables in the new call sequences are one-hot encoded to obtain data table vectors: in each vector one dimension is 1 and the remaining dimensions are 0. For example, with M data tables and M = 5, a data table vector may be (1,0,0,0,0), (0,1,0,0,0), (0,0,1,0,0), (0,0,0,1,0) or (0,0,0,0,1).
A matrix C and a matrix C2 are randomly initialized with sizes M×K and K×M respectively; their entries are random values between 0 and 1.
The intermediate data table of a new call sequence is designated as the value to predict, and the tables around it as the input values; that is, the intermediate data table is predicted from the other data tables in the sliding window.
Taking the new call sequence A-B-C as an example, B corresponds to the predicted value while A and C correspond to the input values.
The input data tables of the new call sequence are one-hot encoded, each code is multiplied by the matrix C, and the resulting vectors are averaged. The average is multiplied by the matrix C2, and a softmax function is applied to the product to generate the intermediate data table's predicted value.
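This forward pass — one-hot codes multiplied by the matrix C, averaged, projected by the matrix C2, then passed through softmax — can be sketched with NumPy. The matrix sizes and the random initialization in [0, 1) follow the text; the table indices and K = 3 are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 5, 3                       # number of data tables, embedding dimension
C = rng.random((M, K))            # matrix C, random values in [0, 1)
C2 = rng.random((K, M))           # matrix C2

def one_hot(i, m=M):
    v = np.zeros(m)
    v[i] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())       # shift by max for numerical stability
    return e / e.sum()

# context tables A (index 0) and C (index 2) predict the middle table B:
avg = (one_hot(0) @ C + one_hot(2) @ C) / 2   # average of the projected codes
predicted = softmax(avg @ C2)                  # distribution over the M tables
```

The output is a probability distribution over all M tables; the component at the middle table's index is compared against its one-hot code by the loss function in the next step.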
The predicted value of the intermediate data table and the one-hot code of the intermediate data table are input into a cross-entropy loss function to generate a loss value.
Then, following the stochastic-gradient-descent training method, the neural network updates the values of the matrix C and the matrix C2 by backpropagation so that the loss value keeps decreasing.
When the loss value falls below the preset loss threshold, the neural network training is considered to have converged, and the now relatively stable matrix generated by the network is the relationship matrix.
In a specific example with the new call sequence A-B-C and M = 5, the one-hot code of data table A is (1,0,0,0,0), that of data table B is (0,1,0,0,0), and that of data table C is (0,0,1,0,0).
Multiplying A's code (1,0,0,0,0) by the matrix C (M×K) yields (0.1,0.3,0.5,0.2,0.1), and multiplying data table C's code (0,0,1,0,0) by the matrix C yields (0.9,0.7,0.3,0.8,0.6);
averaging (0.1,0.3,0.5,0.2,0.1) and (0.9,0.7,0.3,0.8,0.6) yields (0.5,0.5,0.4,0.5,0.35);
multiplying (0.5,0.5,0.4,0.5,0.35) by the matrix C2 (K×M) and applying the softmax function yields the intermediate data table's predicted value (0.2,0.9,0.2,0.3,0.1).
Computing the cross-entropy loss between (0.2,0.9,0.2,0.3,0.1) and B's code (0,1,0,0,0) yields a loss value (e.g., 1.2).
All the data in the new call sequences are fed repeatedly into the neural network in random order for training. Following the stochastic-gradient-descent training method, the network updates the matrix C and the matrix C2 by backpropagation, so their values are continually refined until the loss value settles at a relatively stable, small value (e.g., 0.001). When this value is smaller than the preset loss threshold, training is considered to have converged; the matrix generated by the network is now stable and is taken as the relationship matrix.
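The training loop described above is, in structure, a CBOW-style network trained by stochastic gradient descent with backpropagation. A minimal sketch under assumed data follows — the table names, window contents, learning rate and epoch count are all illustrative, not from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
tables = ["A", "B", "C", "D", "E"]
idx = {t: i for i, t in enumerate(tables)}
# assumed new call sequences of length 3 (middle table is the target)
windows = [["A", "B", "C"], ["A", "D", "E"], ["D", "E", "C"]]

M, K = len(tables), 4
C = rng.normal(scale=0.1, size=(M, K))     # input-side matrix C
C2 = rng.normal(scale=0.1, size=(K, M))    # output-side matrix C2
lr = 0.2                                   # assumed learning rate

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(window):
    mid = len(window) // 2
    target = idx[window[mid]]
    context = [idx[t] for i, t in enumerate(window) if i != mid]
    hidden = C[context].mean(axis=0)       # average of context embeddings
    return target, context, hidden, softmax(hidden @ C2)

for _ in range(500):
    for j in rng.permutation(len(windows)):   # random order, as in the text
        target, context, hidden, pred = forward(windows[j])
        dz = pred.copy()
        dz[target] -= 1.0                     # gradient of softmax cross-entropy
        dh = C2 @ dz
        C2 -= lr * np.outer(hidden, dz)       # backpropagation updates
        for c in context:
            C[c] -= lr * dh / len(context)

# after training, the loss on each window should be small
losses = []
for w in windows:
    target, _, _, pred = forward(w)
    losses.append(-np.log(pred[target]))
```

When the loss stabilizes below a threshold, the matrix C plays the role of the relationship matrix: each row is one table's dense vector.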
Step 104: process the relationship matrix to obtain relationship values between the data tables.
The vectors in the relationship matrix are dot-multiplied pairwise after normalization (i.e., as a cosine similarity), and the result represents the relationship between the two vectors: each relationship value is greater than -1 and smaller than 1, and the closer it is to 1 (i.e., the smaller its difference from 1), the closer the call relationship between the data tables. For example, the normalized dot product of A (0.83,0.55,0.91,0.96,0.71) and B (0.56,0.74,0.53,0.62,0.52) is 0.96; since 0.96 is close to 1, the relationship between data table A and data table B can be considered very close.
Step 105: construct a cold-hot relationship heat map from the relationship values between the data tables for data analysis.
The pairwise dot-product results (the relationship values) form a cold-hot relationship heat map, as shown in FIG. 2; the diagonal is each data table's association with itself and is always 1. A lighter color in the heat map indicates a closer call relationship. The call relationships between the data tables are analyzed through the relationship values and the heat map, and the light-colored data tables can be given priority monitoring and safeguarding, which facilitates data maintenance.
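Building the full pairwise matrix behind such a heat map can be sketched as follows. The first two rows reuse the example vectors from step 104; the third row is an invented vector included only for contrast.

```python
import numpy as np

def relation_matrix(emb):
    """Pairwise relationship values (normalized dot products) for all table
    vectors; the diagonal — each table against itself — is always 1."""
    emb = np.asarray(emb, dtype=float)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return unit @ unit.T

emb = np.array([[0.83, 0.55, 0.91, 0.96, 0.71],   # data table A
                [0.56, 0.74, 0.53, 0.62, 0.52],   # data table B
                [0.10, 0.90, 0.20, 0.10, 0.30]])  # a third, less related table
R = relation_matrix(emb)
# R can be rendered as the cold-hot heat map, e.g. with matplotlib's imshow().
```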
Compared with the related art, embodiments of the invention do not simply take the call count as the basis for judging the relationship between data tables; instead the relationship is obtained through a preset neural network, which is more accurate than the traditional closeness judgment and avoids the case where a table is called often yet is not actually critical. A heat map is formed from the relationship values between the data tables, and the closeness of the relationships between the tables can be read directly from it, making the data analysis clearer and more intuitive. The closeness of the call relationships between data tables is quantified mathematically, and the relationship values between the tables are obtained by computation. A relationship value reflects not only how often the tables are called but, more importantly, how closely they co-occur; the heat map therefore reflects the hotness of the data tables more accurately and provides a basis for subsequent priority monitoring and safeguarding.
The steps of the above method are divided only for clarity of description; in implementation they may be combined into a single step, or a step may be split into several steps — all fall within the scope of this patent as long as the same logical relationship is preserved. Adding insignificant modifications to the algorithm or flow, or introducing insignificant design changes without altering the core design, also falls within the scope of the patent.
A second embodiment of the present invention relates to a data analysis method. Assume the number of data tables M is 10 and the data table vector dimension K is 5; that is, there are 10 data tables and each is represented by a 5-dimensional vector. The matrix C is initialized as a 10×5 matrix of random values between 0 and 1, as shown in FIG. 3. The matrix C2 is randomly initialized as a 5×10 matrix, as shown in FIG. 4.
The first row of the new call sequences, A-B-C, is obtained, and the input values A and C are fed to the neural network.
Assume A's one-hot code is [1,0,0,0,0,0,0,0,0,0] and C's one-hot code is [0,0,1,0,0,0,0,0,0,0]. Following the structure of the neural network, multiplying A's one-hot code by the matrix C yields [0.59,0.29,0.77,0.01,0.23];
multiplying C's one-hot code by the matrix C yields [0.24,0.8,0.17,0.3,0.47];
averaging the two vectors yields [0.42,0.54,0.47,0.16,0.35];
this vector is then multiplied by the matrix C2 to yield [1.02,0.94,1.06,1.31,0.8,0.89,0.87,0.83,0.88,0.79];
inputting that vector into the softmax function yields [0.11,0.1,0.11,0.14,0.09,0.09,0.09,0.09,0.09,0.09], the predicted value of the intermediate data table;
B's one-hot code [0,1,0,0,0,0,0,0,0,0] and the intermediate data table's predicted value are then input into the cross-entropy loss function to obtain the loss value.
Then, following the stochastic-gradient-descent training method, the neural network updates the values of the matrix C and the matrix C2 by backpropagation so that the loss value keeps decreasing.
Further, all the data in the new call sequences are fed repeatedly into the neural network in random order for training, so the values of the matrix C and the matrix C2 are continually updated until the loss value settles at a relatively stable, small value, at which point the training has converged.
After training converges, a stable matrix C is obtained. This matrix is the dense vector representation of the M data tables: each table is a K-dimensional vector, and these vectors encode the call relationship information between the data tables. The matrix is shown in FIG. 5.
In FIG. 5, each row is a vector of K = 5 dimensions representing one data table; the 10 rows represent the 10 data tables, including data table A, data table B and data table C.
The relationship value between each pair of data tables is computed by the dot product between their vectors, yielding the pairwise call-relationship matrix of all the data tables; the corresponding heat map is shown in FIG. 2.
Compared with the related art, this embodiment trains the neural network on the data table call sequences already present in the data warehouse and finally obtains a K-dimensional vector representation of each data table that encodes how close the call relationships between the tables are. Operating on these K-dimensional vectors yields a heat map over all the data tables, which serves as the basis for subsequent priority monitoring and safeguarding.
The steps of the above method are divided only for clarity of description; in implementation they may be combined into a single step, or a step may be split into several steps — all fall within the scope of this patent as long as the same logical relationship is preserved. Adding insignificant modifications to the algorithm or flow, or introducing insignificant design changes without altering the core design, also falls within the scope of the patent.
A third embodiment of the present invention relates to a data analysis apparatus, as shown in FIG. 6, including:
an obtaining module 601, configured to obtain data table call flow sequences;
a sequence processing module 602, configured to generate new call sequences of consistent length from the call flow sequences;
a network training module 603, configured to input the new call sequences into a preset neural network for training to obtain a relationship matrix;
a relationship calculation module 604, configured to process the relationship matrix to obtain relationship values between data tables;
and an analysis module 605, configured to construct a cold-hot relationship heat map from the relationship values between the data tables for data analysis.
It should be understood that this embodiment is the apparatus embodiment corresponding to the first embodiment and can be implemented in cooperation with it. The technical details mentioned in the first embodiment remain valid here and are not repeated, to reduce repetition; conversely, the technical details mentioned in this embodiment also apply to the first embodiment.
It should be noted that the modules in this embodiment are logical modules. In practice a logical unit may be a single physical unit, part of a physical unit, or a combination of several physical units. Moreover, to highlight the innovative part of the invention, this embodiment omits units that are less relevant to solving the stated technical problem — which does not mean no other units exist.
A fourth embodiment of the present invention relates to a server, as shown in fig. 7, including:
at least one processor 701; and a memory 702 communicatively coupled to the at least one processor 701;
wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any of the data analysis methods above.
The memory and the processor are connected by a bus, which may comprise any number of interconnected buses and bridges linking the various circuits of the one or more processors and the memory. The bus may also link various other circuits, such as peripherals, voltage regulators and power-management circuits, which are well known in the art and are not described further here. A bus interface provides the interface between the bus and the transceiver. The transceiver may be a single element or multiple elements, such as several receivers and transmitters, providing the means for communicating with various other apparatus over a transmission medium. Data processed by the processor is transmitted over a wireless medium via an antenna, and the antenna also receives data and delivers it to the processor.
The processor is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfacing, voltage regulation, power management and other control functions. The memory may store data used by the processor in performing operations.
Those skilled in the art will understand that all or part of the steps of the above embodiments may be implemented by a program instructing the relevant hardware. The program is stored in a storage medium and includes several instructions that enable a device (such as a single-chip computer or a chip) or a processor to execute all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
Those of ordinary skill in the art will understand that the above embodiments are specific examples of carrying out the invention, and that in practice various changes may be made to them in form and detail without departing from the spirit and scope of the invention.

Claims (10)

1. A data analysis method, comprising:
obtaining data table call flow sequences;
generating, from the call flow sequences, new call sequences of consistent length;
inputting the new call sequences into a preset neural network for training to obtain a relationship matrix;
processing the relationship matrix to obtain relationship values between data tables;
and constructing a cold-hot relationship heat map from the relationship values between the data tables for data analysis.
2. The data analysis method of claim 1, wherein after inputting the new call sequences into the preset neural network for training, the method further comprises:
reducing the loss value of the neural network by stochastic gradient descent.
3. The data analysis method of claim 1 or 2, wherein inputting the new call sequences into the preset neural network for training to obtain the relationship matrix comprises:
when the loss value is smaller than a preset loss threshold, taking the matrix generated by the neural network as the relationship matrix.
4. The data analysis method of claim 1, wherein processing the relationship matrix to obtain the relationship values between data tables comprises:
performing dot-product operations on the vectors in the relationship matrix and taking the results as the relationship values, wherein each vector in the relationship matrix corresponds to one data table.
5. The data analysis method of claim 2, wherein obtaining a loss value comprises:
processing the data tables in the new calling sequence other than the intermediate data table, to generate a predicted value of the intermediate data table;
and inputting the predicted value of the intermediate data table and the value coded by the intermediate data table into a loss function to generate the loss value.
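Claim 5 mirrors a continuous-bag-of-words setup: the tables surrounding the middle of a window predict the middle (intermediate) table, and the loss compares that prediction with the intermediate table's encoding. Cross-entropy against a one-hot encoding is an assumed choice here; the claim does not name a specific loss function or encoding:

```python
import numpy as np

def loss_value(predicted_probs, one_hot_target):
    """Cross-entropy between the predicted distribution and the encoded target."""
    return -float(np.sum(one_hot_target * np.log(predicted_probs)))

predicted = np.array([0.7, 0.2, 0.1])  # network's prediction over 3 tables
encoded = np.array([1.0, 0.0, 0.0])    # intermediate table, one-hot encoded
lv = loss_value(predicted, encoded)    # equals -log(0.7), about 0.357
```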
6. The data analysis method of claim 1, wherein the relationship value represents the closeness of the calling relationship between data tables.
7. The data analysis method of claim 1, wherein the generating a new calling sequence of consistent sequence length according to the data table calling flow sequence comprises:
performing sliding-window processing on the data table calling flow sequence to generate the new calling sequence, wherein the sequence length of the new calling sequence is a preset sliding-window value.
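The sliding-window step of claim 7 turns calling flow sequences of differing lengths into new calling sequences that all share the preset window length. The table names and the handling of sequences shorter than the window are assumptions for illustration:

```python
def sliding_window(flow, window):
    """Cut a calling flow sequence into consecutive windows of equal length."""
    if len(flow) < window:
        return []  # assumption: flows shorter than the window yield nothing
    return [flow[i:i + window] for i in range(len(flow) - window + 1)]

flows = [["t1", "t2", "t3", "t4"], ["t2", "t5", "t2"]]
new_sequences = [w for f in flows for w in sliding_window(f, 3)]
# every new calling sequence has length 3, the preset sliding-window value
```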
8. A data analysis apparatus, comprising:
the acquisition module is used for acquiring a data table calling flow sequence;
the sequence processing module is used for generating a new calling sequence of consistent sequence length according to the data table calling flow sequence;
the network training module is used for inputting the new calling sequence into a preset neural network for training to obtain a relation matrix;
the relation calculation module is used for processing the relation matrix to obtain a relationship value between data tables;
and the analysis module is used for constructing a cold/hot relationship heatmap according to the relationship values among the data tables, so as to analyze the data.
9. A server, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the data analysis method of any one of claims 1-7.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the data analysis method of any one of claims 1 to 7.
CN202110098658.0A 2021-01-25 2021-01-25 Data analysis method, device, server and readable storage medium Pending CN112966808A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110098658.0A CN112966808A (en) 2021-01-25 2021-01-25 Data analysis method, device, server and readable storage medium

Publications (1)

Publication Number Publication Date
CN112966808A true CN112966808A (en) 2021-06-15

Family

ID=76271668

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110098658.0A Pending CN112966808A (en) 2021-01-25 2021-01-25 Data analysis method, device, server and readable storage medium

Country Status (1)

Country Link
CN (1) CN112966808A (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020107858A1 (en) * 2000-07-05 2002-08-08 Lundahl David S. Method and system for the dynamic analysis of data
US20190065986A1 (en) * 2017-08-29 2019-02-28 International Business Machines Corporation Text data representation learning using random document embedding
CN109658455A (en) * 2017-10-11 2019-04-19 阿里巴巴集团控股有限公司 Image processing method and processing equipment
CN109919744A (en) * 2018-11-23 2019-06-21 阿里巴巴集团控股有限公司 Detection method neural network based and device
CN109948646A (en) * 2019-01-24 2019-06-28 西安交通大学 A kind of time series data method for measuring similarity and gauging system
CN110070066A (en) * 2019-04-30 2019-07-30 福州大学 A kind of video pedestrian based on posture key frame recognition methods and system again
WO2019174422A1 (en) * 2018-03-16 2019-09-19 北京国双科技有限公司 Method for analyzing entity association relationship, and related apparatus
US20200234153A1 (en) * 2017-03-21 2020-07-23 Choral Systems, Llc Data analysis and visualization using structured data tables and nodal networks
CN111611236A (en) * 2020-05-28 2020-09-01 宁波和利时智能科技有限公司 Data analysis method and system
CN111611274A (en) * 2020-05-28 2020-09-01 华中科技大学 Database query optimization method and system
CN112101530A (en) * 2020-11-10 2020-12-18 南京集成电路产业服务中心有限公司 Neural network training method, device, equipment and storage medium


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HAN C et al.: "Research on commercial logistics inventory forecasting system based on neural network", Neural Computing and Applications, vol. 33, 23 June 2020 (2020-06-23), pages 691-706, XP037358529, DOI: 10.1007/s00521-020-05090-4 *
WANG YINGQI: "Research on Keyword Query Technology for Relational Databases", China Doctoral Dissertations Full-text Database (Information Science and Technology), no. 01, 15 January 2020 (2020-01-15), pages 138-47 *
GAO JUNTAO et al.: "Efficient mining method for adjacency relations of event logs in relational databases", Computer Integrated Manufacturing Systems, vol. 26, no. 06, 15 June 2020 (2020-06-15), pages 1492-1499 *

Similar Documents

Publication Publication Date Title
WO2022037337A1 (en) Distributed training method and apparatus for machine learning model, and computer device
Sun et al. Long-term spectrum state prediction: An image inference perspective
CN111831675A (en) Storage model training method and device, computer equipment and storage medium
Li et al. Linear programming-based scenario reduction using transportation distance
Zhan et al. An adaptive parallel learning dependent Kriging model for small failure probability problems
CN112330048A (en) Scoring card model training method and device, storage medium and electronic device
Arjona et al. Fast fuzzy anti‐collision protocol for the RFID standard EPC Gen‐2
CN116681104A (en) Model building and realizing method of distributed space diagram neural network
Boualem Insensitive bounds for the stationary distribution of a single server retrial queue with server subject to active breakdowns
Su et al. Parameter estimation from interval-valued data using the expectation-maximization algorithm
CN113345564B (en) Early prediction method and device for patient hospitalization duration based on graph neural network
CN108694232B (en) Socialized recommendation method based on trust relationship feature learning
Tang et al. Communication-efficient quantum algorithm for distributed machine learning
CN117371508A (en) Model compression method, device, electronic equipment and storage medium
CN112966808A (en) Data analysis method, device, server and readable storage medium
CN112667394B (en) Computer resource utilization rate optimization method
Pinho et al. On the Harris-G class of distributions: general results and application
CN111737319B (en) User cluster prediction method, device, computer equipment and storage medium
CN112927810B (en) Smart medical response method based on big data and smart medical cloud computing system
US20220121999A1 (en) Federated ensemble learning from decentralized data with incremental and decremental updates
Hughes Sklar’s Omega: A Gaussian copula-based framework for assessing agreement
US20140278472A1 (en) Interactive healthcare modeling with continuous convergence
Venkatesh et al. Model selection and regularization
Staněk Optimal out‐of‐sample forecast evaluation under stationarity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination