CN114116920B - Data processing method and device, electronic equipment and storage medium - Google Patents

Publication number
CN114116920B
Authority
CN
China
Prior art keywords
dimension
sample data
determining
data
column
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111404573.7A
Other languages
Chinese (zh)
Other versions
CN114116920A (en)
Inventor
郭枝虾
张超颖
梁宝林
王建秀
马思聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd filed Critical China Telecom Corp Ltd
Priority to CN202111404573.7A priority Critical patent/CN114116920B/en
Publication of CN114116920A publication Critical patent/CN114116920A/en
Application granted granted Critical
Publication of CN114116920B publication Critical patent/CN114116920B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/211Schema design and management
    • G06F16/212Schema design and management with details for data modelling support

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The disclosure provides a data processing method, a data processing device, an electronic device and a storage medium. The method comprises the following steps: carrying out random sampling on a data model of a data warehouse for multiple times to obtain sample data of multiple samples; determining target sample data according to the cost values of the plurality of sample data; coding each element of each field of each dimension in the target sample data to obtain a coding sequence of each element; determining the number of field repetitions with the same elements in each dimension according to the coding sequence of each element; the granularity of each dimension is determined according to the total length of each dimension and the number of field repetitions with the same element in each dimension. The method quickly and accurately determines the granularity of each dimension of the data model in the data warehouse.

Description

Data processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a data processing method and apparatus, an electronic device, and a storage medium.
Background
When association paths are established by automatic modeling in a data warehouse, a Cartesian product arises when a model built with a coarse-grained dimension as the association key is associated with a model built with a fine-grained dimension as the association key. To avoid this Cartesian product when data models are associated, the granularity of each dimension needs to be determined so that the optimal association path can be extracted, thereby realizing automatic modeling.
In the related art, the dimension granularity of a model in a data warehouse is determined by count(distinct dimension field), where distinct performs a group by operation plus an order by operation. The distinct statement performs deduplication; the group by statement is used together with aggregate functions to group the result set by one or more columns; and the order by statement sorts the result set by the specified columns.
When the data volume of the database is large, group by performs poorly, and the order by operation adds further cost. For a data warehouse holding hundreds of millions of records, count(distinct dimension field) is slow to compute and cannot meet the requirement of real-time calculation.
It is noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure and therefore may include information that does not constitute prior art that is already known to a person of ordinary skill in the art.
Disclosure of Invention
The invention aims to provide a data processing method, a data processing device, an electronic device and a storage medium, wherein the method can be used for quickly and accurately determining the granularity of each dimension of a data model in a data warehouse.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
The embodiment of the disclosure provides a data processing method, which includes: carrying out random sampling on a data model of a data warehouse for multiple times to obtain sample data of multiple samples; determining target sample data according to the cost values of the plurality of sample data; coding each element of each field of each dimension in the target sample data to obtain a coding sequence of each element; determining the number of field repetitions with the same elements in each dimension according to the coding sequence of each element; the granularity of each dimension is determined according to the total length of each dimension and the number of field repetitions with the same element in each dimension.
In an exemplary embodiment, determining target sample data from cost values for the plurality of sample data comprises: determining the information entropy of each column of each sample data; determining the weighted information entropy of each sample data according to the information entropy and the attenuation factor of each column of each sample data; determining a punishment item of each sample data according to the median and the minimum of the information entropy of each column of each sample data; determining the cost value of each sample data according to the weighting information entropy of each sample data and the punishment item of each sample data; and determining the sample data with the largest cost value as the target sample data.
In an exemplary embodiment, determining a penalty term for each sample data according to the median and the minimum of the information entropy of each column of said each sample data, includes: and taking the reciprocal of the difference between the median and the minimum of the information entropy of each column of each sample data to obtain the punishment item of each sample data.
In an exemplary embodiment, determining the cost value of each sample data according to the weighting information entropy of each sample data and the penalty term of each sample data comprises: and taking the sum of the weighting information entropy of each sample data and the penalty term of each sample data as the cost value of each sample data.
In an exemplary embodiment, each element comprises a first element, the encoding sequence comprises encodings, the number of encodings in the encoding sequence is equal to the number of fields contained in each dimension, each dimension comprises at least one column; encoding each element of each field of each dimension in the target sample data to obtain an encoding sequence of each element, including: determining a target element which is the same as the first element in the column where the first element is located; determining a coding bit corresponding to the target element in a coding sequence of the first element; and setting the codes of the code bits in the code sequence of the first element as 1, and setting the rest codes as 0 to obtain the code sequence of the first element.
In an exemplary embodiment, each dimension includes a first column and a second column; determining the number of field repetitions with the same elements in each dimension according to the coding sequence of each element comprises: performing a bitwise AND operation on the coding sequence of each element in the first column and the coding sequence of each element in the second column in each dimension to obtain a plurality of intermediate coding sequences; determining the intermediate coding sequences in which the number of codes equal to 1 is greater than 1 as target intermediate coding sequences; and counting the number of the target intermediate coding sequences to obtain the number of field repetitions with the same elements in each dimension.
In an exemplary embodiment, the total length of each dimension is equal to the number of fields contained in each dimension; determining the granularity of each dimension according to the total length of each dimension and the number of field repetitions with the same element in each dimension, wherein the determining comprises the following steps: the difference between the total length of each dimension and the number of field repetitions with the same element in each dimension is determined as the granularity of each dimension.
An embodiment of the present disclosure provides a data processing apparatus, including: the sampling module is used for randomly sampling the data model of the data warehouse for multiple times to obtain multiple sample data; the determining module is used for determining target sample data according to the cost values of the plurality of sample data; the encoding module is used for encoding each element of each field of each dimension in the target sample data to obtain an encoding sequence of each element; the determining module is further used for determining the number of field repetition with the same element in each dimension according to the encoding sequence of each element; the determining module is further configured to determine a granularity of each dimension according to a total length of each dimension and a number of field repetitions having the same element in each dimension.
An embodiment of the present disclosure provides an electronic device, including: at least one processor; a storage terminal device for storing at least one program that, when executed by at least one processor, causes the at least one processor to implement any one of the data processing methods described above.
The disclosed embodiment provides a computer-readable storage medium, on which a computer program is stored, wherein the computer program is used for implementing any one of the data processing methods when being executed by a processor.
According to the data processing method provided by some embodiments of the disclosure, target sample data with high dimensionality granularity representativeness can be determined according to the information entropy of a plurality of sample data, so that the problem of poor representativeness of the sample extracted by a sampling method in the related art is avoided; each element of each field of each dimension in the target sample data is coded to obtain a coded sequence of each element, so that subsequent data processing is facilitated, and the calculation speed and the calculation precision are improved; according to the coding sequence of each element, the repeated number of fields with the same elements in each dimension can be quickly and accurately determined; according to the total length of each dimension and the repeated number of the fields with the same elements in each dimension, the granularity of each dimension of each data model in the data warehouse can be determined quickly and accurately, and therefore the granularity of all the data models can be judged in real time.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It should be apparent that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived by those of ordinary skill in the art without inventive effort.
FIG. 1 is a flow chart illustrating a method of data processing according to an exemplary embodiment.
Fig. 2 is a schematic diagram illustrating a sampling algorithm implementation according to an example.
FIG. 3 is a schematic diagram of an encoding algorithm to compute a granularity of a dimension shown according to an example.
FIG. 4 is a flow diagram illustrating a data process according to an example.
FIG. 5 is a block diagram illustrating a data processing device according to an example embodiment.
FIG. 6 is a schematic diagram of an electronic device shown in accordance with an exemplary embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor terminal devices and/or microcontroller terminal devices.
Hereinafter, each step of the data processing method in the exemplary embodiment of the present disclosure will be described in more detail with reference to the drawings and the embodiments.
FIG. 1 is a flow chart illustrating a method of data processing according to an exemplary embodiment. The method provided by the embodiment of the present disclosure may be executed by a server, but the present disclosure is not limited thereto.
As shown in fig. 1, a data processing method provided by an embodiment of the present disclosure may include the following steps.
In step S102, a plurality of random samples are performed on the data model of the data warehouse to obtain a plurality of sample data.
In the embodiment of the disclosure, multiple random sampling can be performed on the data model in the data warehouse, so as to obtain multiple sampling sample data. The data warehouse may include data in various business scenarios, and the data model may also be referred to as a table.
Fig. 2 is a schematic diagram illustrating a sampling algorithm implementation according to an example.
Referring to fig. 2, for example, the data model of the data warehouse (Dataware) may be Flow, ord, sku, wherein Flow may represent a Flow topic, ord may represent an order topic, and Sku may represent a commodity topic.
In the embodiment of the disclosure, the data model of the data warehouse may be sampled by a cluster master node (Master) and a plurality of cluster slave nodes (Slave No1 …), so as to obtain a plurality of sample data (Random Shuffle). The plurality of sample data may include a plurality of sampled data models, and each data model may include a plurality of fields.
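The repeated random sampling can be illustrated with a minimal single-process sketch. It does not model the master/slave cluster described above; the function name `draw_samples`, the parameters `sample_size` and `num_samples`, and the small table are hypothetical and chosen only for illustration.

```python
import random

def draw_samples(rows, sample_size, num_samples, seed=None):
    """Draw several independent random samples (without replacement) from a table."""
    rng = random.Random(seed)
    size = min(sample_size, len(rows))
    return [rng.sample(rows, size) for _ in range(num_samples)]

# Hypothetical table: each row holds the values of two columns.
table = [("a", "b1"), ("b", "b1"), ("b", "d1"), ("c", "a1"),
         ("a", "b1"), ("c", "d1"), ("d", "b1"), ("b", "b1")]

# Three candidate samples of four rows each; one of them would later be
# selected as the target sample data according to its cost value.
samples = draw_samples(table, sample_size=4, num_samples=3, seed=0)
```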
In step S104, target sample data is determined according to cost values of a plurality of sample data.
In the embodiment of the present disclosure, the target sample data may be determined from a plurality of sample data according to the cost value of each sample data.
In an exemplary embodiment, determining target sample data from cost values of a plurality of sample data may comprise: determining the information entropy of each column of each sample data; determining the weighted information entropy of each sample data according to the information entropy and the attenuation factor of each column of each sample data; determining a punishment item of each sample data according to the median and the minimum of the information entropy of each column of each sample data; determining the cost value of each sample data according to the weighting information entropy of each sample data and the punishment item of each sample data; and determining the sample data with the largest cost value as the target sample data.
In an exemplary embodiment, determining the penalty term for each sample data according to the median and the minimum of the information entropy of each column of each sample data includes: taking the reciprocal of the difference between the median and the minimum of the information entropy of each column of each sample data to obtain the penalty term of each sample data.
In an exemplary embodiment, determining the cost value of each sample data according to the weighting information entropy of each sample data and the penalty term of each sample data comprises: and taking the sum of the weighting information entropy of each sample data and the punishment item of each sample data as the cost value of each sample data.
The following description will be given taking a sample data as an example.
For example, the information entropy ($e_i$, entropy) of each column in the sample data can be calculated by the following formula:
$$e_i(x) = -\sum p(x)\log p(x)$$
where $e_i(x)$ denotes the information entropy of the i-th column, i is an integer greater than or equal to 1, and $p(x)$ denotes the probability of occurrence of the element x.
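As a minimal sketch of the entropy formula above, the probability p(x) can be estimated as the relative frequency of each element in the column. The function name `column_entropy` is hypothetical, and the natural logarithm is an assumption, since the text does not state the base of the logarithm.

```python
import math
from collections import Counter

def column_entropy(values):
    """Shannon entropy -sum p(x) log p(x) of the elements in one column.

    ASSUMPTION: p(x) is the relative frequency of x in the column and the
    logarithm is the natural logarithm; the source does not specify the base.
    """
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# A column with many distinct values has higher entropy than a constant one.
mixed = column_entropy(["a", "b", "b", "c", "a", "c", "d", "b"])
constant = column_entropy(["a", "a", "a", "a"])
```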
In the embodiment of the present disclosure, all columns of the sample data may be sorted in a descending order according to the information entropy of all columns in the sample data.
In the embodiment of the present disclosure, the information entropy of each column in the sample data may be multiplied by an attenuation factor ($f_n$), and the Weighted Information Entropy E of the sample data is obtained by the following formula:
Figure BDA0003372331930000061
where $f_n$ can be calculated by the formula $f_n = \rho_n$, and $\rho_n$ may represent the attenuation coefficient.
In the embodiment of the present disclosure, the difference between the median (median($e_i$)) and the minimum value (min($e_i$)) of the information entropy of the columns in the sample data may be calculated, and its reciprocal taken to obtain a Penalty Term P:
$$P = \frac{1}{\operatorname{median}(e_i) - \min(e_i)}$$
In the embodiment of the present disclosure, the weighted information entropy of the sample data and its penalty term may be added to obtain a Cost value (Cost) of the sample data:
$$\mathrm{Cost} = E + P$$
In the embodiment of the present disclosure, the Cost values of all sample data may be calculated by the above method and arranged in descending order, and the sample data with the highest Cost value, that is, the top-1 sample data, is taken as the target sample data resulting from sampling.
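The cost computation can be sketched as follows. The penalty term and the sum Cost = E + P follow the formulas in the text. The form of the weighted information entropy is an assumption: the text only says that the column entropies, sorted in descending order, are multiplied by attenuation factors, so the sketch assumes E is the sum of those products and takes the factors as caller-supplied input. All function names are hypothetical.

```python
import statistics

def weighted_entropy(entropies, factors):
    # ASSUMPTION: E is the sum of f_n * e_n over the columns sorted by
    # descending entropy; the exact formula is not reproduced in the text.
    ordered = sorted(entropies, reverse=True)
    return sum(f * e for f, e in zip(factors, ordered))

def penalty(entropies):
    # P = 1 / (median(e_i) - min(e_i)), as stated in the text.
    gap = statistics.median(entropies) - min(entropies)
    if gap == 0:
        # The text does not say how this case is handled.
        raise ValueError("penalty undefined: median equals minimum")
    return 1.0 / gap

def cost(entropies, factors):
    # Cost = E + P
    return weighted_entropy(entropies, factors) + penalty(entropies)

def pick_target(all_entropies, factors):
    """Return the index of the sample with the largest cost value (top-1)."""
    return max(range(len(all_entropies)),
               key=lambda i: cost(all_entropies[i], factors))

# Hypothetical per-column entropies for two candidate samples.
factors = [0.5, 0.25, 0.125]
candidates = [[1.0, 2.0, 3.0], [1.0, 1.5, 3.0]]
best = pick_target(candidates, factors)
```

With these illustrative numbers the second candidate has the higher cost, because its median entropy lies closer to its minimum and the penalty term is therefore larger.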
In the embodiment of the present disclosure, the process of determining the target sample data may be referred to as an EDW sample sampling method, that is, a sampling method based on information entropy, attenuation factor, and weight distribution.
In the embodiment of the disclosure, the sample sampling method based on the information entropy and the attenuation factor can extract target sample data that is highly representative of dimension granularity information, and can solve the problem of sampling errors in the sampling methods of the related art.
In step S106, each element of each field of each dimension in the target sample data is encoded, and an encoding sequence of each element is obtained.
In embodiments of the present disclosure, the target sample data may include one or more dimensions, each of which may include one or more fields, each of which may include one or more elements.
In the embodiment of the present disclosure, each element of each field of each dimension in the target sample data may be encoded, and an encoding sequence of each element of each field of each dimension is obtained.
If each dimension includes only one field, which includes one or more elements, each specific element of each dimension may be encoded; if each dimension includes multiple fields, each particular element of each field of each dimension may be encoded.
In an exemplary embodiment, each element comprises a first element, the encoding sequence comprises an encoding, the number of encodings in the encoding sequence is equal to the number of fields contained in each dimension, and each dimension comprises at least one column.
One field may include one or more elements, and the following description takes an encoding process of one element (the first element) as an example, and encoding processes of other elements are similar, which is not described again in this disclosure.
The encoding sequence of the first element may include one or more encodings, wherein the number of encodings in the encoding sequence may be the same as the number of fields contained in the dimension in which the element is located.
FIG. 3 is a schematic diagram of an encoding algorithm that computes a granularity of a dimension, shown according to an example.
For example, referring to fig. 3, 301 may represent one dimension of target sample data, where each row may represent one field, that is, the dimension 301 may include 8 fields: ab1, bb1, bd1, ca1, ab1, cd1, db1, bb1, the coding sequence obtained after coding each element may comprise 8 codes. Each field may include 2 elements, for example, the first field includes element a and element b1.
In the embodiment of the present disclosure, the elements a, b, c, d, a1, b1, and d1 may be encoded (i.e., (1) encoded), and an encoding sequence of each element is obtained.
In an exemplary embodiment, encoding each element of each field of each dimension in the target sample data, obtaining an encoded sequence of each element, comprises: determining a target element which is the same as the first element in the column of the first element; determining a coding bit corresponding to the target element in the coding sequence of the first element; and setting the codes of the code bits in the code sequence of the first element as 1, and setting the rest codes as 0 to obtain the code sequence of the first element.
For example, with continued reference to fig. 3, the encoding process of element a is described as an example below.
For example, if the dimension of the element a includes 8 fields, the encoding sequence corresponding to the element a includes 8 encodings.
First, the elements that are the same as the element a in the column (abbcacdb) where the element a is located may be determined, i.e., the first element and the fifth element; then, the coding bits corresponding to the first element (i.e., the first bit) and the fifth element (i.e., the fifth bit) may be determined; finally, the first bit and the fifth bit in the coding sequence of the element a are set to 1 and the remaining bits are set to 0, so that the coding sequence of the element a is 10001000.
By analogy, the encoding sequence of element b is 01100001, that of element c is 00010100, that of element d is 00000010, that of element a1 is 00010000, that of element b1 is 11001011 (b1 appears in the first, second, fifth, seventh, and eighth fields), and that of element d1 is 00100100.
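The encoding step can be sketched as below: for each distinct element of a column, a bit string is built with a 1 at every field position where that element occurs. The function name `encode_column` is hypothetical; the column is the first column of dimension 301 from the example.

```python
def encode_column(column):
    """Map each distinct element to a bit string marking the fields where it occurs."""
    codes = {}
    for element in dict.fromkeys(column):  # distinct elements, first-seen order
        codes[element] = "".join("1" if v == element else "0" for v in column)
    return codes

first_column = list("abbcacdb")
codes = encode_column(first_column)
print(codes["a"])  # 10001000
print(codes["d"])  # 00000010
```

Each coding sequence has as many bits as the dimension has fields, so all sequences of one dimension can be compared position by position.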
In step S108, the number of field repetitions where the elements are the same in each dimension is determined from the encoded sequence of each element.
In the embodiment of the present disclosure, the number of field repetitions with the same element in each dimension may be determined according to the encoding sequence of each element of each field in each dimension.
The fields with the same element can be one or more, that is, one or more fields with the same element can be included in one dimension.
With continued reference to fig. 3, for example, in dimension 301, there are 2 fields ab1 and 2 fields bb1, that is, there are two fields with the same element in dimension 301, namely ab1 and bb1, respectively, and the number of field repetitions with the same element is 2 (i.e., 1 repetition of field ab1 and 1 repetition of bb 1).
In an exemplary embodiment, each dimension includes a first column and a second column.
In the embodiment of the present disclosure, each dimension includes two columns for illustration, and in practical applications, one dimension may include one or more columns, which is not limited by the present disclosure.
With continued reference to FIG. 3, in dimension 301, the first column may be abbcacdb and the second column may be b1b1d1a1b1d1b1b1.
In an exemplary embodiment, determining the number of field repetitions for which the elements are identical in each dimension based on the encoding sequence of each element comprises: performing a bitwise AND operation on the coding sequence of each element in the first column and the coding sequence of each element in the second column in each dimension to obtain a plurality of intermediate coding sequences; determining the intermediate coding sequences in which the number of codes equal to 1 is greater than 1 as target intermediate coding sequences; and counting the number of target intermediate coding sequences to obtain the number of field repetitions with the same elements in each dimension.
In the embodiment of the present disclosure, a bitwise AND calculation based on the element coding sequences may be performed between the plurality of fields in each dimension; the number of 1s in each coding sequence obtained by the AND calculation is counted, and the number of coding sequences in which the number of 1s is greater than 1 is taken as the number of field repetitions with the same elements in each dimension.
With continued reference to fig. 3, the coding sequence of each element in the first column is: the encoding sequence of the element a is 10001000, the encoding sequence of the element b is 01100001, the encoding sequence of the element c is 00010100, and the encoding sequence of the element d is 00000010; the coding sequence of each element in the second column is: the encoding sequence of element a1 is 00010000, the encoding sequence of element b1 is 11001011, and the encoding sequence of element d1 is 00100100.
With continued reference to fig. 3, a bitwise AND operation (i.e., (2) AND) is performed on the coding sequence of each element in the first column and the coding sequence of each element in the second column to obtain a plurality of intermediate coding sequences: a & a1, a & b1, a & d1, b & a1, b & d1, b & b1, c & a1, c & b1, c & d1. The a & a1 intermediate coding sequence is 00000000; the a & b1 intermediate coding sequence is 10001000; the a & d1 intermediate coding sequence is 00000000; the b & a1 intermediate coding sequence is 00000000; the b & d1 intermediate coding sequence is 00100000; the b & b1 intermediate coding sequence is 01000001; the c & a1 intermediate coding sequence is 00010000; the c & b1 intermediate coding sequence is 00000000; and the c & d1 intermediate coding sequence is 00000100.
With continued reference to FIG. 3, the intermediate coding sequences in which the number of codes equal to 1 is greater than 1 are determined as the target intermediate coding sequences (i.e., (3) judge count(1) > 1). Since the number of 1s in both the a & b1 and the b & b1 intermediate coding sequences is greater than 1, the target intermediate coding sequences are a & b1 and b & b1.
With continued reference to fig. 3, the number result2 of target intermediate coding sequences is counted; there are 2 target intermediate coding sequences, so the number of field repetitions with the same element in dimension 301 is 2, i.e., result2 = 2.
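The AND-and-count step can be sketched with integer bitmasks, which make the bitwise AND a single operation per pair of elements. The helper names `bitmasks` and `count_repeated` are hypothetical; the two columns are those of dimension 301 in the example.

```python
def bitmasks(column):
    """Encode each distinct element as an integer whose 1 bits mark its field positions."""
    n = len(column)
    masks = {}
    for pos, element in enumerate(column):
        # The leftmost bit corresponds to the first field.
        masks[element] = masks.get(element, 0) | (1 << (n - 1 - pos))
    return masks

def count_repeated(first_column, second_column):
    """Count the AND results that contain more than one 1 bit (target sequences)."""
    first = bitmasks(first_column)
    second = bitmasks(second_column)
    return sum(
        1
        for x in first.values()
        for y in second.values()
        if bin(x & y).count("1") > 1
    )

first_column = ["a", "b", "b", "c", "a", "c", "d", "b"]
second_column = ["b1", "b1", "d1", "a1", "b1", "d1", "b1", "b1"]
result2 = count_repeated(first_column, second_column)
print(result2)  # 2
```

Only the pairs a & b1 and b & b1 yield more than one 1 bit, matching the two repeated fields ab1 and bb1.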
In step S110, the granularity of each dimension is determined according to the total length of each dimension and the number of field repetitions with the same element in each dimension.
In an exemplary embodiment, the total length of each dimension is equal to the number of fields contained in each dimension.
With continued reference to fig. 3, the number of fields in the dimension 301 may be counted as a total length result1 of the dimension (i.e., (4) calculated total length), and the dimension 301 includes 8 fields, i.e., the total length of the dimension 301 is 8, i.e., result1=8.
In an exemplary embodiment, determining the granularity of each dimension according to the total length of each dimension and the number of field repetitions with the same element in each dimension comprises: the difference between the total length of each dimension and the number of field repetitions with the same element in each dimension is determined as the granularity of each dimension.
In the embodiment of the disclosure, the number result2 of field repetitions with the same element in each dimension is subtracted from the total length result1 of each dimension (i.e., (5) result1 - result2), so as to obtain the granularity result of each dimension.
With continued reference to FIG. 3, using result1 minus result2, a granularity result of 6 for dimension 301 can be obtained.
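The final subtraction can be checked against the example with a short sketch. The function name `dimension_granularity` is hypothetical, and the repeated-field count of 2 is taken from the worked example rather than recomputed here.

```python
def dimension_granularity(fields, repeated_count):
    """Granularity = total length (number of fields) - number of repeated fields."""
    result1 = len(fields)
    return result1 - repeated_count

# The eight fields of dimension 301 from the example.
fields = ["ab1", "bb1", "bd1", "ca1", "ab1", "cd1", "db1", "bb1"]
result = dimension_granularity(fields, repeated_count=2)
print(result)            # 6
print(len(set(fields)))  # 6
```

For this example the result equals the number of distinct fields, which is the value that count(distinct dimension field) in the related art would return.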
According to the data processing method provided by the embodiments of the disclosure, target sample data that is highly representative for dimension granularity can be determined according to the information entropy of the plurality of sample data, which avoids the poor representativeness of samples extracted by the sampling methods of the related art. Each element of each field of each dimension in the target sample data is encoded to obtain a coding sequence of each element, which facilitates subsequent data processing and improves calculation speed and precision. From the coding sequence of each element, the number of field repetitions with the same element in each dimension can be determined quickly and accurately. From the total length of each dimension and the number of field repetitions with the same element in each dimension, the granularity of each dimension of the data model in the data warehouse can be determined quickly and accurately, so that the granularity of all data models can be judged in real time.
FIG. 4 is a flowchart illustrating a data processing method according to an example.
As shown in fig. 4, a data processing method provided by an embodiment of the present disclosure may include the following steps.
In step S402, the data models of the entire data warehouse are randomly sampled multiple times.
In step S404, the information entropy of each column of each sample data model is calculated, and all columns of the sample are sorted in descending order of information entropy.
In step S406, the information entropy of each column is multiplied by an attenuation factor to obtain the weighted information entropy E.
In step S408, the difference between the median and the minimum of the column information entropies in each sample data model is calculated, and its reciprocal is taken to obtain the penalty term P.
In step S410, a cost value of the sample data model is obtained according to the weighted information entropy E and the penalty term P.
In step S412, cost values are calculated for all sample data models and arranged in descending order, and the top-1 sample data model is taken as the sample data resulting from the sampling.
In step S414, each element of each dimension of each data model in the sample data selected as the sampling result is encoded.
In step S416, the total length result1 of the dimension is counted.
In step S418, the plurality of fields composing the dimension are combined, and a bitwise AND is performed between the fields based on the element codes.
In step S420, the number of 1s in each code obtained by the calculation is counted, and the number result2 of code sequences in which the number of 1s is greater than 1 is calculated.
In step S422, result2 is subtracted from result1 to obtain the granularity result of the dimension.
Steps S402 to S414 may be completed offline, and steps S416 to S422 may be completed in real time.
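The offline selection in steps S404 to S412 can be illustrated with a minimal Python sketch. It is not the implementation of the disclosure. Several details are assumptions made only for illustration: Shannon entropy in base 2, an attenuation factor of `decay**k` for the column ranked k by descending entropy, E taken as the sum of the attenuated column entropies, and a small `eps` that avoids division by zero when the median equals the minimum. The names `column_entropy`, `cost_value`, and `pick_target_sample` are hypothetical.

```python
import math
import statistics
from collections import Counter


def column_entropy(values):
    """Shannon entropy (base 2) of the value distribution of one column."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def cost_value(sample, decay=0.9, eps=1e-9):
    """Cost of one sample, given as a dict mapping column name to values.

    Assumptions (not specified in the text): the attenuation factor for
    the column ranked k (0-based, descending entropy) is decay**k, E is
    the sum of the attenuated entropies, and eps replaces a zero
    median-minimum gap, which makes the penalty very large.
    """
    entropies = sorted((column_entropy(col) for col in sample.values()),
                       reverse=True)                                  # S404
    weighted = sum(h * decay ** k for k, h in enumerate(entropies))   # S406: E
    gap = statistics.median(entropies) - min(entropies)
    penalty = 1.0 / max(gap, eps)                                     # S408: P
    return weighted + penalty                                         # S410


def pick_target_sample(samples, **kwargs):
    """Return the sample with the largest cost value (S412, top 1)."""
    return max(samples, key=lambda s: cost_value(s, **kwargs))


# Hypothetical samples, each with three columns of four rows.
sample_a = {"x": [1, 2, 3, 4], "y": [1, 1, 2, 2], "z": [1, 1, 1, 1]}
sample_b = {"x": [1, 1, 2, 2], "y": [1, 1, 1, 2], "z": [1, 1, 1, 1]}
target = pick_target_sample([sample_b, sample_a])
```

With these hypothetical inputs, `sample_a` has column entropies 2, 1, and 0, so E = 2 + 0.9 = 2.9, P = 1 / (1 - 0) = 1, and the cost is 3.9, which is larger than the cost of `sample_b`.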
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods. For details not disclosed in the embodiments of the apparatus of the present disclosure, refer to the embodiments of the method of the present disclosure.
FIG. 5 is a block diagram illustrating a data processing apparatus according to an example embodiment.
As shown in fig. 5, the data processing apparatus 500 may include: a sampling module 502, a determination module 504, and an encoding module 506.
The sampling module 502 may be configured to perform multiple random sampling on a data model of a data warehouse to obtain multiple sample data; the determining module 504 may be configured to determine target sample data according to cost values of the plurality of sample data; the encoding module 506 may be configured to encode each element of each field of each dimension in the target sample data to obtain an encoding sequence of each element; the determining module 504 may be further configured to determine a number of field repetitions for which the elements are the same in each dimension according to the encoding sequence of each element; the determining module 504 may be further configured to determine the granularity of each dimension according to a total length of each dimension and a number of field repetitions having the same element in each dimension.
In an exemplary embodiment, the determining module 504 may be further configured to: determine the information entropy of each column of each sample data; determine the weighted information entropy of each sample data according to the information entropy and the attenuation factor of each column of each sample data; determine a penalty term of each sample data according to the median and the minimum of the information entropy of each column of each sample data; determine the cost value of each sample data according to the weighted information entropy of each sample data and the penalty term of each sample data; and determine the sample data with the largest cost value as the target sample data.
In an exemplary embodiment, the determining module 504 may be further configured to take a reciprocal of a difference between a median and a minimum of the information entropy of each column of each sample data, and obtain a penalty term for each sample data.
In an exemplary embodiment, the determining module 504 may be further configured to use a sum of the weighted information entropy of each sample data and the penalty term of each sample data as the cost value of each sample data.
In an exemplary embodiment, each element comprises a first element, the encoding sequence comprises encodings, the number of encodings in the encoding sequence is equal to the number of fields contained in each dimension, each dimension comprises at least one column; the determining module 504 may be further configured to determine a target element in the column where the first element is located, where the target element is the same as the first element; determining a coding bit corresponding to the target element in the coding sequence of the first element; and setting the codes of the coding bits in the coding sequence of the first element to be 1, and setting the rest codes to be 0 to obtain the coding sequence of the first element.
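The encoding rule described above can be sketched in Python as follows. This is an illustrative reading, not the disclosed implementation: it assumes that the "target elements" of an element include the element's own position, and it represents each coding sequence as a list of 0/1 integers. The function name `encode_column` and the sample values are hypothetical.

```python
def encode_column(column):
    """Return one coding sequence per element of `column`.

    Position i of an element's sequence is 1 when column[i] equals that
    element (its own position included, which is an assumption) and 0
    otherwise. Every sequence has len(column) codes, i.e. one code per
    field of the dimension.
    """
    return [[1 if other == value else 0 for other in column]
            for value in column]


# Hypothetical column with four fields; "a" appears at positions 0 and 2.
codes = encode_column(["a", "b", "a", "c"])
# codes == [[1, 0, 1, 0], [0, 1, 0, 0], [1, 0, 1, 0], [0, 0, 0, 1]]
```

A repeated value therefore yields a sequence containing more than one 1, while a value that is unique in its column yields a sequence with exactly one 1.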
In an exemplary embodiment, each dimension includes a first column and a second column; the determining module 504 may be further configured to: perform a bitwise AND of the coding sequence of each element in the first column and the coding sequence of each element in the second column in each dimension to obtain a plurality of intermediate coding sequences; determine the intermediate coding sequences in which the number of codes equal to 1 is greater than 1 as target intermediate coding sequences; and count the number of target intermediate coding sequences to obtain the number of field repetitions with the same element in each dimension.
In an exemplary embodiment, the total length of each dimension is equal to the number of fields contained in each dimension; wherein the determining module 504 is further configured to determine a difference between a total length of each dimension and a number of field repetitions of the same element in each dimension as a granularity of each dimension.
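The two-column case and the final subtraction can be sketched together in Python. The sketch rests on assumptions that the text does not settle: elements of the first and second columns are paired by field position, every field whose intermediate sequence contains more than one 1 is counted, and an element's coding sequence marks its own position. The function names and the sample data are hypothetical.

```python
def encode_column(column):
    """Coding sequence per element: 1 where the column holds an equal value."""
    return [[1 if other == value else 0 for other in column]
            for value in column]


def dimension_granularity(first_column, second_column):
    """Granularity of a two-column dimension as result1 - result2."""
    if len(first_column) != len(second_column):
        raise ValueError("both columns must contain the same number of fields")
    total_length = len(first_column)                      # result1
    codes_first = encode_column(first_column)
    codes_second = encode_column(second_column)
    # Bitwise AND of the two coding sequences of each field.
    intermediate = [[a & b for a, b in zip(seq_a, seq_b)]
                    for seq_a, seq_b in zip(codes_first, codes_second)]
    # Target intermediate sequences contain more than one 1.
    repeats = sum(1 for seq in intermediate if sum(seq) > 1)  # result2
    return total_length - repeats


# Hypothetical dimension with five fields; fields 0 and 1 hold the same pair.
first = ["cn", "cn", "us", "us", "cn"]
second = ["gd", "gd", "ny", "ca", "bj"]
granularity = dimension_granularity(first, second)  # 5 - 2 = 3
```

Under this reading, each field that belongs to a repeated combination is counted once, so the two identical fields in the example contribute result2 = 2.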
It is noted that the block diagrams shown in the above figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor terminal devices and/or microcontroller terminal devices.
Fig. 6 is a schematic structural diagram of an electronic device according to an exemplary embodiment. It should be noted that the electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 6, the electronic apparatus 600 includes a Central Processing Unit (CPU) 601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read therefrom is installed into the storage section 608 as necessary.
In particular, the processes described above with reference to the flow diagrams may be implemented as computer software programs, according to embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The above-described functions defined in the system of the present disclosure are executed when the computer program is executed by a Central Processing Unit (CPU) 601.
It should be noted that the computer readable media shown in the present disclosure may be computer readable signal media or computer readable storage media or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, terminal device, or apparatus, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, terminal device, or apparatus. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, terminal device, or apparatus. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor includes a sending unit, an obtaining unit, a determining unit, and a first processing unit. The names of these units do not in some cases constitute a limitation on the units themselves; for example, the sending unit may also be described as a "unit sending a picture acquisition request to a connected server".
As another aspect, the present disclosure also provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments, or may exist separately without being incorporated into the electronic device. The computer-readable storage medium carries one or more programs that, when executed by the electronic device, cause the electronic device to implement the method described in the embodiments above. For example, the electronic device may implement the steps shown in FIG. 1.
According to an aspect of the present disclosure, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the method provided in the various alternative implementations of the above-described embodiments.
It is to be understood that any number of elements in the drawings of the present disclosure are by way of example and not by way of limitation, and that any nomenclature is used for distinction only, and not by way of limitation.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A data processing method, comprising:
performing multiple random samplings on a data model of a data warehouse to obtain a plurality of sample data;
determining target sample data according to the cost values of the plurality of sample data;
encoding each element of each field of each dimension in the target sample data to obtain a coding sequence of each element;
determining the number of field repetitions with the same element in each dimension according to the coding sequence of each element; and
determining the granularity of each dimension according to the total length of each dimension and the number of field repetitions with the same element in each dimension.
2. The method of claim 1, wherein determining target sample data in dependence on cost values for the plurality of sample data comprises:
determining the information entropy of each column of each sample data;
determining the weighted information entropy of each sample data according to the information entropy and the attenuation factor of each column of each sample data;
determining a penalty term of each sample data according to the median and the minimum of the information entropy of each column of each sample data;
determining the cost value of each sample data according to the weighted information entropy of each sample data and the penalty term of each sample data;
and determining the sample data with the largest cost value as the target sample data.
3. The method according to claim 2, wherein determining a penalty term of each sample data according to the median and the minimum of the information entropy of each column of each sample data comprises:
taking the reciprocal of the difference between the median and the minimum of the information entropy of each column of each sample data to obtain the penalty term of each sample data.
4. The method of claim 2, wherein determining the cost value of each sample data according to the weighted information entropy of each sample data and the penalty term of each sample data comprises:
taking the sum of the weighted information entropy of each sample data and the penalty term of each sample data as the cost value of each sample data.
5. The method according to claim 1, wherein each element comprises a first element, the coding sequence comprises codes, the number of codes in the coding sequence is equal to the number of fields contained in each dimension, and each dimension comprises at least one column;
encoding each element of each field of each dimension in the target sample data to obtain an encoding sequence of each element, including:
determining a target element which is the same as the first element in the column where the first element is located;
determining a coding bit corresponding to the target element in a coding sequence of the first element;
and setting the codes of the coding bits in the coding sequence of the first element to be 1, and setting the rest codes to be 0 to obtain the coding sequence of the first element.
6. The method of claim 5, wherein each dimension comprises a first column and a second column;
determining the number of field repetitions with the same element in each dimension according to the coding sequence of each element, wherein the determining comprises the following steps:
performing a bitwise AND of the coding sequence of each element in the first column and the coding sequence of each element in the second column in each dimension to obtain a plurality of intermediate coding sequences;
determining the intermediate coding sequences in which the number of codes equal to 1 is greater than 1 as target intermediate coding sequences;
and counting the number of the target intermediate coding sequences to obtain the number of field repetitions with the same element in each dimension.
7. The method of claim 1, wherein the total length of each dimension is equal to the number of fields contained in each dimension;
determining the granularity of each dimension according to the total length of each dimension and the number of field repetitions with the same element in each dimension, wherein the determining comprises the following steps:
determining the difference between the total length of each dimension and the number of field repetitions with the same element in each dimension as the granularity of each dimension.
8. A data processing apparatus, characterized by comprising:
the sampling module is used for randomly sampling the data model of the data warehouse multiple times to obtain a plurality of sample data;
the determining module is used for determining target sample data according to the cost values of the plurality of sample data;
the encoding module is used for encoding each element of each field of each dimension in the target sample data to obtain an encoding sequence of each element;
the determining module is further used for determining the number of field repetitions with the same element in each dimension according to the encoding sequence of each element;
the determining module is further configured to determine a granularity of each dimension according to a total length of each dimension and a number of field repetitions having the same element in each dimension.
9. An electronic device, comprising:
at least one processor;
storage means for storing at least one program which, when executed by the at least one processor, causes the at least one processor to carry out the method of any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202111404573.7A 2021-11-24 2021-11-24 Data processing method and device, electronic equipment and storage medium Active CN114116920B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111404573.7A CN114116920B (en) 2021-11-24 2021-11-24 Data processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114116920A CN114116920A (en) 2022-03-01
CN114116920B true CN114116920B (en) 2022-12-30

Family

ID=80371963

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111404573.7A Active CN114116920B (en) 2021-11-24 2021-11-24 Data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114116920B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108920635A (en) * 2018-06-27 2018-11-30 平安科技(深圳)有限公司 A kind of method and device of data encoding analysis
CN109491989A (en) * 2018-11-12 2019-03-19 北京懿医云科技有限公司 Data processing method and device, electronic equipment, storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10127293B2 (en) * 2015-03-30 2018-11-13 International Business Machines Corporation Collaborative data intelligence between data warehouse models and big data stores
US20190130226A1 (en) * 2017-10-27 2019-05-02 International Business Machines Corporation Facilitating automatic handling of incomplete data in a random forest model

Also Published As

Publication number Publication date
CN114116920A (en) 2022-03-01

Similar Documents

Publication Publication Date Title
CN110020427B (en) Policy determination method and device
CN112035753B (en) Recommendation page generation method and device, electronic equipment and computer readable medium
CN114049072B (en) Index determination method and device, electronic equipment and computer readable medium
CN112488297A (en) Neural network pruning method, model generation method and device
CN114116920B (en) Data processing method and device, electronic equipment and storage medium
CN111784246B (en) Logistics path estimation method
CN113434436B (en) Test case generation method and device, electronic equipment and storage medium
CN113128696A (en) Distributed machine learning communication optimization method and device, server and terminal equipment
CN113657552A (en) Data processing method and device, electronic equipment and storage medium
CN112860999B (en) Information recommendation method, device, equipment and storage medium
CN113094415B (en) Data extraction method, data extraction device, computer readable medium and electronic equipment
CN114926234A (en) Article information pushing method and device, electronic equipment and computer readable medium
CN113850523A (en) ESG index determining method based on data completion and related product
Leung et al. Image segmentation using maximum entropy method
CN111784377A (en) Method and apparatus for generating information
CN117748499B (en) Topology structure identification method and device for low-voltage area based on connection relation vector
CN111746992A (en) AGV-based automatic warehouse goods storage position determination method and device
CN115098793B (en) User portrait analysis method and system based on big data
CN110889462B (en) Data processing method, device, equipment and storage medium
CN109614328B (en) Method and apparatus for processing test data
CN114697322B (en) Data screening method based on cloud service processing
CN116501993B (en) House source data recommendation method and device
CN108984556B (en) Method, apparatus and computer-readable storage medium for data processing
CN116822475A (en) Processing method, device, equipment and medium of form data
CN114691743A (en) Standardized evaluation method and device of data model and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20220301

Assignee: Tianyiyun Technology Co.,Ltd.

Assignor: CHINA TELECOM Corp.,Ltd.

Contract record no.: X2024110000020

Denomination of invention: Data processing methods, devices, electronic devices, and storage media

Granted publication date: 20221230

License type: Common License

Record date: 20240315
