CN111563109A - Radix statistics method, apparatus, system, device and computer readable storage medium - Google Patents

Publication number
CN111563109A
Authority
CN
China
Prior art keywords
hash
data
bitmap
target data
dimension
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010339945.1A
Other languages
Chinese (zh)
Other versions
CN111563109B (en)
Inventor
杜红光
罗华林
何凯
夏春伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202010339945.1A
Publication of CN111563109A
Application granted
Publication of CN111563109B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2458: Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462: Approximate or statistical queries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22: Indexing; Data structures therefor; Storage structures
    • G06F16/2228: Indexing structures
    • G06F16/2255: Hash tables
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

Embodiments of the invention provide a cardinality statistics method, apparatus, system, device, and computer-readable storage medium. On the data node side, the method comprises: obtaining target data and dimension data corresponding to the target data; performing hash calculation on the target data by using a hash algorithm to obtain a hash value corresponding to the target data; generating a bitmap array for the dimension data corresponding to the target data by using a bitmap algorithm; storing, in a preset hash table, the hash value corresponding to the target data in association with the bitmap array generated for the dimension data corresponding to the target data; and sending the hash table to a preset computing node. On the computing node side, the method comprises: receiving the hash tables respectively sent by a plurality of data nodes; merging the hash tables; and performing cardinality statistics bit by bit on the bitmap arrays in the merged hash table. Through this two-layer hash approach, the invention saves data storage space and completes cardinality statistics quickly and accurately.

Description

Radix statistics method, apparatus, system, device and computer readable storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a radix statistics method, apparatus, system, device, and computer-readable storage medium.
Background
Cardinality statistics (also called radix statistics herein) count the number of distinct elements in a batch of data. They are commonly used to calculate the number of unique visitors (UV), the distinct-value counts of dimensions, and the like. In actual production environments, scenarios that require an exact cardinality are common. For example, after a website's access entry is adjusted in two ways (for example, the two entries have different appearances), an A/B experiment needs to be performed on the two entries. In the A/B experiment, the number of independent users who access each entry within a preset time period must be counted accurately, so that the two entries can be analyzed before they go online.
Currently, in a distributed computing environment, a commonly used cardinality statistics algorithm is the HyperLogLog algorithm.
For the HyperLogLog algorithm, the required storage space is directly proportional to the order of magnitude of the element data. To mark enough element data within a limited storage space, the algorithm compresses the element data when storing it. This compression loses information about the element data, so the result of the HyperLogLog algorithm has low accuracy and a large error and cannot meet the requirement of accurate statistics.
Disclosure of Invention
Embodiments of the invention aim to provide a cardinality statistics method, apparatus, system, device, and computer-readable storage medium, so as to solve the problem of low statistical accuracy in existing cardinality statistics approaches.
In view of the above technical problems, the specific technical solution of the embodiment of the present invention is as follows:
In a first aspect of the present invention, there is provided a cardinality statistics method performed on a data node side, including: acquiring target data and dimension data corresponding to the target data; performing hash calculation on the target data by using a preset hash algorithm to obtain a hash value corresponding to the target data; generating a bitmap array for the dimension data corresponding to the target data by using a preset bitmap algorithm, where each bit in the bitmap array maps a dimension element on which cardinality statistics are to be performed; storing, in a preset hash table, the hash value corresponding to the target data in association with the bitmap array generated for the dimension data corresponding to the target data; and sending the hash table to a preset computing node, so that the computing node performs cardinality statistics bit by bit on the bitmap arrays in the received hash table.
The dimension data corresponding to the target data includes a plurality of dimension values of the target data. Generating a bitmap array for the dimension data corresponding to the target data by using a preset bitmap algorithm includes: querying a mapping relation table preset for the data node and determining the dimension elements corresponding to the plurality of dimension values in the dimension data, where the mapping relation table records at least one dimension element and the bit that each dimension element maps to in the bitmap array; and, in the bitmap array, setting the bits mapped by the dimension elements corresponding to the plurality of dimension values to a first bit value and setting the other bits to a second bit value, so as to obtain the bitmap array corresponding to the dimension data.
Storing, in the preset hash table, the hash value corresponding to the target data in association with the bitmap array generated for the dimension data corresponding to the target data includes: in a Java language environment, using the Trove package to store the hash value and the bitmap array in association in the hash table.
After the hash value corresponding to the target data and the bitmap array generated for the dimension data corresponding to the target data are stored in the preset hash table, and before the hash table is sent to the preset computing node, the method further includes: querying whether identical hash values exist in the hash table; and, if identical hash values exist, aggregating the plurality of bitmap arrays corresponding to the identical hash values.
In a second aspect of the present invention, there is also provided a cardinality statistics method performed on a computing node side, including: receiving hash tables respectively sent by a plurality of data nodes, where each hash table stores hash values and bitmap arrays in association, each hash value is obtained by performing hash calculation on target data by using a preset hash algorithm, each bitmap array is generated for the dimension data corresponding to the target data by using a preset bitmap algorithm, and each bit in a bitmap array maps a dimension element on which cardinality statistics are to be performed; and merging the hash tables respectively sent by the data nodes, and performing cardinality statistics bit by bit on the plurality of bitmap arrays in the merged hash table.
Performing cardinality statistics bit by bit on the plurality of bitmap arrays in the merged hash table includes: querying whether identical hash values exist in the merged hash table; if identical hash values exist, aggregating the plurality of bitmap arrays corresponding to the identical hash values to obtain an aggregated bitmap array for that hash value; and, in the merged hash table, performing cardinality statistics bit by bit on the aggregated bitmap array and on the bitmap arrays corresponding to the remaining hash values.
Merging the hash tables respectively sent by the data nodes and performing cardinality statistics bit by bit on the plurality of bitmap arrays in the merged hash table includes: dividing the plurality of hash tables into a plurality of mutually disjoint hash table sets; for each hash table set, merging the hash tables in the set and performing cardinality statistics bit by bit on the bitmap arrays in the merged hash table to obtain the cardinality statistics result for that set; and, after the results for all hash table sets are obtained, aggregating them to obtain the final cardinality statistics result.
Before the merging of the plurality of hash tables, the method further includes: acquiring a mapping relation table corresponding to each data node, where the mapping relation table records at least one dimension element and the bit that each dimension element maps to in a preset bitmap array; and aligning, according to the mapping relation table corresponding to each data node, the bitmap arrays in the hash tables respectively sent by the data nodes, so that the bitmap arrays in the hash tables have the same number of bits and the bits at corresponding positions map the same dimension elements.
In a third aspect of the present invention, there is further provided a cardinality statistics apparatus arranged on a data node side, including: an acquisition module, configured to acquire target data and dimension data corresponding to the target data; a first hash module, configured to perform hash calculation on the target data by using a preset hash algorithm to obtain a hash value corresponding to the target data; a bitmap generation module, configured to generate a bitmap array for the dimension data corresponding to the target data by using a preset bitmap algorithm, where each bit in the bitmap array corresponds to a dimension element on which cardinality statistics are to be performed; a second hash module, configured to store, in a preset hash table, the hash value corresponding to the target data in association with the bitmap array generated for the dimension data corresponding to the target data; and a sending module, configured to send the hash table to a preset computing node, so that the computing node performs cardinality statistics bit by bit on the bitmap arrays in the received hash table.
In a fourth aspect of the present invention, there is also provided a radix statistics apparatus, disposed on a side of a computing node, including: the receiving module is used for receiving hash tables respectively sent by a plurality of data nodes; wherein, a hash value and a bitmap array are correspondingly stored in each hash table; the hash value is obtained by performing hash calculation on target data by using a preset hash algorithm, and the bitmap array is generated for dimensional data corresponding to the target data by using a preset bitmap algorithm; each bit in the bitmap array corresponds to a dimension element to be subjected to base number statistics; and the counting module is used for combining the hash tables respectively sent by the data nodes and executing radix number counting processing on the bitmap arrays according to bit positions in the combined hash tables.
In a fifth aspect of the present invention, there is also provided a radix statistics system, in which: the system comprises a plurality of data nodes and computing nodes respectively connected with the data nodes; each of the data nodes includes: the first hash interface and the second hash interface are connected with each other; the first hash interface is used for acquiring target data and dimension data corresponding to the target data; performing hash calculation on the target data by using a preset hash algorithm so as to obtain a hash value corresponding to the target data; the second hash interface is used for generating a bitmap array for the dimensional data corresponding to the target data by using a preset bitmap algorithm; each bit in the bitmap array corresponds to a dimension element to be subjected to base number statistics; correspondingly storing the hash value corresponding to the target data and the bitmap array generated for the dimensional data corresponding to the target data by using a preset hash table; sending the hash table to a preset computing node; the computing node is used for receiving hash tables respectively sent by a plurality of data nodes; and combining the hash tables respectively sent by the data nodes, and performing radix statistics processing on the bitmap arrays according to bit positions in the combined hash tables.
In a sixth aspect of the present invention, there is also provided an electronic device, including a processor, a communication interface, a memory and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus; a memory for storing a computer program; and a processor configured to implement any of the above-described steps of the radix statistics method executed on the data node side or any of the above-described steps of the radix statistics method executed on the computation node side when executing a program stored in the memory.
In a seventh aspect of the present invention, there is further provided a computer-readable storage medium, which stores instructions that, when executed on a computer, cause the computer to perform any of the above-mentioned steps of the cardinality statistics method performed on a data node side, or to implement any of the above-mentioned steps of the cardinality statistics method performed on a compute node side.
In an eighth aspect of the embodiments of the present invention, there is also provided a computer program product containing instructions, which when run on a computer, causes the computer to perform any of the above-mentioned steps of the cardinality statistical method performed on the data node side, or to implement any of the above-mentioned steps of the cardinality statistical method performed on the compute node side.
The radix statistical method, the device, the system, the equipment and the computer readable storage medium provided by the embodiment of the invention provide an accurate and rapid radix statistical method in a double-layer hash mode. Further, the embodiment of the invention firstly converts the target data into the hash value so as to save the storage space of the data; then, the data of the dimension elements needing the cardinality statistics is represented in a bitmap array form, so that the data of the dimension elements cannot be lost, the effect of compressing the data amount corresponding to the dimension elements is achieved, the cardinality statistics accuracy is improved, and the data storage space can be further saved; and finally, storing the hash value of the target data and the bitmap array of the dimensional data corresponding to the target data into a hash table, and improving the data retrieval efficiency in the radix statistics process by utilizing the characteristic that the hash table can be quickly inquired, so that the radix statistics can be quickly and accurately completed.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a flow diagram of a radix statistics method performed at a data node side according to an embodiment of the present invention;
FIG. 2 is a flow diagram of a cardinality statistics method performed at a compute node side according to an embodiment of the invention;
FIG. 3 is a structural diagram of a radix statistics apparatus provided at a data node side according to an embodiment of the present invention;
FIG. 4 is a block diagram of a radix statistics apparatus disposed at a compute node side according to an embodiment of the present invention;
FIG. 5 is a block diagram of a radix statistics system according to an embodiment of the present invention;
FIG. 6 is a detailed block diagram of a radix statistics system according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating radix statistics performed by the radix statistics system according to an embodiment of the invention;
FIG. 8 is a block diagram of an electronic device according to an embodiment of the invention.
Detailed Description
In order to make the embodiment of the present invention clearer, the technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.
The embodiment of the invention provides a cardinality statistical method executed on a data node side. Fig. 1 is a flowchart illustrating a radix statistics method performed on a data node side according to an embodiment of the present invention.
Step S110, acquiring target data and dimension data corresponding to the target data.
Target data refers to the object of interest in the cardinality statistics. For example, the target data is an independent user.
Dimension data is the element information, about the object of interest, on which cardinality statistics are to be performed.
In this embodiment, the dimension data corresponding to the target data includes: a plurality of dimension values of the target data. Further, the dimension data corresponding to the target data includes a plurality of dimensions of the target data and a dimension value of each dimension.
For example: the target data may be a Unique code (Unique ID) for the user, and the Unique ID of the user may be used to represent an independent user.
For another example, the plurality of dimensions may include gender, age, region, version number, page number, and so on; the dimension values corresponding to these dimensions may include male, 20 years old, Beijing, version V1, page number P1, and so on.
If there are multiple pieces of target data, each piece of target data and its corresponding dimension data can be obtained. When a user logs in to and accesses the website through a data node, the data node stores the user's access record, which includes, but is not limited to, the user's login information and information on the web pages the user accessed. The login information includes, but is not limited to, the user's Unique ID, gender, age and region. The accessed web page information includes, but is not limited to, the version number and page number of the web page.
Step S120, performing hash calculation on the target data by using a preset hash algorithm, so as to obtain a hash value corresponding to the target data.
If there are multiple pieces of target data, hash calculation is performed on each piece of target data by using the preset hash algorithm to obtain the hash value corresponding to each piece of target data.
In this embodiment, the types of hash algorithm include, but are not limited to, non-cryptographic hash algorithms and cryptographic hash algorithms. Compared with a cryptographic hash algorithm, a non-cryptographic hash algorithm omits the extra processing required for cryptographic security, which reduces the computational complexity and the time consumed by the hash calculation. A non-cryptographic hash algorithm is therefore preferred in this embodiment.
In this embodiment, the non-cryptographic hash algorithm may be a 64-bit hash algorithm (the xxhash64 algorithm), which converts the target data into a long integer value (a long type value).
Applying the hash algorithm to the target data also compresses it. For example, when the xxhash64 algorithm converts the target data into a long type value, a 32-byte character string becomes an 8-byte long value, which is a compression ratio of 4, and the collision rate of the converted values is 0 at a scale of one billion values.
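The conversion described in step S120 can be sketched as follows. This is a minimal illustration rather than the embodiment's implementation: xxhash64 is not part of the Python standard library, so `hashlib.blake2b` with an 8-byte digest is used here as a stand-in (note that it is a cryptographic hash, unlike the non-cryptographic algorithm the embodiment prefers). The function name `hash64` and the sample Unique ID are hypothetical.

```python
import hashlib

def hash64(target: str) -> int:
    """Map a target-data string to a signed 64-bit integer (comparable to a Java long)."""
    # Assumption: blake2b with an 8-byte digest stands in for xxhash64.
    digest = hashlib.blake2b(target.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, byteorder="big", signed=True)

unique_id = "0123456789abcdef0123456789abcdef"    # hypothetical 32-character Unique ID
value = hash64(unique_id)

original_bytes = len(unique_id.encode("utf-8"))   # size of the string form
compressed_bytes = 8                              # size of one 64-bit long
compression_ratio = original_bytes / compressed_bytes
```

With a 32-byte identifier, the ratio of original size to hashed size is 32 / 8, matching the compression ratio of 4 stated above.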
Step S130, generating a bitmap array for the dimension data corresponding to the target data by using a preset bitmap algorithm, where each bit in the bitmap array maps a dimension element on which cardinality statistics are to be performed.
The bitmap array includes at least one bit.
A dimension element represents one dimension value or a combination of several dimension values, and may be set according to the cardinality statistics requirements. When a bitmap array is generated for the dimension data, a preset bit value is set at the bit mapped by each dimension element, and the bit value indicates whether that dimension element (a dimension value or a combination of dimension values) exists in the dimension data. In this way, a large amount of element data (dimension data) does not need to be compressed for storage, so no element data is lost, and the bitmap array shows exactly whether each dimension element exists in the dimension data. This is the key to accurate cardinality statistics on the dimension elements.
Specifically, if there are multiple pieces of target data, a bitmap array is generated for the dimension data corresponding to each piece of target data by using the preset bitmap algorithm.
The bitmap array comprises one or more bits, and the dimension element mapped by each bit is given by a mapping relation table.
The mapping relation table records at least one dimension element and the bit that each dimension element maps to in the bitmap array. That is, each bit of the bitmap array has its own meaning and corresponds to one element on which cardinality statistics are to be performed.
A bitmap array is generated for the dimension data corresponding to each piece of target data by using the preset bitmap algorithm and the mapping relation table set for the data node, as follows: the mapping relation table preset for the data node is queried to determine the dimension elements corresponding to the plurality of dimension values in the dimension data; then, in the bitmap array, the bits mapped by those dimension elements are set to a first bit value and the other bits are set to a second bit value, which yields the bitmap array corresponding to the dimension data. Because a dimension element is a dimension value or a combination of dimension values, the dimension elements that the dimension values of the target data correspond to may be looked up first, and the bits that those dimension elements map to may then be looked up in the mapping relation table.
The first bit value indicates that the dimension element mapped by the current bit exists in the dimension data. The first bit value may be 1.
The second bit value indicates that the dimension element mapped by the current bit does not exist in the dimension data. The second bit value may be 0.
For example, suppose that the dimension elements are combinations of dimension values, the first bit value is 1, and the second bit value is 0. After the website version is modified, if the number of independent users who browsed version V1 and page number P1 is of interest, version V1 and page number P1 can be set as a first dimension value combination; if the number of independent users who browsed version V2 and page number P7 is of interest, version V2 and page number P7 can be set as a second dimension value combination. The mapping relation table then records the dimension value combination mapped by each bit in the bitmap array: the first bit maps the first dimension value combination and the second bit maps the second dimension value combination. By querying the mapping relation table with the obtained dimension data of the target data, the dimension value combinations that the dimension values correspond to can be determined. For instance, if the dimension data contains version V1 for the version number dimension and the three values page number P1, page number P2 and page number P3 for the page number dimension, then these dimension values (version V1, page number P1, page number P2 and page number P3) correspond to the first dimension value combination (version V1 and page number P1) but not to the second dimension value combination, because version V2 and page number P7 are absent. Accordingly, in the bitmap array, the first bit, which maps the first dimension value combination, is set to 1, and the second bit, which maps the second dimension value combination, is set to 0.
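The example above can be sketched in a few lines. This is an illustrative sketch under assumptions, not the patented bitmap algorithm itself: the mapping relation table is modelled as a dictionary from bit position to a set of (dimension, value) pairs, bit positions are assumed to run contiguously from 0, and the names `MAPPING_TABLE` and `build_bitmap` are hypothetical.

```python
# Hypothetical mapping relation table: bit position -> dimension element,
# where a dimension element is a combination of (dimension, value) pairs.
MAPPING_TABLE = {
    0: {("version", "V1"), ("page", "P1")},  # first dimension value combination
    1: {("version", "V2"), ("page", "P7")},  # second dimension value combination
}

def build_bitmap(dimension_data: set, mapping_table: dict) -> list:
    """Return a list of bits; bit i is 1 only if every value of element i is present."""
    bitmap = [0] * len(mapping_table)        # second bit value (0) everywhere by default
    for bit, element in mapping_table.items():
        if element <= dimension_data:        # subset test: the whole combination exists
            bitmap[bit] = 1                  # first bit value (1)
    return bitmap

# Dimension data of one user: version V1 and page numbers P1, P2 and P3.
dimension_data = {("version", "V1"), ("page", "P1"), ("page", "P2"), ("page", "P3")}
bitmap = build_bitmap(dimension_data, MAPPING_TABLE)
```

Here the first bit is set because both version V1 and page number P1 are present, and the second bit stays 0 because version V2 and page number P7 are absent.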
Step S140, storing, in a preset hash table, the hash value corresponding to the target data in association with the bitmap array generated for the dimension data corresponding to the target data.
If there are multiple pieces of target data, then for each piece of target data, its hash value and the bitmap array generated for its dimension data are stored in association in the preset hash table.
In a Java language environment, the Trove package is used to store the hash value corresponding to the target data and the bitmap array generated for the dimension data corresponding to the target data in association in the hash table.
Specifically, big data computing frameworks mainly use the runtime environment of the Java Virtual Machine (JVM). Because Java is object-oriented, a hash table in the JVM runtime environment supports object type data by default, but not primitive type data. Primitive types include, but are not limited to, the int, double and long types.
In this embodiment, the Trove package provides the function of a custom hash table. In a Java language environment, a basic data structure, which may be denoted DoubleLevelHashData, is constructed through the Trove package. The basic data structure includes, but is not limited to, a custom hash table TLongObjectMap<Data, byte[]>. The basic data structure may further include the mapping relation table set for the data node.
Data represents the key in the hash table TLongObjectMap. In this embodiment, Data is defined to store the hash value of the target data, which may be a long type value.
byte[] represents the value associated with the key in the hash table TLongObjectMap. In this embodiment, byte[] is defined to store the bitmap array, that is, to mark the dimension elements that exist in the dimension data.
Further, in this embodiment, the basic data structure constructed with the Trove package replaces the Java native hash table, so that the hash table can support primitive type data in a Java language environment. The Trove package uses open addressing, which reduces the overhead of linked-list references, and it avoids the additional storage space (such as index values) incurred by boxing primitive type data into object type data. With a load factor of 0.75, the compression ratio can reach 2.16.
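To illustrate the idea of open addressing with primitive keys, the following toy map stores 64-bit keys in a flat typed array and resolves collisions by probing the next slot instead of following linked-list nodes. It is only a sketch of the general technique: it is not Trove's implementation, it never resizes, and the class name `LongToBitmapMap` and its methods are hypothetical.

```python
from array import array

class LongToBitmapMap:
    """Toy open-addressing map from signed 64-bit keys to bitmap bytes (linear probing)."""

    def __init__(self, capacity: int = 16):
        self.keys = array("q", [0] * capacity)  # flat array of signed 64-bit key slots
        self.used = bytearray(capacity)         # 1 where a slot is occupied
        self.values = [None] * capacity         # bitmap stored for each occupied slot
        self.size = 0

    def _slot(self, key: int) -> int:
        i = key % len(self.keys)                # Python's % is non-negative for a positive modulus
        while self.used[i] and self.keys[i] != key:
            i = (i + 1) % len(self.keys)        # collision: probe the next slot
        return i

    def put(self, key: int, bitmap: bytes) -> None:
        i = self._slot(key)
        if not self.used[i]:
            # Always keep one slot free so that probing terminates.
            if self.size + 1 == len(self.keys):
                raise OverflowError("table full; a real implementation would resize")
            self.used[i] = 1
            self.keys[i] = key
            self.size += 1
        self.values[i] = bitmap

    def get(self, key: int):
        i = self._slot(key)
        return self.values[i] if self.used[i] else None

table = LongToBitmapMap()
table.put(-42, b"\x01")   # -42 and -26 both start at slot 6 when the capacity is 16
table.put(-26, b"\x02")   # so this key is placed in the next free slot
```

Because the keys live in one contiguous array, no per-entry node or boxed key object is allocated, which is the kind of saving the paragraph above attributes to the Trove package.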
Step S150, sending the hash table to a preset computing node, so that the computing node performs cardinality statistics bit by bit on the bitmap arrays in the received hash table.
In this embodiment, in order to reduce duplicate data in the hash table, it may be queried whether the same hash value exists in the hash table before sending the hash table to the computing node; and if the same hash value exists in the hash table, performing aggregation processing on a plurality of bitmap arrays corresponding to the same hash value.
The same hash value includes: conflicting hash values obtained by hashing different target data, and/or identical hash values obtained by hashing the same target data.
The plurality of bitmap arrays corresponding to the same hash value may be aggregated by performing an OR operation on the corresponding bits of those bitmap arrays.
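The OR aggregation of two bitmap arrays stored under the same hash value can be sketched as below. The class and method names are hypothetical, and the sketch assumes both arrays have the same length.

```java
/** Illustrative helper: aggregates two equally sized bitmap arrays with a bitwise OR. */
final class BitmapOr {
    /** Returns a new array whose bits are the OR of the corresponding bits of a and b. */
    static byte[] or(byte[] a, byte[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("bitmap arrays must have the same length");
        }
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = (byte) (a[i] | b[i]);
        }
        return out;
    }
}
```

Because OR is idempotent, aggregating the same bitmap array twice leaves the marks unchanged, so a dimension element is never counted twice for one hash value.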
The embodiment of the invention provides an accurate and fast cardinality statistics method based on a double-layer hash mode. First, the target data is converted into a hash value to save storage space. Then, the dimension elements on which cardinality statistics are to be performed are represented as a bitmap array, so that no dimension element data is lost while the corresponding data volume is compressed; this improves the accuracy of the cardinality statistics and further saves storage space. Finally, the hash value of the target data and the bitmap array of the corresponding dimension data are stored in a hash table, whose fast lookup improves data retrieval efficiency so that the cardinality statistics can be completed quickly.
In a distributed environment, a plurality of nodes form a communication network, and a single node reflects only local characteristics, not global ones. For example, users in area A access the website through data node 1, and users in area B access the website through data node 2. Embodiments of the present invention therefore define data nodes and computing nodes in the distributed environment: the hash tables of the data nodes are merged, and cardinality statistics are performed on the merged hash table, so that the result reflects global characteristics. Of course, if only local cardinality statistics are desired, bitwise cardinality statistics may be performed on the bitmap arrays in the hash table of the data node concerned to obtain the result.
In a distributed environment, aiming at the above cardinality statistical method executed on the data node side, the embodiment of the present invention further provides a cardinality statistical method executed on the compute node side. Fig. 2 is a flowchart illustrating a radix statistics method performed at a compute node according to an embodiment of the present invention.
Step S210, receiving hash tables respectively sent by the plurality of data nodes.
Each hash table stores a hash value and the corresponding bitmap array. The hash value is obtained by performing hash calculation on target data by using a preset hash algorithm, and the bitmap array is generated for the dimension data corresponding to the target data by using a preset bitmap algorithm. Each bit in the bitmap array maps a dimension element on which cardinality statistics are to be performed.
If there are multiple target data on the data node side, the hash table stores multiple groups of hash values and bitmap arrays. In each group, the hash value is obtained by performing hash calculation on one of the target data by using the preset hash algorithm, and the bitmap array is generated for the dimension data corresponding to that target data by using the preset bitmap algorithm.
A dimension element represents one dimension value or a combination of several dimension values.
A bitmap array includes at least one bit, and each bit maps one dimension value or one dimension value combination. The dimension value or dimension value combination mapped by each bit is a dimension element on which cardinality statistics are to be performed.
Step S220, merging the hash tables respectively sent by the multiple data nodes, and performing bitwise cardinality statistics on the multiple bitmap arrays in the merged hash table.
Since the hash tables from the multiple data nodes share the same data structure, namely TLongObjectMap<Data, byte[]>, they can be merged into one total hash table or merged in batches into several hash tables.
Merging in batches into several hash tables means dividing the hash tables into several mutually disjoint hash table sets. For example, the hash tables from data nodes 1 to 3 form one hash table set, and the hash tables from data nodes 4 to 6 form another. For each hash table set, the hash tables in the set are merged, and bitwise cardinality statistics are performed on the bitmap arrays in the merged hash table to obtain the cardinality statistics result for that set. After the results for all hash table sets are obtained, they are aggregated to obtain the final cardinality statistics result.
Furthermore, the same target data may have dimension data at different data nodes, for example when the same user logs in to the same website through different data nodes. The merged hash table can therefore be queried for identical hash values. If identical hash values exist, an aggregation operation is performed on the bitmap arrays corresponding to the same hash value to obtain one aggregated bitmap array for that hash value. Bitwise cardinality statistics are then performed, in the merged hash table, on the aggregated bitmap arrays together with the bitmap arrays of the remaining hash values. The aggregation operation may be an OR operation.
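Merging several hash tables while OR-aggregating the bitmap arrays of identical hash values can be sketched as follows. This is an assumption-laden illustration: java.util.HashMap<Long, byte[]> stands in for the Trove-based table so that the block has no external dependency, the class name TableMerger is hypothetical, and all bitmap arrays are assumed to be already aligned to the same length.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative merge of per-node tables; HashMap is a stand-in for the Trove-based table. */
final class TableMerger {
    /** Merges all tables; bitmap arrays stored under the same hash value are OR-ed together. */
    static Map<Long, byte[]> merge(List<Map<Long, byte[]>> tables) {
        Map<Long, byte[]> merged = new HashMap<>();
        for (Map<Long, byte[]> table : tables) {
            for (Map.Entry<Long, byte[]> e : table.entrySet()) {
                // clone() keeps the input tables untouched
                merged.merge(e.getKey(), e.getValue().clone(), TableMerger::or);
            }
        }
        return merged;
    }

    private static byte[] or(byte[] a, byte[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("bitmap arrays must be aligned before merging");
        }
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = (byte) (a[i] | b[i]);
        }
        return out;
    }
}
```

After the merge, each hash value appears exactly once, so the subsequent bitwise summation counts each target data at most once per dimension element.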
Performing cardinality statistics bit by bit includes: summing the values of the corresponding bit across the plurality of bitmap arrays to obtain the cardinality statistics result for that bit, which is the cardinality statistics result of the dimension element mapped by that bit.
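The bitwise summation can be sketched as below. The bit layout is an assumption of the sketch (bit k is stored in byte k / 8 at position k % 8, counted from the least significant bit), and the class and method names are hypothetical.

```java
import java.util.Collection;

/** Illustrative bitwise summation over a collection of bitmap arrays. */
final class BitwiseCounter {
    /**
     * Returns counts[k] = number of bitmap arrays in which bit k is set.
     * Assumed layout: bit k lives in byte k / 8 at position k % 8 (LSB first).
     */
    static long[] countPerBit(Collection<byte[]> bitmaps, int bitCount) {
        long[] counts = new long[bitCount];
        for (byte[] bitmap : bitmaps) {
            for (int k = 0; k < bitCount; k++) {
                counts[k] += (bitmap[k >> 3] >> (k & 7)) & 1;
            }
        }
        return counts;
    }
}
```

For three bitmap arrays with low bits 01, 10, and 11, the call returns a count of 2 for bit 0 and 2 for bit 1.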
In this embodiment, a mapping relation table is set for each data node. If different data nodes have different mapping relation tables, the mapping relation table of each data node must be obtained before the hash tables are merged. According to these mapping relation tables, the bitmap arrays in the hash tables sent by the data nodes are aligned, so that all bitmap arrays have the same number of bits and the bits at corresponding positions map the same dimension elements.
For example, the mapping relation table of data node 1 records one bit, which maps version V1; the mapping relation table of data node 2 records two bits, the first mapping version V1 and the second mapping version V2. During alignment, one bit may be appended as the second bit to each bitmap array in the hash table from data node 1 and mapped to version V2, so that the bitmap arrays from data node 1 and data node 2 have the same number of bits and each bit maps the same dimension element.
In this embodiment, the data node side compresses the dimension elements into the bitmap array, which marks whether each dimension element is present in the dimension data. No dimension element is lost, and the dimension elements that appear in the dimension data can be read exactly from the bitmap array, so the computing node side can perform cardinality statistics with high accuracy based on the bitmap arrays.
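The alignment step can be sketched as a remapping from a node-local bit layout to a shared layout. The sketch assumes that each mapping relation table is available as a map from dimension element name to bit index, that the shared table contains every local element, and that shared bit indices run from 0 to size - 1; all names are hypothetical.

```java
import java.util.Map;

/** Illustrative alignment of a node-local bitmap array to a shared bit layout. */
final class BitmapAligner {
    /**
     * Copies every set bit of local to the position that the same dimension
     * element has in globalBits. Bit k is assumed to live in byte k / 8 at
     * position k % 8 (LSB first).
     */
    static byte[] align(byte[] local, Map<String, Integer> localBits, Map<String, Integer> globalBits) {
        byte[] out = new byte[(globalBits.size() + 7) / 8];
        for (Map.Entry<String, Integer> e : localBits.entrySet()) {
            int from = e.getValue();
            boolean isSet = ((local[from >> 3] >> (from & 7)) & 1) == 1;
            if (isSet) {
                Integer to = globalBits.get(e.getKey());
                if (to == null) {
                    throw new IllegalArgumentException("element missing from shared table: " + e.getKey());
                }
                out[to >> 3] |= (byte) (1 << (to & 7));
            }
        }
        return out;
    }
}
```

In the example above, a bitmap array from data node 1 with only the V1 bit set becomes a two-bit array whose V2 bit is 0, matching the layout used by data node 2.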
Cardinality statistics based on the bitmap algorithm can only be performed on Int-type data. Without the double-layer hash mode of the embodiment of the present invention, a global data dictionary would have to be maintained between nodes to record the Int-type element data corresponding to string-type element data; string-type data would be converted into Int-type data through the dictionary, and cardinality statistics would then be performed on the converted data. In the embodiment of the present invention, no global data dictionary needs to be maintained between the data nodes and the computing node, and the dimension elements of target data on the order of 100 million records can be counted without one. This omits the step of querying the global data dictionary to convert data into Int type, and avoids the loss of accuracy caused by a global data dictionary that is not maintained in time. The double-layer hash mode thus achieves accurate cardinality statistics while using little storage space. Further, the cardinality statistics method of this embodiment may be applied in big data scenarios, for either offline or online cardinality statistics.
The embodiment of the invention provides a cardinality statistical device arranged on a data node side. Fig. 3 is a block diagram of a radix statistics apparatus disposed at a data node side according to an embodiment of the present invention.
A radix statistic device provided on a data node side includes: an obtaining module 310, a first hashing module 320, a bitmap generating module 330, a second hashing module 340, and a sending module 350.
The obtaining module 310 is configured to obtain target data and dimension data corresponding to the target data.
The first hash module 320 is configured to perform hash calculation on the target data respectively by using a preset hash algorithm, so as to obtain hash values corresponding to the target data respectively.
The bitmap generation module 330 is configured to generate a bitmap array for the dimension data corresponding to each target data by using a preset bitmap algorithm, where each bit in the bitmap array corresponds to a dimension element on which cardinality statistics are to be performed.
The second hash module 340 is configured to utilize a preset hash table to correspondingly store the hash value corresponding to the target data and the bitmap array generated for the dimensional data corresponding to the target data.
The sending module 350 is configured to send the hash table to a preset computing node, so that the computing node performs bitwise cardinality statistics on the bitmap arrays in the received hash table.
The functions of the apparatus according to the embodiment of the present invention have been described in the above method embodiments, so that reference may be made to the related descriptions in the foregoing embodiments for details which are not described in the embodiment of the present invention, and further details are not described herein.
The embodiment of the invention also provides a cardinality statistics apparatus arranged on the computing node side. Fig. 4 is a block diagram of a cardinality statistics apparatus arranged on the computing node side according to an embodiment of the present invention.
The cardinality statistics apparatus arranged on the computing node side includes: a receiving module 410 and a statistics module 420.
The receiving module 410 is configured to receive the hash tables respectively sent by multiple data nodes. Each hash table stores a hash value and the corresponding bitmap array; the hash value is obtained by performing hash calculation on target data by using a preset hash algorithm, and the bitmap array is generated for the dimension data corresponding to the target data by using a preset bitmap algorithm; each bit in the bitmap array corresponds to a dimension element on which cardinality statistics are to be performed.
The statistics module 420 is configured to merge the hash tables respectively sent by the multiple data nodes, and to perform bitwise cardinality statistics on the multiple bitmap arrays in the merged hash table.
The functions of the apparatus according to the embodiment of the present invention have been described in the above method embodiments, so that reference may be made to the related descriptions in the foregoing embodiments for details which are not described in the embodiment of the present invention, and further details are not described herein.
The embodiment of the invention also provides a cardinality statistics system. Fig. 5 is a block diagram of a cardinality statistics system according to an embodiment of the invention.
The cardinality statistics system includes: a plurality of data nodes 510 and a computing node 520. The computing node 520 is connected to each of the data nodes 510. Each data node 510 includes a first hash interface 511 and a second hash interface 512 connected to each other. Fig. 5 shows the structure of only one data node 510.
The first hash interface 511 is configured to obtain target data and dimension data corresponding to the target data; and respectively carrying out hash calculation on the target data by utilizing a preset hash algorithm so as to respectively obtain hash values corresponding to the target data. Further, the first hash interface 511 may be an xxhash64 algorithm conversion service interface, and the first hash interface 511 may convert data of various data types (e.g. string-type data) into Long-type values (8 bytes). Preliminary compression of the data may be achieved through the first hash interface 511.
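The conversion of string-type data into an 8-byte Long-type value can be sketched as follows. Note the substitution: xxhash64 requires a third-party library, so this sketch swaps in 64-bit FNV-1a, another non-cryptographic hash, purely to stay self-contained. It illustrates the string-to-long step and is not the algorithm named above.

```java
import java.nio.charset.StandardCharsets;

/** Stand-in for the hashing step: 64-bit FNV-1a (not xxhash64) over the UTF-8 bytes of a string. */
final class Fnv1a64 {
    private static final long OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long PRIME = 0x100000001b3L;

    /** Maps a string of any length to a fixed 8-byte long value. */
    static long hash(String s) {
        long h = OFFSET_BASIS;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);  // mix in one byte
            h *= PRIME;       // multiply modulo 2^64 (long overflow wraps)
        }
        return h;
    }
}
```

Whatever hash is used, the resulting long can serve directly as the key of the hash table, so an identifier of arbitrary length costs 8 bytes of key storage.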
A second hash interface 512, configured to generate, by using a preset bitmap algorithm, bitmap arrays for the dimensional data corresponding to the target data respectively; each bit in the bitmap array corresponds to a dimension element to be subjected to base number statistics; correspondingly storing the hash value corresponding to the target data and the bitmap array generated for the dimensional data corresponding to the target data by using a preset hash table; and sending the hash table to a preset computing node 520.
The second hash interface 512 is implemented based on the hash table TLongObjectMap: the hash value corresponding to the target data is the key in the hash table, and the bitmap array generated for the dimension data corresponding to the target data is the value associated with that key. The second hash interface 512 further compresses the data, and the hash table it maintains supports fast lookup of whether a key is present.
The computing node 520 is configured to receive the hash tables respectively sent by the plurality of data nodes, merge them, and perform bitwise cardinality statistics on the bitmap arrays in the merged hash table.
Further, as shown in fig. 6, the data node 510 may further include a mapping module 513, and the computing node 520 may include a data merging module 521 and a data aggregation module 522.
The mapping module 513 is configured to store the mapping relation table set for the data node 510. The bit to which a dimension element is mapped in the bitmap array can be queried from the mapping module 513.
The data merging module 521 is configured to obtain the mapping relation table corresponding to each data node, where the mapping relation table records at least one dimension element and the bit to which each dimension element is mapped in a preset bitmap array; and, according to the mapping relation table corresponding to each data node, to align the bitmap arrays in the hash tables respectively sent by the data nodes, so that the bitmap arrays in the hash tables have the same number of bits and the bits at corresponding positions map the same dimension elements. Further, after the bitmap arrays in the hash tables TLongObjectMap sent by the multiple data nodes are aligned, the hash tables TLongObjectMap are merged to obtain a new DoubleLevelHashData.
The data aggregation module 522 is configured to perform bitwise cardinality statistics on the plurality of bitmap arrays in the merged hash table. Further, the data aggregation module 522 may perform an aggregation calculation on the DoubleLevelHashData newly generated by the data merging module 521. The aggregation calculation includes: traversing all dimension elements and obtaining the bit mapped by each dimension element; and, for each dimension element, traversing the mark at that bit in all bitmap arrays in the TLongObjectMap and counting the marks equal to 1 as the cardinality statistics result of the dimension element.
For example: fig. 7 is a schematic diagram illustrating cardinality statistics performed by the cardinality statistics system according to an embodiment of the invention. The cardinality statistical system comprises: two data nodes 510 and one compute node 520.
The independent user access volume in the AB experiment is determined based on the cardinality statistics system. The AB experiment is to perform different improvements on one page (page number P1) in the website in advance, wherein the improvement 1 corresponds to the version V1, and the improvement 2 corresponds to the version V2. Users in different areas access the website through two data nodes 510 (data node 1 and data node 2), respectively. Both data nodes 510 may return different versions of the page for different users.
The following description takes the data node 1 as an example:
The data node 510 constructs a hash table TLongObjectMap with the Trove package, in which KEYS stores a plurality of keys (hash values) and VALUES stores the corresponding values, which are bitmap arrays.
The data node 510 sets a mapping relation table and stores it in the mapping module 513; the table specifies that each bit in the bitmap array maps one dimension value combination. In FIG. 6, the first bit of the bitmap array maps the dimension value combination of version V1 and page number P1, the second bit maps the dimension value combination of version V2 and page number P1, and the remaining bits map other dimension value combinations.
The first hash interface 511 of the data node 510 acquires a plurality of target data and the dimension data corresponding to each target data, and performs hash calculation on each target data in turn; the second hash interface 512 stores the calculated hash values in the hash table. Take the first two target data as examples. The first hash interface 511 performs hash calculation on the Unique ID "xxxxxx", and the second hash interface 512 stores the obtained hash value "100001" as the first key. The dimension data of the Unique ID "xxxxxx" is version V1 and page number P1; by querying the mapping relation table, the second hash interface 512 determines that version V1 and page number P1 map to the first bit, looks up the bitmap array corresponding to the first key in the TLongObjectMap, and marks 1 in its first bit and 0 in its second bit. The first hash interface 511 then performs hash calculation on the Unique ID "yyyyyy", and the second hash interface 512 stores the obtained hash value "200090" as the second key. The dimension data of the Unique ID "yyyyyy" is version V2 and page number P1; by querying the mapping relation table, the second hash interface 512 determines that version V2 and page number P1 map to the second bit, looks up the bitmap array corresponding to the second key in the TLongObjectMap, and marks 0 in its first bit and 1 in its second bit.
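The marking procedure in this example can be sketched as below. java.util.HashMap<Long, byte[]> again stands in for the TLongObjectMap, the element names "V1|P1" and "V2|P1" are a hypothetical encoding of the dimension value combinations, and the hash values 100001 and 200090 are taken from the example rather than computed.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative marking of dimension elements in per-key bitmap arrays. */
final class BitmapMarker {
    /** Sets, in the bitmap array stored under hashValue, the bit mapped to the given dimension element. */
    static void mark(Map<Long, byte[]> table, long hashValue, String element,
                     Map<String, Integer> mapping) {
        Integer bit = mapping.get(element);
        if (bit == null) {
            throw new IllegalArgumentException("unmapped dimension element: " + element);
        }
        // Create an all-zero bitmap array on first sight of this hash value.
        byte[] bitmap = table.computeIfAbsent(hashValue, k -> new byte[(mapping.size() + 7) / 8]);
        bitmap[bit >> 3] |= (byte) (1 << (bit & 7));
    }

    /** Reproduces the two entries of the example: first bit for V1/P1, second bit for V2/P1. */
    static Map<Long, byte[]> example() {
        Map<String, Integer> mapping = new HashMap<>();
        mapping.put("V1|P1", 0);  // first bit
        mapping.put("V2|P1", 1);  // second bit
        Map<Long, byte[]> table = new HashMap<>();
        mark(table, 100001L, "V1|P1", mapping);  // Unique ID "xxxxxx"
        mark(table, 200090L, "V2|P1", mapping);  // Unique ID "yyyyyy"
        return table;
    }
}
```

If the same Unique ID later appears with another dimension value combination, a further call to mark sets one more bit in the existing bitmap array instead of creating a new entry.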
The second hash interface 512 of the data node 510 sends the TLongObjectMap to the compute node after storing the hash values and the bitmap array corresponding to the plurality of target data into the TLongObjectMap.
Data node 2 performs the same process as data node 1, which is not repeated here.
The computing node 520 receives the TLongObjectMap sent by each of the two data nodes 510. The data merging module 521 of the computing node 520 compares the mapping relation tables of the two data nodes 510 and determines that the bitmap arrays of the two TLongObjectMaps have the same number of bits and that corresponding bits map the same dimension value combinations, so the data merging module 521 can merge the two TLongObjectMaps directly.
The data aggregation module 522 of the computing node 520 traverses each bitmap array in VALUES and sums the values of corresponding bits. The sum for a bit is the cardinality statistics result for the dimension value combination mapped by that bit. That is, the sum for the first bit is the number of independent users who accessed version V1 of page number P1, and the sum for the second bit is the number of independent users who accessed version V2 of page number P1.
In cardinality statistics, the efficiency of determining whether a dimension element is present determines the overall efficiency. This embodiment queries and stores the dimension element data through a hash table, so the access time complexity for a dimension element is O(1) and the cardinality statistics are efficient.
Cardinality statistics are often performed on many dimension elements at once, and storing each dimension element directly in a hash table would occupy too much storage space. This embodiment therefore uses a byte-array structure and, following the BitMap principle, stores in each bit of the array a mark of whether the corresponding dimension element is present. This effectively compresses the data volume of the dimension elements: data on the order of 100 million records with 128 dimension values occupies only 4.4 GB of storage.
In this embodiment, compressing the dimension elements into the bitmap array means marking in the bitmap array whether each dimension element is present in the dimension data, so no dimension element data is lost and cardinality statistics based on the bitmap array are highly accurate. This embodiment enables accurate cardinality statistics for AB experiments, daily active users (DAU), deduplicated impression counts, and other application scenarios that require accurate cardinality statistics.
An embodiment of the present invention further provides an electronic device, as shown in fig. 8, including a processor 810, a communication interface 820, a memory 830, and a communication bus 840, where the processor 810, the communication interface 820, and the memory 830 complete mutual communication through the communication bus 840.
A memory 830 for storing computer programs.
The processor 810 is configured to implement the above-described steps of the radix statistics method executed on the data node side or implement the above-described steps of the radix statistics method executed on the compute node side when executing the program stored in the memory 830.
When executing the program stored in the memory 830, the processor 810 implements the steps of the cardinality statistics method executed on the data node side, including: acquiring target data and the dimension data corresponding to each target data; performing hash calculation on each target data by using a preset hash algorithm to obtain the hash value corresponding to each target data; generating a bitmap array for the dimension data corresponding to each target data by using a preset bitmap algorithm, where each bit in the bitmap array maps a dimension element on which cardinality statistics are to be performed; storing, in a preset hash table, the hash value corresponding to each target data together with the bitmap array generated for its dimension data; and sending the hash table to a preset computing node, so that the computing node performs bitwise cardinality statistics on the bitmap arrays in the received hash table.
The dimension data corresponding to the target data includes a plurality of dimension values of the target data. Generating the bitmap arrays for the dimension data corresponding to the target data by using the preset bitmap algorithm includes: querying the mapping relation table preset for the data node to determine the dimension elements corresponding to the plurality of dimension values in the dimension data, where the mapping relation table records at least one dimension element and the bit to which each dimension element is mapped in the bitmap array; and, in the bitmap array, marking the bits mapped by those dimension elements with a first bit value and the other bits with a second bit value, thereby generating the bitmap array corresponding to the dimension data.
After the hash value corresponding to the target data and the bitmap array generated for its dimension data are stored in the preset hash table, and before the hash table is sent to the preset computing node, the method further includes: querying whether identical hash values exist in the hash table; and, if identical hash values exist, aggregating the plurality of bitmap arrays corresponding to the same hash value.
Storing the hash value corresponding to the target data and the bitmap array corresponding to its dimension data in the preset hash table includes: in a Java language environment, using the Trove package to store the hash value corresponding to the target data and the bitmap array generated for its dimension data into the hash table.
The hash algorithm is a non-cryptographic hash algorithm.
When executing the program stored in the memory 830, the processor 810 implements the steps of the cardinality statistics method executed on the computing node side, including: receiving the hash tables respectively sent by a plurality of data nodes, where each hash table stores a hash value and the corresponding bitmap array, the hash value is obtained by performing hash calculation on target data by using a preset hash algorithm, the bitmap array is generated for the dimension data corresponding to the target data by using a preset bitmap algorithm, and each bit in the bitmap array maps a dimension element on which cardinality statistics are to be performed; and merging the hash tables respectively sent by the data nodes, and performing bitwise cardinality statistics on the bitmap arrays in the merged hash table.
Before the plurality of hash tables are merged, the method further includes: acquiring the mapping relation table corresponding to each data node, where the mapping relation table records at least one dimension element and the bit to which each dimension element is mapped in a preset bitmap array; and, according to the mapping relation table corresponding to each data node, aligning the bitmap arrays in the hash tables respectively sent by the data nodes, so that the bitmap arrays in the hash tables have the same number of bits and the bits at corresponding positions map the same dimension elements.
Performing bitwise cardinality statistics on the plurality of bitmap arrays in the merged hash table includes: querying the merged hash table for identical hash values; if identical hash values exist, performing an aggregation operation on the plurality of bitmap arrays corresponding to the same hash value to obtain an aggregated bitmap array for that hash value; and performing, in the merged hash table, bitwise cardinality statistics on the aggregated bitmap arrays together with the bitmap arrays corresponding to the remaining hash values.
Merging the hash tables respectively sent by the data nodes and performing bitwise cardinality statistics on the bitmap arrays in the merged hash table includes: dividing the plurality of hash tables into a plurality of mutually disjoint hash table sets; for each hash table set, merging the hash tables in the set and performing bitwise cardinality statistics on the bitmap arrays in the merged hash table to obtain the cardinality statistics result for that set; and, after the results for all hash table sets are obtained, aggregating them to obtain the final cardinality statistics result.
The communication bus of the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the bus is drawn as a single thick line in the figure, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The Memory may include a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the device can also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present invention, a computer-readable storage medium is further provided, which stores instructions that, when executed on a computer, cause the computer to perform any of the above-described steps of the cardinality statistical method performed on a data node side or the steps of the cardinality statistical method performed on a compute node side.
In yet another embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the steps of the cardinality statistics method performed on the data node side or the cardinality statistics method performed on the compute node side described in any of the above embodiments.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiment is substantially similar to the method embodiment, its description is relatively brief; for the relevant details, refer to the corresponding description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (13)

1. A cardinality statistics method, characterized in that the steps executed on a data node side comprise:
acquiring target data and dimension data corresponding to the target data;
performing hash calculation on the target data by using a preset hash algorithm to obtain a hash value corresponding to the target data;
generating a bitmap array for the dimension data corresponding to the target data by using a preset bitmap algorithm, wherein each bit in the bitmap array is mapped to a dimension element on which cardinality statistics are to be performed;
storing, in a preset hash table, the hash value corresponding to the target data in association with the bitmap array generated for the dimension data corresponding to the target data; and
sending the hash table to a preset computing node, so that the computing node performs cardinality statistics bit by bit on the bitmap arrays in the received hash table.
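By way of illustration only, the data-node steps of claim 1 (hash the target data, build a bitmap for its dimension data, store the pair in a hash table) might be sketched in Python as follows. The `BIT_OF` mapping, the element names, the use of truncated MD5 as the "preset hash algorithm", and the representation of a bitmap array as an integer are hypothetical choices made for the sketch; the claim does not fix any of them. The final sending step is omitted.

```python
import hashlib

# Hypothetical mapping: dimension element -> bit position in the bitmap array.
BIT_OF = {
    "platform=ios": 0,
    "platform=android": 1,
    "region=north": 2,
    "region=south": 3,
}


def hash_target(target):
    """Return a 64-bit hash of the target data (first 8 bytes of MD5).
    Any stable hash function could stand in for the preset hash algorithm."""
    digest = hashlib.md5(target.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def make_bitmap(dimension_elements):
    """Build a bitmap as an int: bit i is 1 if the element mapped to i is present."""
    bitmap = 0
    for element in dimension_elements:
        bitmap |= 1 << BIT_OF[element]
    return bitmap


def build_hash_table(records):
    """Store hash value -> bitmap for every (target, dimension elements) record.

    Assumption: each target occurs once. A repeated target would overwrite the
    earlier entry here; combining duplicates is a separate step.
    """
    table = {}
    for target, elements in records:
        table[hash_target(target)] = make_bitmap(elements)
    return table


example_table = build_hash_table([
    ("user-1", ["platform=ios", "region=north"]),
    ("user-2", ["platform=android"]),
])
```

Keying the table by a fixed-width hash rather than by the raw target data keeps entries compact, at the cost of a small collision probability.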
2. The method of claim 1, wherein the dimension data corresponding to the target data comprises a plurality of dimension values of the target data;
and generating a bitmap array for the dimension data corresponding to the target data by using a preset bitmap algorithm comprises:
querying a mapping relation table preset for the data node, and determining the dimension elements respectively corresponding to the plurality of dimension values in the dimension data, wherein the mapping relation table records at least one dimension element and the bit to which each dimension element is mapped in the bitmap array; and
setting, in the bitmap array, the bits mapped to the dimension elements corresponding to the plurality of dimension values to a first bit value and the other bits to a second bit value, so as to obtain the bitmap array corresponding to the dimension data.
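By way of illustration only, the lookup in a mapping relation table described in claim 2 might be sketched as follows. The table contents, the pairing of a dimension name with a dimension value as the "dimension element", and the choice of 1 and 0 as the first and second bit values are assumptions for the sketch.

```python
# Hypothetical mapping relation table: dimension element -> bit position.
# Here a dimension element is modelled as a (dimension, value) pair.
MAPPING_TABLE = {
    ("platform", "ios"): 0,
    ("platform", "android"): 1,
    ("region", "north"): 2,
    ("region", "south"): 3,
}

FIRST_BIT_VALUE = 1   # assumed: marks a dimension element that is present
SECOND_BIT_VALUE = 0  # assumed: marks every other bit


def bitmap_for(dimension_data):
    """Return the bitmap array (a list of bit values) for one record's dimension data."""
    width = max(MAPPING_TABLE.values()) + 1
    bitmap = [SECOND_BIT_VALUE] * width
    for dimension, value in dimension_data.items():
        bitmap[MAPPING_TABLE[(dimension, value)]] = FIRST_BIT_VALUE
    return bitmap


example_bitmap = bitmap_for({"platform": "android", "region": "north"})
```

A dimension value that is missing from the table raises `KeyError` in this sketch; how unknown values are handled is not specified by the claim.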
3. The method according to claim 1, wherein after storing, in the preset hash table, the hash value corresponding to the target data in association with the bitmap array generated for the dimension data corresponding to the target data, and before sending the hash table to the preset computing node, the method further comprises:
querying whether identical hash values exist in the hash table; and
if identical hash values exist in the hash table, aggregating the plurality of bitmap arrays corresponding to the identical hash value.
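By way of illustration only, the local aggregation of claim 3 might be sketched as follows, assuming that "aggregation" means a bitwise OR of the bitmap arrays that share a hash value (the claim does not name the operation) and that bitmap arrays are represented as integers.

```python
def aggregate_duplicates(entries):
    """Collapse (hash value, bitmap) pairs so each hash value appears once.

    Bitmaps that share a hash value are combined with bitwise OR, so a bit is
    set in the result if it was set in any of the duplicates.
    """
    table = {}
    for hash_value, bitmap in entries:
        table[hash_value] = table.get(hash_value, 0) | bitmap
    return table


# Hash value 7 occurs twice with different dimension bits set.
example_aggregated = aggregate_duplicates([(7, 0b0001), (9, 0b0100), (7, 0b0100)])
```

Aggregating on the data node before sending reduces the number of entries that the computing node has to receive and merge.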
4. The method according to any one of claims 1 to 3, wherein storing, in the preset hash table, the hash value corresponding to the target data in association with the bitmap array generated for the dimension data corresponding to the target data comprises:
in a Java language environment, using the Trove package to store the hash value corresponding to the target data and the bitmap array generated for the dimension data corresponding to the target data in the hash table.
5. A cardinality statistics method, characterized in that the steps executed on a computing node side comprise:
receiving hash tables respectively sent by a plurality of data nodes, wherein each hash table stores a hash value in association with a bitmap array; the hash value is obtained by performing hash calculation on target data by using a preset hash algorithm, and the bitmap array is generated for dimension data corresponding to the target data by using a preset bitmap algorithm; each bit in the bitmap array is mapped to a dimension element on which cardinality statistics are to be performed; and
merging the hash tables respectively sent by the plurality of data nodes, and performing cardinality statistics bit by bit on the plurality of bitmap arrays in the merged hash table.
6. The method according to claim 5, wherein performing cardinality statistics bit by bit on the plurality of bitmap arrays in the merged hash table comprises:
querying whether identical hash values exist in the merged hash table;
if identical hash values exist in the merged hash table, aggregating the plurality of bitmap arrays corresponding to the identical hash value to obtain an aggregated bitmap array corresponding to that hash value; and
in the merged hash table, performing cardinality statistics bit by bit on the aggregated bitmap array corresponding to the identical hash value and on the bitmap arrays respectively corresponding to the remaining hash values.
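By way of illustration only, the computing-node steps of claims 5 and 6 might be sketched as follows. The sketch assumes that bitmap arrays are integers, that aggregation is a bitwise OR, and that "cardinality statistics bit by bit" means counting, for each bit position, the number of distinct hash values whose bitmap has that bit set. None of these details is fixed by the claims.

```python
def merge_tables(tables):
    """Merge hash tables from several data nodes into one table.
    Bitmaps stored under the same hash value are combined with bitwise OR."""
    merged = {}
    for table in tables:
        for hash_value, bitmap in table.items():
            merged[hash_value] = merged.get(hash_value, 0) | bitmap
    return merged


def count_per_bit(merged, width):
    """For each bit position, count the hash values whose bitmap has it set.
    counts[i] is then the cardinality for the dimension element mapped to bit i."""
    counts = [0] * width
    for bitmap in merged.values():
        for bit in range(width):
            counts[bit] += (bitmap >> bit) & 1
    return counts


node_a = {1: 0b011, 2: 0b100}
node_b = {2: 0b001, 3: 0b001}
example_merged = merge_tables([node_a, node_b])
example_counts = count_per_bit(example_merged, width=3)
```

In the example, hash value 2 is reported by both nodes, so its two bitmaps are OR-ed into `0b101` and it contributes at most once to each bit's count.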
7. The method according to claim 5, wherein merging the hash tables respectively sent by the plurality of data nodes, and performing cardinality statistics bit by bit on the plurality of bitmap arrays in the merged hash table, comprises:
dividing the plurality of hash tables into a plurality of mutually disjoint hash table sets;
for each hash table set, merging the hash tables in that set and performing cardinality statistics bit by bit on the bitmap arrays in the merged hash table, to obtain a cardinality statistics result corresponding to that set; and
after the cardinality statistics results corresponding to the plurality of hash table sets are obtained, aggregating those results to obtain a final cardinality statistics result.
8. The method according to any one of claims 5 to 7, further comprising, before merging the plurality of hash tables:
acquiring a mapping relation table corresponding to each data node, wherein the mapping relation table records at least one dimension element and the bit to which each dimension element is mapped in a preset bitmap array; and
aligning, according to the mapping relation table corresponding to each data node, the bitmap arrays in the hash tables respectively sent by the plurality of data nodes, so that the bitmap arrays in the plurality of hash tables have the same number of bits and the bits at corresponding positions are mapped to the same dimension element.
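By way of illustration only, the alignment step of claim 8 might be sketched as follows. The sketch assumes that each data node has its own mapping from dimension element to bit position, that the common layout is obtained by sorting the union of all dimension elements, and that bitmap arrays are integers. These choices, and the function names, are hypothetical.

```python
def global_layout(mapping_tables):
    """Assign one bit position to every dimension element seen on any node."""
    elements = sorted({element for mapping in mapping_tables for element in mapping})
    return {element: position for position, element in enumerate(elements)}


def align(table, node_mapping, layout):
    """Rewrite one node's bitmaps from its own bit positions to the common layout."""
    aligned = {}
    for hash_value, bitmap in table.items():
        remapped = 0
        for element, bit in node_mapping.items():
            if (bitmap >> bit) & 1:
                remapped |= 1 << layout[element]
        aligned[hash_value] = remapped
    return aligned


# Two nodes that disagree on both the elements they track and the bit order.
mapping_a = {"ios": 0, "android": 1}
mapping_b = {"android": 0, "web": 1}
layout = global_layout([mapping_a, mapping_b])  # android -> 0, ios -> 1, web -> 2

aligned_a = align({10: 0b01}, mapping_a, layout)  # node A saw "ios"
aligned_b = align({11: 0b11}, mapping_b, layout)  # node B saw "android" and "web"
```

After alignment every bitmap has the same width and the same meaning per bit, so bitmaps from different nodes can be combined and counted position by position.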
9. A cardinality statistics apparatus, provided on a data node side, comprising:
an acquisition module, configured to acquire target data and dimension data corresponding to the target data;
a first hash module, configured to perform hash calculation on the target data by using a preset hash algorithm to obtain a hash value corresponding to the target data;
a bitmap generation module, configured to generate a bitmap array for the dimension data corresponding to the target data by using a preset bitmap algorithm, wherein each bit in the bitmap array corresponds to a dimension element on which cardinality statistics are to be performed;
a second hash module, configured to store, in a preset hash table, the hash value corresponding to the target data in association with the bitmap array generated for the dimension data corresponding to the target data; and
a sending module, configured to send the hash table to a preset computing node, so that the computing node performs cardinality statistics bit by bit on the bitmap arrays in the received hash table.
10. A cardinality statistics apparatus, provided on a computing node side, comprising:
a receiving module, configured to receive hash tables respectively sent by a plurality of data nodes, wherein each hash table stores a hash value in association with a bitmap array; the hash value is obtained by performing hash calculation on target data by using a preset hash algorithm, and the bitmap array is generated for dimension data corresponding to the target data by using a preset bitmap algorithm; each bit in the bitmap array corresponds to a dimension element on which cardinality statistics are to be performed; and
a statistics module, configured to merge the hash tables respectively sent by the plurality of data nodes and to perform cardinality statistics bit by bit on the plurality of bitmap arrays in the merged hash table.
11. A cardinality statistics system, characterized by comprising:
a plurality of data nodes and a computing node, the computing node being connected to each of the data nodes;
each of the data nodes comprises a first hash interface and a second hash interface that are connected to each other;
the first hash interface is configured to acquire target data and dimension data corresponding to the target data, and to perform hash calculation on the target data by using a preset hash algorithm to obtain a hash value corresponding to the target data;
the second hash interface is configured to generate a bitmap array for the dimension data corresponding to the target data by using a preset bitmap algorithm, wherein each bit in the bitmap array corresponds to a dimension element on which cardinality statistics are to be performed; to store, in a preset hash table, the hash value corresponding to the target data in association with the bitmap array generated for the dimension data corresponding to the target data; and to send the hash table to a preset computing node;
the computing node is configured to receive the hash tables respectively sent by the plurality of data nodes, to merge the hash tables respectively sent by the plurality of data nodes, and to perform cardinality statistics bit by bit on the plurality of bitmap arrays in the merged hash table.
12. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
the memory is configured to store a computer program;
the processor is configured to carry out the method steps of any one of claims 1 to 4 or the method steps of any one of claims 5 to 8 when executing the program stored in the memory.
13. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the method steps of any one of claims 1 to 4 or the method steps of any one of claims 5 to 8.
CN202010339945.1A 2020-04-26 2020-04-26 Radix statistics method, apparatus, system, device, and computer-readable storage medium Active CN111563109B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010339945.1A CN111563109B (en) 2020-04-26 2020-04-26 Radix statistics method, apparatus, system, device, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010339945.1A CN111563109B (en) 2020-04-26 2020-04-26 Radix statistics method, apparatus, system, device, and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN111563109A true CN111563109A (en) 2020-08-21
CN111563109B CN111563109B (en) 2023-09-01

Family

ID=72070594

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010339945.1A Active CN111563109B (en) 2020-04-26 2020-04-26 Radix statistics method, apparatus, system, device, and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN111563109B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020026438A1 (en) * 2000-08-28 2002-02-28 Walid Rjaibi Estimation of column cardinality in a partitioned relational database
US6957222B1 (en) * 2001-12-31 2005-10-18 Ncr Corporation Optimizing an outer join operation using a bitmap index structure
CN104866608A (en) * 2015-06-05 2015-08-26 中国人民大学 Query optimization method based on join index in data warehouse
CN108256087A (en) * 2018-01-22 2018-07-06 北京腾云天下科技有限公司 A kind of data importing, inquiry and processing method based on bitmap structure

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112162918A (en) * 2020-09-07 2021-01-01 北京达佳互联信息技术有限公司 Application program testing method and device and electronic equipment
CN112114984A (en) * 2020-09-17 2020-12-22 清华大学 Graph data processing method and device
CN112612790A (en) * 2020-12-17 2021-04-06 深圳前海微众银行股份有限公司 Card number configuration method, device, equipment and computer storage medium
CN113282247A (en) * 2021-06-24 2021-08-20 京东科技控股股份有限公司 Data storage method, data reading method, data storage device, data reading device and electronic equipment
CN113468179A (en) * 2021-07-09 2021-10-01 北京东方国信科技股份有限公司 Method, device and equipment for estimating base number of database and storage medium
CN113468179B (en) * 2021-07-09 2024-03-19 北京东方国信科技股份有限公司 Base number estimation method, base number estimation device, base number estimation equipment and storage medium

Also Published As

Publication number Publication date
CN111563109B (en) 2023-09-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant