CN114357069A - Big data sampling method and system based on distributed storage - Google Patents

Big data sampling method and system based on distributed storage

Info

Publication number
CN114357069A
Authority
CN
China
Prior art keywords
sampling
data
index
sampling rate
time period
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111588216.0A
Other languages
Chinese (zh)
Other versions
CN114357069B (en)
Inventor
杨忠伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Weimeng Chuangke Network Technology China Co Ltd
Original Assignee
Weimeng Chuangke Network Technology China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Weimeng Chuangke Network Technology China Co Ltd
Priority to CN202111588216.0A
Publication of CN114357069A
Application granted
Publication of CN114357069B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention provide a big data sampling method and system based on distributed storage. The method comprises: for each index, acquiring index data for a preset time period from a distributed storage module, and randomly sampling that index data at an initial sampling rate to obtain corresponding index sampling data; correcting the previous sampling rate in a preset manner to obtain the next sampling rate, and randomly sampling the index data of the preset time period at the next sampling rate to obtain corresponding index sampling data; performing an aggregation calculation on the index sampling data for each sampling rate to obtain the calculation result for that sampling rate; repeating the correction until the calculation result obtained at the corrected sampling rate meets a preset requirement, and taking the sampling rate whose calculation result meets the preset requirement as the final sampling rate of the index; and sampling index data from the distributed storage module at the final sampling rate of the index. Because the final sampling rate is determined by trial calculation, server cost is reduced and calculation time is saved.

Description

Big data sampling method and system based on distributed storage
Technical Field
The invention relates to the field of data analysis, in particular to a big data sampling method and system based on distributed storage.
Background
With the rapid spread of the internet, a large amount of data is generated every day. Internet enterprises need a big data platform to process this mass data, and such processing is computationally intensive and takes a long time.
For example, a large internet enterprise may produce billions of user behavior log entries per day, and calculating a single user behavior index can require hundreds of servers running for 4-5 hours. Calculating over mass data in this way is time-consuming and laborious and imposes a significant cost on the enterprise.
Disclosure of Invention
The embodiment of the invention provides a big data sampling method and system based on distributed storage, which determine the final sampling rate based on trial calculation, reduce the cost of a server and save the calculation time while ensuring the calculation accuracy to meet the service requirement.
To achieve the above object, in one aspect, an embodiment of the present invention provides a big data sampling method based on distributed storage, including:
storing big data including various index data by adopting a distributed storage module, and setting an initial sampling rate of indexes when the index data is sampled from the distributed storage module;
for each index, acquiring index data in a preset time period from a distributed storage module, and randomly sampling from the index data in the preset time period according to an initial sampling rate to obtain corresponding index sampling data; correcting the previous sampling rate according to a preset mode to obtain a next sampling rate, and randomly sampling from the index data of the preset time period according to the next sampling rate to obtain corresponding index sampling data; performing aggregation calculation on the index sampling data corresponding to each sampling rate to obtain a calculation result corresponding to the sampling rate; until the calculation result obtained by carrying out aggregation calculation on the index sampling data obtained by random sampling according to the corrected sampling rate meets the preset requirement, taking the sampling rate corresponding to the calculation result meeting the preset requirement as the final sampling rate of the index; wherein the initial sampling rate is a first sampling rate;
and sampling index data from the distributed storage module by adopting the final sampling rate of the index.
In another aspect, an embodiment of the present invention provides a big data sampling system based on distributed storage, including:
the data storage unit is used for storing big data comprising various index data by adopting a distributed storage module;
the coordination manager is used for setting the initial sampling rate of the index when the index data is sampled from the distributed storage modules;
the sampling rate calculation unit is used for acquiring index data in a preset time period from the distributed storage module aiming at each index, and randomly sampling from the index data in the preset time period according to the initial sampling rate set by the coordination manager to obtain corresponding index sampling data; correcting the previous sampling rate according to a preset mode to obtain a next sampling rate, and randomly sampling from the index data of the preset time period according to the next sampling rate to obtain corresponding index sampling data; performing aggregation calculation on the index sampling data corresponding to each sampling rate to obtain a calculation result corresponding to the sampling rate; until the calculation result obtained by carrying out aggregation calculation on the index sampling data obtained by random sampling according to the corrected sampling rate meets the preset requirement, taking the sampling rate corresponding to the calculation result meeting the preset requirement as the final sampling rate of the index; wherein the initial sampling rate is a first sampling rate;
and the sampling unit is used for sampling the index data from the distributed storage module by adopting the final sampling rate of the index.
The technical scheme has the following beneficial effects: determining the final sampling rate by trial calculation ensures that the calculation accuracy meets the service requirement while reducing server cost and saving calculation time.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a big data sampling method based on distributed storage according to an embodiment of the present invention;
FIG. 2 is a block diagram of a big data sampling system based on distributed storage according to an embodiment of the present invention;
FIG. 3 is a diagram of a big data system architecture according to an embodiment of the present invention;
FIG. 4 is a diagram of the distributed data storage structure according to an embodiment of the present invention;
FIG. 5 is a diagram of a sampling computation architecture according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in FIG. 1, an embodiment of the present invention provides a big data sampling method based on distributed storage, including:
S101: storing big data comprising various index data by adopting a distributed storage module;
Big data is generally understood in the industry as data whose scale exceeds the analysis and calculation capacity of a traditional database, so that a cluster of multiple machines is usually needed to process it. Big data has the "5V" characteristics: Volume (large scale), Velocity (high speed), Variety (diversity), Value (low value density), and Veracity (authenticity). In other words, it is a data set whose scale greatly exceeds what traditional database software tools can acquire, store, manage, and analyze, characterized by large data scale, rapid data circulation, diverse data types, and low value density.
S102: setting an initial sampling rate of indexes when the index data is sampled from the distributed storage module;
S103: for each index, acquiring index data in a preset time period from the distributed storage module, and randomly sampling from the index data in the preset time period according to the initial sampling rate to obtain corresponding index sampling data; correcting the previous sampling rate according to a preset mode to obtain a next sampling rate, and randomly sampling from the index data of the preset time period according to the next sampling rate to obtain corresponding index sampling data; performing aggregation calculation on the index sampling data corresponding to each sampling rate to obtain a calculation result corresponding to the sampling rate; repeating the correction until the calculation result obtained by aggregation calculation on the index sampling data randomly sampled at the corrected sampling rate meets the preset requirement, and taking the sampling rate corresponding to the calculation result meeting the preset requirement as the final sampling rate of the index; wherein the initial sampling rate is the first sampling rate;
S104: sampling index data from the distributed storage module at the final sampling rate of the index.
Preferably, step S101 comprises:
using an encapsulated HDFS system to write the big data sequentially into different distributed storage modules within that system.
Preferably, in step S103, correcting the previous sampling rate according to the preset mode to obtain the next sampling rate comprises: halving the previous sampling rate to obtain the next sampling rate (halving iteration).
Preferably, in step S103, for each index, performing aggregation calculation on the index sampling data corresponding to each sampling rate to obtain a calculation result corresponding to the sampling rate, and taking the sampling rate whose calculation result meets the preset requirement as the final sampling rate of the index, comprises:
S1031: for the index sampling data at each sampling rate, calculating the proportion of abnormal index data in that index sampling data, and calculating the abnormal data proportion error corresponding to the sampling rate; the abnormal data proportion error is the difference between the proportion of abnormal index data obtained with sampling and the proportion obtained without sampling, where the proportion without sampling is the proportion of all abnormal index data in the index data of the preset time period;
S1032: when the abnormal data proportion error corresponding to the previous sampling rate is smaller than a preset error threshold and the abnormal data proportion error corresponding to the next sampling rate is larger than the preset error threshold, taking the previous sampling rate as the final sampling rate of the index.
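Steps S1031 and S1032 can be illustrated with a minimal Python sketch. The function names, the `is_abnormal` predicate, and the use of an absolute difference for the "difference value" are assumptions made for illustration; the patent does not specify them.

```python
def abnormal_ratio(records, is_abnormal):
    """Proportion of records flagged abnormal (0.0 for an empty input)."""
    if not records:
        return 0.0
    return sum(1 for r in records if is_abnormal(r)) / len(records)


def ratio_error(full_data, sampled_data, is_abnormal):
    """S1031 (assumed form): absolute difference between the abnormal
    proportion with sampling and the abnormal proportion without sampling."""
    return abs(abnormal_ratio(sampled_data, is_abnormal)
               - abnormal_ratio(full_data, is_abnormal))


def pick_final_rate(errors_by_rate, threshold):
    """S1032: errors_by_rate is a list of (rate, error) pairs in trial order.
    Returns the previous rate of the first adjacent pair whose previous error
    is below the threshold and whose next error is above it, else None."""
    for (prev_rate, prev_err), (_, next_err) in zip(errors_by_rate,
                                                    errors_by_rate[1:]):
        if prev_err < threshold and next_err > threshold:
            return prev_rate
    return None
```

With trial errors of 0.0001, 0.0002, and 0.0005 at rates 50, 25, and 12 and a threshold of 0.0003, the sketch selects 25, the last rate before the error crosses the threshold.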
Preferably, the method further comprises the following steps:
S105: for all indexes, pushing, at specified time intervals, the index data of the preset time period acquired from the distributed storage module and the index sampling data obtained by randomly sampling that index data;
in step S103, performing aggregation calculation on the index sampling data corresponding to each sampling rate to obtain the calculation result corresponding to the sampling rate specifically comprises:
after receiving the index data of the preset time period acquired from the distributed storage module and the index sampling data obtained by randomly sampling that index data, performing aggregation calculation on the index sampling data corresponding to the sampling rate to obtain the calculation result corresponding to the sampling rate.
As shown in fig. 2, in conjunction with an embodiment of the present invention, there is provided a big data sampling system based on distributed storage, including:
the data storage unit 21 is used for storing big data including various index data by adopting a distributed storage module;
the coordination manager 22 is configured to set an initial sampling rate of the index when the index data is sampled from the distributed storage modules;
the sampling rate calculation unit 23 is configured to, for each index, obtain index data in a preset time period from the distributed storage module, and randomly sample from the index data in the preset time period according to an initial sampling rate set by the coordination manager to obtain corresponding index sample data; correcting the previous sampling rate according to a preset mode to obtain a next sampling rate, and randomly sampling from the index data of the preset time period according to the next sampling rate to obtain corresponding index sampling data; performing aggregation calculation on the index sampling data corresponding to each sampling rate to obtain a calculation result corresponding to the sampling rate; until the calculation result obtained by carrying out aggregation calculation on the index sampling data obtained by random sampling according to the corrected sampling rate meets the preset requirement, taking the sampling rate corresponding to the calculation result meeting the preset requirement as the final sampling rate of the index; wherein the initial sampling rate is a first sampling rate;
and the sampling unit 24 is configured to sample the index data from the distributed storage module by using the final sampling rate of the index.
Preferably, the data storage unit 21 includes:
an encapsulated HDFS system, configured to write the big data sequentially into different distributed storage modules within the encapsulated HDFS system.
Preferably, the sampling rate calculation unit 23 includes:
a sampling rate correction subunit 231, configured to correct the previous sampling rate according to a preset mode to obtain the next sampling rate, wherein correcting the previous sampling rate according to the preset mode comprises: halving the previous sampling rate to obtain the next sampling rate (halving iteration).
Preferably, the sampling rate calculation unit 23 includes:
a sampling rate verification subunit 232, configured to calculate, for the index sampling data at each sampling rate, the proportion of abnormal index data in that index sampling data, and to calculate the abnormal data proportion error corresponding to the sampling rate; the abnormal data proportion error is the difference between the proportion of abnormal index data obtained with sampling and the proportion obtained without sampling, where the proportion without sampling is the proportion of all abnormal index data in the index data of the preset time period;
the sampling rate determining subunit 233 is configured to use the previous sampling rate as the final sampling rate of the index when the ratio error of the abnormal data corresponding to the previous sampling rate is smaller than the preset error threshold and the ratio error of the abnormal data corresponding to the next sampling rate is larger than the preset error threshold.
Preferably, the system further includes a data pushing unit 25, and the sampling rate calculation unit 23 includes an aggregation calculation subunit 234, where:
the data pushing unit 25 is configured to push, for all indexes and at specified time intervals, the index data of the preset time period acquired from the distributed storage module and the index sampling data obtained by randomly sampling that index data;
the aggregation calculation subunit 234 is configured to, after receiving from the data pushing unit the index data of the preset time period acquired from the distributed storage module and the index sampling data obtained by randomly sampling that index data, perform aggregation calculation on the index sampling data corresponding to the sampling rate to obtain the calculation result corresponding to the sampling rate.
The beneficial effects obtained by the invention are as follows:
The method can quickly compute over mass data, greatly reducing server cost and calculation time.
Because the sampling rate is determined by trial calculation, the method can adapt flexibly to the accuracy requirements of different indexes.
The above technical solutions of the embodiments of the present invention are described in detail below with reference to specific application examples, and reference may be made to the foregoing related descriptions for technical details that are not described in the implementation process.
The abbreviations and key terms to which the present invention relates are defined as follows:
Real-time sampling: the fast big data computing approach introduced by this patent, which is much faster than a traditional big data computing system.
Distributed: as opposed to centralized; tasks are distributed across many servers.
Data sampling: acquiring a part of the data from a mass data set according to a sampling algorithm in order to analyze it.
Distributed data sampling: the sampling operation is performed in a distributed manner, which makes it faster.
The invention relates to a fast big data computing system based on distributed real-time sampling, and belongs to the technical field of big data and data analysis. The system samples the data in a distributed manner while computing, which greatly reduces the servers and time required for a calculation; that is, a distributed real-time sampling mechanism enables fast big data computing. The final sampling rate is determined by trial calculation, so that the calculation accuracy meets the service requirement while server cost is reduced and calculation time is saved.
For example, for the same billions of log entries, only a few servers are needed, and the calculation of the corresponding service index can be completed in a dozen or so minutes. This saves both cost and computing time, can bring good returns to enterprises, and makes the system highly practical.
The architecture diagram of the big data system of the technical scheme of the invention is shown in fig. 3, and the distributed real-time sampling big data computing system mainly comprises a data storage module (data storage unit), a sampling computing module (sampling rate computing unit) and a coordination manager. Wherein:
The data storage module, as shown in FIG. 4, is responsible for storing the mass data to be computed. Big data including various index data is stored using distributed storage modules: an encapsulated HDFS system writes the big data sequentially into different distributed storage modules within that system. Specifically:
Because the data volume is large, a single machine cannot hold it, so the data storage module adopts a distributed file storage system, implemented by encapsulating HDFS. To allow fast access to the data, files are written sequentially, with a new file started every 1 GB. Sequentially written files can be read quickly, and a 1 GB file size keeps the number of files from becoming excessive.
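The sequential-write rule with one file per 1 GB can be sketched as follows. This is only an illustration that uses the local filesystem as a stand-in for the encapsulated HDFS system; the class name `RollingWriter`, the file naming scheme, and the newline-delimited record format are all assumptions.

```python
import os
import tempfile


class RollingWriter:
    """Append records sequentially, starting a new file at a size limit."""

    def __init__(self, directory, max_bytes=1 << 30, prefix="part"):
        self.directory = directory
        self.max_bytes = max_bytes  # 1 GiB by default, as in the description
        self.prefix = prefix
        self.index = 0
        self.written = 0
        self.handle = None

    def _open_next(self):
        # Close the current file (if any) and open the next one in sequence.
        if self.handle:
            self.handle.close()
        name = f"{self.prefix}-{self.index:05d}.log"
        self.handle = open(os.path.join(self.directory, name), "ab")
        self.index += 1
        self.written = 0

    def write(self, record: bytes):
        line = record + b"\n"
        # Roll over before a write that would push the file past the limit.
        if self.handle is None or self.written + len(line) > self.max_bytes:
            self._open_next()
        self.handle.write(line)
        self.written += len(line)

    def close(self):
        if self.handle:
            self.handle.close()
            self.handle = None


if __name__ == "__main__":
    # Small demonstration with a 10-byte limit instead of 1 GB.
    demo_dir = tempfile.mkdtemp()
    writer = RollingWriter(demo_dir, max_bytes=10)
    for _ in range(5):
        writer.write(b"abcd")
    writer.close()
    print(sorted(os.listdir(demo_dir)))
```

Because each file is only ever appended to, a reader can scan it front to back without seeking, which is the access pattern the sampling calculation modules rely on.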
A sampling calculation module: for each index, acquiring index data in a preset time period from a distributed storage module, and randomly sampling from the index data in the preset time period according to an initial sampling rate to obtain corresponding index sampling data; correcting the previous sampling rate according to a preset mode to obtain a next sampling rate, and randomly sampling from the index data of the preset time period according to the next sampling rate to obtain corresponding index sampling data; performing aggregation calculation on the index sampling data corresponding to each sampling rate to obtain a calculation result corresponding to the sampling rate; until the calculation result obtained by carrying out aggregation calculation on the index sampling data obtained by random sampling according to the corrected sampling rate meets the preset requirement, taking the sampling rate corresponding to the calculation result meeting the preset requirement as the final sampling rate of the index; wherein the initial sampling rate is a first sampling rate.
Specifically: a single sampling calculation module, configured as shown in FIG. 5, acquires data from the data storage module, communicates with the coordination manager, decides how much of the data currently being processed to retain according to the sampling rate given by the coordination manager, and performs the corresponding aggregation calculation. Different data indexes and different sampling rates affect calculation accuracy differently, so a single uniform sampling rate cannot be used. Correcting the previous sampling rate according to a preset mode to obtain the next sampling rate comprises halving the previous sampling rate (that is, the sampling rate is determined by halving-iteration trial calculation). For the index sampling data at each sampling rate, the proportion of abnormal index data in that sampling data is calculated, together with the abnormal data proportion error for the sampling rate; the abnormal data proportion error is the difference between the proportion of abnormal index data obtained with sampling and the proportion obtained without sampling, where the proportion without sampling is the proportion of all abnormal index data in the index data of the preset time period. When the abnormal data proportion error for the previous sampling rate is smaller than a preset error threshold and the error for the next sampling rate is larger than the threshold, the previous sampling rate is taken as the final sampling rate of the index.
For example, with an initial sampling rate of 50%, the coordination manager takes a small batch of data and performs trial calculations at 50% and at the halved rate of 25%. If the calculation error between the two is acceptable, 25% is halved again (the fractional part is dropped), giving a 12% sampling rate for the next trial calculation, whose result is compared with the trial calculation on the 50%-sampled small batch. This continues until the error at some sampling rate exceeds the acceptable error set in advance. The sampling rate with the largest error that is still acceptable becomes the accepted sampling rate.
The acceptable error is generally set in advance and differs for each index, depending on what the service can accept. For example, the video playback second-open rate is a proportion index, and the error acceptable to the service is 0.03%. The sampling rate for this index can be 6% (6 records kept out of every 100).
The sampling rate setting principle is as follows:
(1) The initial sampling rate is 50%.
(2) During trial calculation the sampling rate is repeatedly halved: it is divided by two and the fractional part is dropped.
(3) The corresponding sampling rate is calculated iteratively according to the error rate preset by the service, and is stored in the coordination manager.
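The three principles above can be sketched as a small trial-calculation loop. This is a minimal illustration, not the patented implementation: the function names, the `metric` callback, the fixed random seed, and the choice to compare each trial against the unsampled value of the trial batch (as in step S1031) are assumptions.

```python
import random


def halve(rate_percent):
    """Integer halving: the fractional part is dropped (25 -> 12)."""
    return rate_percent // 2


def sample(records, rate_percent, rng):
    """Keep each record independently with probability rate_percent / 100."""
    p = rate_percent / 100.0
    return [r for r in records if rng.random() < p]


def find_final_rate(records, metric, max_error, initial_rate=50, seed=0):
    """Halve the rate until the metric error exceeds max_error.

    Returns the last rate whose error was acceptable, or None if even the
    initial rate is not acceptable.
    """
    rng = random.Random(seed)
    baseline = metric(records)      # unsampled value on the trial batch
    accepted = None
    rate = initial_rate
    while rate >= 1:
        picked = sample(records, rate, rng)
        if not picked:              # nothing left to evaluate at this rate
            break
        if abs(metric(picked) - baseline) > max_error:
            break
        accepted = rate
        rate = halve(rate)
    return accepted
```

Starting from 50, the halving rule visits 50, 25, 12, 6, 3, and 1 before reaching 0, so the loop runs at most six trials per index. The returned rate is what the text describes as being stored in the coordination manager.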
Determining the sampling rate by trial calculation allows the method to adapt flexibly to the accuracy requirements of different indexes.
Storing the sampling rate in the coordination manager keeps the sampling calculation modules coordinated.
For example, at a 3% sampling rate, each sampling calculation module reads the data file sequentially and randomly keeps 3 records out of every 100 for calculation.
Records not selected by the sampling are discarded directly without being calculated, which greatly saves computing resources.
The 3% sampling rate above is only an example; in actual use, different sampling rates are used depending on the accuracy each service accepts.
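The "keep 3 out of every 100" rule can be sketched as a streaming filter over sequentially read records. The function name `sample_per_block` and the fixed block size of 100 are assumptions for illustration; the patent only states that 3 of every 100 records are randomly kept.

```python
import random


def sample_per_block(lines, keep_per_100, seed=None):
    """Yield exactly keep_per_100 randomly chosen records from each
    consecutive block of 100; all other records are skipped unprocessed."""
    rng = random.Random(seed)
    keep = set()
    for i, line in enumerate(lines):
        pos = i % 100
        if pos == 0:
            # Choose which positions of the upcoming block will be kept.
            keep = set(rng.sample(range(100), keep_per_100))
        if pos in keep:
            yield line
```

A module could wrap an open file handle, for example `for line in sample_per_block(open(path), 3): ...`, so that unselected lines are never parsed. Choosing positions per block also spreads the kept records evenly through the file.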
If a service requires very high accuracy, this system is not suitable for it. In practice, ordinary service requirements can be met with a sampling rate of 12% or even higher.
For all indexes, the index data of the preset time period acquired from the distributed storage module, and the index sampling data obtained by randomly sampling it, are pushed at specified time intervals. Specifically: while sampling proceeds, a batch of data is sent to the computing module at each specified interval (all data handled by the computing module has already been sampled). The interval is set by the coordination manager and is configurable according to the requirements of the business side; the default is 10 seconds. The computing module encapsulates a distributed computing engine, implemented here with Spark Streaming.
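The interval-based push can be illustrated without a streaming engine. The sketch below is plain Python and does not use Spark Streaming; it only shows how sampled records might be grouped into consecutive 10-second batches. The function name and the `(timestamp, payload)` record shape are assumptions.

```python
from collections import defaultdict


def batch_by_interval(records, interval_seconds=10):
    """Group (timestamp, payload) pairs into fixed-length windows.

    Returns a list of (window_start, [payloads]) sorted by window start.
    """
    windows = defaultdict(list)
    for ts, payload in records:
        start = int(ts // interval_seconds) * interval_seconds
        windows[start].append(payload)
    return sorted(windows.items())
```

Each returned batch corresponds to one push to the computing module. A shorter interval gives fresher results at the cost of more, smaller batches.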
The coordination manager is the main control core of the whole big data system. It is responsible for coordinating the sampling calculation modules as they acquire data from the data storage module, sample it at the set sampling rate, and execute the corresponding big data operations.
The accuracy of the calculation result is maintained by the following means:
(1) The calculated index is preferably a proportion index whose numerator and denominator come from the same log and use the same sampling rate.
(2) Sampling should be as uniform as possible so that every small time interval contains data; for example, sampling must not leave a given second, or several seconds, with no data at all.
(3) The sampling rate is determined by trial calculation using the sampling coordination manager.
The sampling rate setting principle is as follows:
(a) The initial sampling rate is 50%.
(b) During trial calculation the sampling rate is repeatedly halved: it is divided by two and the fractional part is dropped.
(c) The corresponding sampling rate is calculated iteratively according to the error rate preset by the service, and is stored in the coordination manager.
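Points (1) and (2) above can be checked with two small helpers. This is a sketch under assumptions: the function names, the `is_hit` predicate, and the use of whole seconds as the "small time interval" are illustrative and not taken from the patent.

```python
def ratio_index(log, is_hit):
    """Proportion index whose numerator and denominator both come from
    the same (sampled) log, so both share one sampling rate."""
    total = len(log)
    return sum(1 for rec in log if is_hit(rec)) / total if total else None


def empty_seconds(sampled_timestamps, start, end):
    """Whole seconds in [start, end) that contain no sampled record."""
    seen = {int(ts) for ts in sampled_timestamps}
    return [s for s in range(start, end) if s not in seen]
```

Because numerator and denominator are thinned by the same sampling, the ratio estimates the same quantity as the unsampled ratio, which is why proportion indexes suit this scheme. A non-empty result from `empty_seconds` suggests the sampling is too sparse or not uniform enough for that window.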
It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification or the claims is intended to mean a "non-exclusive or".
Those of skill in the art will further appreciate that the various illustrative logical blocks, units, and steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate the interchangeability of hardware and software, various illustrative components, elements, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design requirements of the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present embodiments.
The various illustrative logical blocks, or elements, described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. For example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, which may be located in a user terminal. In the alternative, the processor and the storage medium may reside in different components in a user terminal.
In one or more exemplary designs, the functions described above in connection with the embodiments of the invention may be implemented in hardware, software, firmware, or any combination of the three. If implemented in software, the functions may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media that facilitate transfer of a computer program from one place to another. Storage media may be any available media that can be accessed by a general-purpose or special-purpose computer. For example, such computer-readable media can include, but are not limited to, RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store program code in the form of instructions or data structures and which can be read by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Additionally, any connection is properly termed a computer-readable medium; thus, if the software is transmitted from a website, server, or other remote source via a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, that connection is included in the definition of medium. Disk and disc, as used herein, include compact discs, laser discs, optical discs, DVDs, floppy disks and Blu-ray discs, where disks usually reproduce data magnetically, while discs usually reproduce data optically with lasers. Combinations of the above may also be included in the computer-readable medium.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A big data sampling method based on distributed storage, characterized by comprising the following steps:
storing big data including various index data by adopting a distributed storage module, and setting an initial sampling rate of indexes when the index data is sampled from the distributed storage module;
for each index, acquiring index data of a preset time period from the distributed storage module, and randomly sampling from the index data of the preset time period according to the initial sampling rate to obtain corresponding index sampling data; correcting the previous sampling rate in a preset manner to obtain a next sampling rate, and randomly sampling from the index data of the preset time period according to the next sampling rate to obtain corresponding index sampling data; performing aggregation calculation on the index sampling data corresponding to each sampling rate to obtain a calculation result corresponding to that sampling rate; repeating the correction until the calculation result obtained by aggregating the index sampling data randomly sampled at the corrected sampling rate meets a preset requirement, and taking the sampling rate corresponding to the calculation result that meets the preset requirement as the final sampling rate of the index; wherein the initial sampling rate is the first sampling rate;
and sampling index data from the distributed storage module by adopting the final sampling rate of the index.
2. The big data sampling method based on distributed storage according to claim 1, wherein storing the big data comprising various index data by adopting a distributed storage module comprises:
writing the big data in sequence into different distributed storage modules of an encapsulated HDFS system.
3. The big data sampling method based on distributed storage according to claim 1, wherein correcting the previous sampling rate in the preset manner to obtain the next sampling rate comprises: correcting the previous sampling rate by halving iteration, that is, halving the previous sampling rate to obtain the next sampling rate.
4. The big data sampling method based on distributed storage according to claim 3, wherein, for each index, performing aggregation calculation on the index sampling data corresponding to each sampling rate to obtain a calculation result corresponding to the sampling rate, and taking, as the final sampling rate of the index, the sampling rate whose calculation result meets the preset requirement, comprises:
for the index sampling data of each sampling rate, calculating the proportion of abnormal index data in the index sampling data of the preset time period, and calculating the abnormal data proportion error corresponding to the sampling rate; wherein the abnormal data proportion error is the difference between the proportion of abnormal index data in the index data of the preset time period with sampling and the proportion of abnormal index data without sampling, and the proportion of abnormal index data without sampling is the proportion of all abnormal index data in the full index data of the preset time period;
and when the abnormal data proportion error corresponding to the previous sampling rate is smaller than a preset error threshold and the abnormal data proportion error corresponding to the next sampling rate is larger than the preset error threshold, taking the previous sampling rate as the final sampling rate of the index.
5. The distributed storage based big data sampling method of claim 1, further comprising:
pushing index data in a preset time period acquired from a distributed storage module and index sampling data obtained by random sampling in the index data in the preset time period at specified time intervals for all indexes;
the aggregating calculation of the index sample data corresponding to each sampling rate to obtain a calculation result corresponding to the sampling rate specifically includes:
after index data in a preset time period acquired from a distributed storage module and index sampling data obtained by random sampling in the index data in the preset time period are received, aggregation calculation is carried out on the index sampling data corresponding to the sampling rate to obtain a calculation result corresponding to the sampling rate.
6. A distributed storage based big data sampling system, comprising:
the data storage unit is used for storing big data comprising various index data by adopting a distributed storage module;
the coordination manager is used for setting the initial sampling rate of the index when the index data is sampled from the distributed storage modules;
the sampling rate calculation unit is used for, for each index, acquiring index data of a preset time period from the distributed storage module, and randomly sampling from the index data of the preset time period according to the initial sampling rate set by the coordination manager to obtain corresponding index sampling data; correcting the previous sampling rate in a preset manner to obtain a next sampling rate, and randomly sampling from the index data of the preset time period according to the next sampling rate to obtain corresponding index sampling data; performing aggregation calculation on the index sampling data corresponding to each sampling rate to obtain a calculation result corresponding to that sampling rate; and repeating the correction until the calculation result obtained by aggregating the index sampling data randomly sampled at the corrected sampling rate meets a preset requirement, whereupon the sampling rate corresponding to the calculation result that meets the preset requirement is taken as the final sampling rate of the index; wherein the initial sampling rate is the first sampling rate;
and the sampling unit is used for sampling the index data from the distributed storage module by adopting the final sampling rate of the index.
7. The distributed storage based big data sampling system of claim 6, wherein the data storage unit comprises:
the encapsulated HDFS system is used for writing the big data in sequence into different distributed storage modules of the encapsulated HDFS system.
8. The distributed storage based big data sampling system of claim 6, wherein the sampling rate calculation unit comprises:
a sampling rate correction subunit, configured to correct the previous sampling rate in a preset manner to obtain the next sampling rate, wherein correcting the previous sampling rate in the preset manner comprises: halving the previous sampling rate to obtain the next sampling rate.
9. The distributed storage based big data sampling system of claim 8, wherein the sampling rate calculation unit comprises:
the sampling rate verification subunit is used for calculating, for the index sampling data of each sampling rate, the proportion of abnormal index data in the index sampling data of the preset time period, and calculating the abnormal data proportion error corresponding to the sampling rate; wherein the abnormal data proportion error is the difference between the proportion of abnormal index data in the index data of the preset time period with sampling and the proportion of abnormal index data without sampling, and the proportion of abnormal index data without sampling is the proportion of all abnormal index data in the full index data of the preset time period;
and the sampling rate determining subunit is used for taking the previous sampling rate as the final sampling rate of the index when the abnormal data proportion error corresponding to the previous sampling rate is smaller than the preset error threshold and the abnormal data proportion error corresponding to the next sampling rate is larger than the preset error threshold.
10. The distributed storage based big data sampling system of claim 6, further comprising a data pushing unit, wherein the sampling rate calculation unit comprises an aggregation calculation subunit, wherein:
the data pushing unit is used for pushing, for all indexes and at specified time intervals, the index data of a preset time period acquired from the distributed storage module and the index sampling data obtained by random sampling from the index data of the preset time period;
and the aggregation calculation subunit is configured to, after receiving the index data within the preset time period, which is pushed by the data pushing unit and acquired from the distributed storage module, and the index sampling data obtained by randomly sampling the index data within the preset time period, perform aggregation calculation on the index sampling data corresponding to the sampling rate to obtain a calculation result corresponding to the sampling rate.
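The trial calculation recited in claims 1, 3 and 4 (and mirrored in claims 6, 8 and 9) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: sampling rates are taken to be integer percentages, random sampling is modelled as independent per-record selection, the "difference" in the abnormal data proportion error is taken as an absolute difference, and an error exactly equal to the threshold is treated as acceptable. The names `abnormal_ratio`, `choose_final_rate` and `is_abnormal` are hypothetical.

```python
import random
from typing import Callable, Optional, Sequence


def abnormal_ratio(records: Sequence, is_abnormal: Callable[[object], bool]) -> float:
    """Proportion of abnormal records; 0.0 for an empty collection (assumption)."""
    if not records:
        return 0.0
    return sum(1 for r in records if is_abnormal(r)) / len(records)


def choose_final_rate(
    records: Sequence,
    is_abnormal: Callable[[object], bool],
    initial_rate_percent: int,
    error_threshold: float,
    rng: random.Random,
) -> Optional[int]:
    """Halve the sampling rate until the abnormal data proportion error
    exceeds the threshold, then return the previous (last acceptable) rate.

    Returns None if even the initial rate exceeds the threshold, and the
    smallest rate tried if the threshold is never exceeded (both are
    assumptions; the claims do not cover these cases).
    """
    # Proportion of abnormal data without sampling (full preset time period).
    baseline = abnormal_ratio(records, is_abnormal)
    previous = None
    rate = initial_rate_percent
    while rate >= 1:
        p = rate / 100
        # Random sampling: keep each record independently with probability p.
        sample = [r for r in records if rng.random() < p]
        error = abs(abnormal_ratio(sample, is_abnormal) - baseline)
        if error > error_threshold:
            return previous
        previous = rate
        rate //= 2  # halving iteration, fractional part discarded
    return previous
```

For example, with `records` as a list of booleans marking abnormal entries, `choose_final_rate(records, bool, 100, 0.005, random.Random(42))` would return the lowest rate in the halving sequence reached before the error first exceeds 0.5 percentage points.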
CN202111588216.0A 2021-12-23 2021-12-23 Big data sampling method and system based on distributed storage Active CN114357069B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111588216.0A CN114357069B (en) 2021-12-23 2021-12-23 Big data sampling method and system based on distributed storage

Publications (2)

Publication Number Publication Date
CN114357069A true CN114357069A (en) 2022-04-15
CN114357069B CN114357069B (en) 2024-05-28

Family

ID=81102301

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111588216.0A Active CN114357069B (en) 2021-12-23 2021-12-23 Big data sampling method and system based on distributed storage

Country Status (1)

Country Link
CN (1) CN114357069B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060274763A1 (en) * 2005-06-03 2006-12-07 Error Christopher R Variable sampling rates for website visitation analysis
CN107423433A (en) * 2017-08-03 2017-12-01 聚好看科技股份有限公司 A kind of data sampling rate control method and device
WO2018027466A1 (en) * 2016-08-08 2018-02-15 马岩 Method and system for storing big data in distributed system
US20180365523A1 (en) * 2016-02-29 2018-12-20 Alibaba Group Holding Limited Method and system for training machine learning system
CN113807396A (en) * 2021-08-12 2021-12-17 华南理工大学 Method, system, device and medium for detecting abnormality of high-dimensional data of Internet of things

Also Published As

Publication number Publication date
CN114357069B (en) 2024-05-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant