WO2016184316A1 - Data rate-limiting method and apparatus - Google Patents

Data rate-limiting method and apparatus

Info

Publication number
WO2016184316A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
value
piece
sensitive hash
hash value
Prior art date
Application number
PCT/CN2016/081216
Other languages
English (en)
French (fr)
Inventor
胡四海
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2016184316A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control

Definitions

  • The present application relates to the field of Internet technologies, and in particular to a data rate-limiting method and apparatus.
  • Existing rate-limiting schemes generally fall into two types: random rate limiting and hashing.
  • Random rate limiting is usually purely random: which data is removed and which is retained is decided entirely at random, so the diversity of the data that remains cannot be guaranteed.
  • The hash scheme compares computed hash values to determine whether two pieces of data are identical and preferentially removes identical data, but it cannot recognize two pieces of data that are merely similar.
  • The purpose of the present application is to solve, at least to some extent, one of the technical problems in the related art.
  • A first object of the present application is to propose a data rate-limiting method.
  • The method removes data according to how similar or different the data is and preferentially removes identical data, thereby maximizing the diversity of the data that remains after rate limiting.
  • A second object of the present application is to propose a data rate-limiting device.
  • The data rate-limiting method of the first aspect of the present application includes: calculating a locality-sensitive hash value of received data; calculating a similarity value between the data and at least one piece of saved data according to the locality-sensitive hash value of the data and the locality-sensitive hash value of the at least one piece of saved data; and determining whether to save the data according to the similarity value.
  • The data rate-limiting method of the embodiment of the present application calculates a locality-sensitive hash value of the received data, then calculates a similarity value between the data and at least one piece of saved data from the two locality-sensitive hash values, and finally determines whether to save the data according to the similarity value. Data can thus be removed according to how similar or different it is, with identical data removed first, which maximizes the diversity of the data that remains after rate limiting.
  • The data rate-limiting device of the second aspect of the present application includes: a calculation module, configured to calculate a locality-sensitive hash value of received data and to calculate a similarity value between the data and at least one piece of saved data according to the locality-sensitive hash value of the data and the locality-sensitive hash value of the at least one piece of saved data; and a determining module, configured to determine whether to save the data according to the similarity value calculated by the calculation module.
  • In the data rate-limiting device of the embodiment of the present application, the calculation module calculates a locality-sensitive hash value of the received data and, from that value and the locality-sensitive hash value of at least one piece of saved data, calculates a similarity value between the data and the at least one piece of data.
  • The determining module then determines whether to save the data according to the similarity value calculated by the calculation module. Data can thus be removed according to how similar or different it is, with identical data removed first, which maximizes the diversity of the data that remains after rate limiting.
  • FIG. 1 is a flowchart of one embodiment of the data rate-limiting method of the present application.
  • FIG. 2 is a flowchart of another embodiment of the data rate-limiting method of the present application.
  • FIG. 3 is a schematic structural diagram of one embodiment of the data rate-limiting device of the present application.
  • FIG. 4 is a schematic structural diagram of another embodiment of the data rate-limiting device of the present application.
  • The data rate-limiting method of this embodiment may be implemented by a data rate-limiting device placed between an upstream server and a downstream server. Specifically, the device may be integrated into the upstream server or the downstream server to rate-limit the data that the upstream server sends to the downstream server. Alternatively, the device may be deployed in, or as, a standalone server located between the upstream server and the downstream server to perform the same function.
  • The data rate-limiting method may include:
  • Step 101: Calculate a Locality Sensitive Hashing (LSH) value of the received data.
  • LSH: Locality Sensitive Hashing
  • The received data is the data sent by the upstream server. After receiving it, the data rate-limiting device rate-limits it and then sends it to the downstream server.
  • Step 102: Calculate a similarity value between the data and at least one piece of saved data according to the LSH value of the data and the LSH value of the at least one piece of saved data.
  • The at least one piece of saved data may be data already saved in a cache, where the cache is allocated in the data rate-limiting device or in the server that contains the device.
  • Specifically, the similarity value may be obtained by first calculating a difference value between the LSH value of the data and the LSH value of the at least one piece of saved data, and then calculating the similarity value from that difference value.
  • The data rate-limiting device may calculate the similarity value from the difference value according to formula (1).
  • D_i is the difference value between the LSH value of the data and the LSH value of the at least one piece of saved data.
  • S_i is the similarity value between the data and the at least one piece of saved data; i is an integer, and i ≥ 1.
  • In this embodiment, the difference value may be the Hamming distance (HD) between the LSH value of the data and the LSH value of the at least one piece of saved data.
  • HD: Hamming Distance
  • Step 103: Determine whether to save the data according to the similarity value.
  • Specifically, the data rate-limiting device calculates a pass probability for the data from the maximum of the similarity values and a predetermined sampling rate. If the pass probability is greater than or equal to a preset threshold, the data is saved; if the pass probability is less than the preset threshold, the data is not saved.
  • The preset threshold may be set at implementation time according to implementation requirements and/or system performance; this embodiment does not limit its value. For example, the preset threshold may be 50%.
  • Saving the data may mean storing it in the cache. After the data is saved, the data rate-limiting device may further send the data saved in the cache to the downstream server, so that the data sent by the upstream server reaches the downstream server only after rate limiting.
  • The data rate-limiting device may calculate the pass probability from the maximum of the similarity values and the predetermined sampling rate according to formula (2).
  • P is the pass probability of the data; L is the predetermined sampling rate, for example 75%; S_i is the similarity value between the data and the at least one piece of saved data, where i is an integer and i ≥ 1; Max(S_i) is the maximum of the similarity values.
  • In the above embodiment, the data rate-limiting device calculates the LSH value of the received data, then calculates a similarity value between the data and at least one piece of saved data from the two LSH values, and finally determines whether to save the data according to the similarity value.
  • Data can thus be removed according to how similar or different it is, with identical data removed first, which maximizes the diversity of the data that remains after rate limiting.
  • The data rate-limiting method is described below using transaction data from an e-commerce platform as an example.
  • Suppose a system must sample transaction data in real time for inspection while preserving as much diversity in the sample as possible. The predetermined sampling rate is 75%, that is, rate limiting must remove 25% of the traffic.
  • As Table 1 shows, data No. 1 and data No. 4 are exactly the same.
  • With the 8 pieces of data in Table 1 and a predetermined sampling rate of 75% (that is, 25% removed), the 2 pieces of data that should ideally be removed are the ones that differ least from the rest:
  • No. 4 (no difference from No. 1) and No. 2 (differs from No. 1 only in purchase quantity), so that data No. 1, No. 3, No. 5, No. 6, No. 7 and No. 8 are retained.
  • By using LSH, this application keeps the sampled data as diverse as possible and preserves sufficient differences between items. This addresses the loss of data diversity under random rate limiting, and also the limitation of the hash scheme, which can only tell whether data is identical and not whether it is similar, because an ordinary hash does not preserve the degree of difference in the original content.
  • As shown in FIG. 2, the data rate-limiting method may include:
  • Step 201: Allocate cache space.
  • The cache space is allocated in the data rate-limiting device, or in the server that contains the device, and caches the LSH values of the N most recent pieces of data sent by the upstream server.
  • N can be configured according to actual conditions. The recommended value is the full volume of data within 5 minutes; if that exceeds 1024, N is capped at 1024 so that memory use stays within a few kilobytes.
  • In this embodiment, given the small amount of example data, N = 3 may be assumed. The data rate-limiting device calculates and caches LSH values for the traffic data in order of serial number. After data No. 1 flows into the device, the cache is as shown in Table 3.
  • Step 202: When new traffic data arrives, calculate the LSH value of the data to be cached, and calculate the difference value between that LSH value and the LSH value of at least one piece of data in the cache.
  • In this embodiment, the difference value is expressed as the Hamming distance (HD).
  • After data No. 2 flows in, the data rate-limiting device calculates its LSH value, which may be as shown in Table 4.
  • The device then calculates the difference value against the LSH value of data No. 1 in the cache. The HD is the number of corresponding bit positions at which two LSH values differ: the LSH value of data No. 2 is compared bit by bit with that of data No. 1, and the HD equals the number of differing bits.
  • Preferably, the HD can be computed quickly with an exclusive-or (XOR) operation.
  • The comparison between the LSH value of data No. 2 and the LSH value of data No. 1 may be as shown in Table 5.
  • Step 203: Calculate the similarity value between the data to be cached and the at least one piece of data in the cache according to the difference value.
  • The data rate-limiting device can then calculate the similarity value according to formula (1).
  • Step 204: Calculate the pass probability of the data to be cached according to the maximum of the similarity values and the predetermined sampling rate.
  • The pass probability may be calculated according to formula (2); for data No. 2, the device obtains a pass probability of 5.83%.
  • Step 205: Determine whether the pass probability is greater than or equal to the preset threshold. If so, perform step 206; if the pass probability is less than the preset threshold, perform step 207.
  • The preset threshold may be set at implementation time according to implementation requirements and/or system performance; this embodiment does not limit its value. In this embodiment, a preset threshold of 50% is used as an example.
  • Step 206: Store the data to be cached in the cache; this round of the process ends.
  • Step 207: Do not store the data to be cached in the cache; this round of the process ends.
  • Steps 202 to 207 may then be repeated to rate-limit data No. 3 to No. 8. Since data No. 2 was not stored in the cache, the cache is as shown in Table 6.
  • When data No. 3 enters the device, the HD between its LSH value and that of data No. 1 is 10, which gives a pass probability of 55.6%. This is greater than 50%, so data No. 3 is stored in the cache, which is then as shown in Table 7.
  • When data No. 4 enters the device, the HD between its LSH value and that of data No. 1 is 0, and the HD between its LSH value and that of data No. 3 is 10. The resulting pass probability is 0, so data No. 4 is not stored, and the cache remains as shown in Table 7.
  • When data No. 5 enters the device, the HD between its LSH value and that of data No. 1 is 9, and the HD between its LSH value and that of data No. 3 is 11. The resulting pass probability is 50%, so data No. 5 is stored in the cache, which is then as shown in Table 8.
  • In the above method, the data rate-limiting device rate-limits data according to differences in similarity and preferentially removes identical data, thereby maximizing the diversity of the data that remains after rate limiting.
  • FIG. 3 is a schematic structural diagram of one embodiment of the data rate-limiting device of the present application.
  • The data rate-limiting device of this embodiment can implement the process of the embodiment shown in FIG. 1 of the present application.
  • As shown in FIG. 3, the data rate-limiting device may include a calculation module 31 and a determining module 32.
  • The calculation module 31 is configured to calculate the LSH value of the received data and to calculate a similarity value between the data and at least one piece of saved data according to the two LSH values. Specifically, the calculation module 31 calculates a difference value between the LSH value of the data and the LSH value of the at least one piece of saved data, and calculates the similarity value from that difference value.
  • The difference value calculated by the calculation module 31 may be the Hamming distance between the LSH value of the data and the LSH value of the at least one piece of saved data; specifically, the calculation module 31 may calculate the similarity value according to formula (1).
  • The determining module 32 is configured to determine whether to save the data according to the similarity value calculated by the calculation module 31.
  • The data rate-limiting device may be placed between the upstream server and the downstream server. Specifically, it may be integrated into the upstream server or the downstream server to rate-limit the data that the upstream server sends to the downstream server. Alternatively, it may be deployed in, or as, a standalone server located between the upstream server and the downstream server to perform the same function. The received data is the data sent by the upstream server; after receiving it, the device rate-limits it and then sends it to the downstream server.
  • The at least one piece of saved data may be data already saved in a cache, where the cache is allocated in the data rate-limiting device or in the server that contains the device.
  • In the above embodiment, the calculation module 31 calculates the LSH value of the received data and calculates a similarity value between the data and at least one piece of saved data from the two LSH values; the determining module 32 then determines whether to save the data according to that similarity value. Data can thus be removed according to how similar or different it is, with identical data removed first, which maximizes the diversity of the data that remains after rate limiting.
  • In the data rate-limiting device shown in FIG. 4, the determining module 32 may include a probability calculation sub-module 321 and a storing sub-module 322.
  • The probability calculation sub-module 321 is configured to calculate the pass probability of the data according to the maximum of the similarity values calculated by the calculation module 31 and the predetermined sampling rate; specifically, it may calculate the pass probability according to formula (2).
  • The storing sub-module 322 is configured to save the data when the pass probability calculated by the probability calculation sub-module 321 is greater than or equal to the preset threshold.
  • The preset threshold may be set at implementation time according to implementation requirements and/or system performance; this embodiment does not limit its value. For example, the preset threshold may be 50%.
  • Saving the data may mean that the storing sub-module 322 stores it in the cache. After the data is saved, the data rate-limiting device may further send the data saved in the cache to the downstream server, so that the data sent by the upstream server reaches the downstream server only after rate limiting.
  • The data rate-limiting device rate-limits data according to differences in similarity and preferentially removes identical data, thereby maximizing the diversity of the data that remains after rate limiting.
  • Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment or portion of code that includes one or more executable instructions for implementing the steps of a particular logical function or process. The scope of the preferred embodiments of the application includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in reverse order depending on the functionality involved, as will be understood by those skilled in the art to which the embodiments of the present application pertain.
  • Portions of the application can be implemented in hardware, software, firmware, or a combination thereof.
  • In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system.
  • For example, if implemented in hardware, as in another embodiment, they can be implemented by any one or a combination of the following techniques well known in the art:
  • a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit (ASIC) with suitable combinational logic gates, a Programmable Gate Array (PGA), or a Field Programmable Gate Array (FPGA).
  • The functional modules in the embodiments of the present application may be integrated into one processing module, may each exist physically on their own, or two or more modules may be integrated into one module.
  • The integrated modules can be implemented in the form of hardware or in the form of software functional modules.
  • If implemented in the form of software functional modules and sold or used as stand-alone products, the integrated modules may also be stored in a computer-readable storage medium.
  • The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Complex Calculations (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

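The present application proposes a data rate-limiting method and apparatus. The method includes: calculating a locality-sensitive hash value of received data; calculating a similarity value between the data and at least one piece of saved data according to the locality-sensitive hash value of the data and the locality-sensitive hash value of the at least one piece of saved data; and determining whether to save the data according to the similarity value. The application can remove data according to how similar or different the data is and can preferentially remove identical data, thereby maximizing the diversity of the data that remains after rate limiting.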

Description

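Data rate-limiting method and apparatus
This application claims priority to Chinese patent application No. 201510250007.3, filed on May 15, 2015 and entitled "Data rate-limiting method and apparatus", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of Internet technologies, and in particular to a data rate-limiting method and apparatus.
Background
Calls between computer systems often need to be rate-limited for various reasons (insufficient resources, heavy system load, and so on). Existing rate-limiting schemes generally fall into two types: random rate limiting and hashing. Random rate limiting is usually purely random: which data is removed and which is retained is decided entirely at random, so the diversity of the data that remains cannot be guaranteed. The hash scheme compares computed hash values to determine whether two pieces of data are identical and preferentially removes identical data, but it cannot recognize two pieces of data that are merely similar.
Summary
The purpose of the present application is to solve, at least to some extent, one of the technical problems in the related art.
To this end, a first object of the present application is to propose a data rate-limiting method. The method removes data according to how similar or different the data is and preferentially removes identical data, thereby maximizing the diversity of the data that remains after rate limiting.
A second object of the present application is to propose a data rate-limiting device.
To achieve the above objects, the data rate-limiting method of an embodiment of the first aspect of the present application includes: calculating a locality-sensitive hash value of received data; calculating a similarity value between the data and at least one piece of saved data according to the locality-sensitive hash value of the data and the locality-sensitive hash value of the at least one piece of saved data; and determining whether to save the data according to the similarity value.
The data rate-limiting method of the embodiment of the present application calculates a locality-sensitive hash value of the received data, then calculates a similarity value between the data and at least one piece of saved data from the two locality-sensitive hash values, and finally determines whether to save the data according to the similarity value. Data can thus be removed according to how similar or different it is, with identical data removed first, which maximizes the diversity of the data that remains after rate limiting.
To achieve the above objects, the data rate-limiting device of an embodiment of the second aspect of the present application includes: a calculation module, configured to calculate a locality-sensitive hash value of received data and to calculate a similarity value between the data and at least one piece of saved data according to the locality-sensitive hash value of the data and the locality-sensitive hash value of the at least one piece of saved data; and a determining module, configured to determine whether to save the data according to the similarity value calculated by the calculation module.
In the data rate-limiting device of the embodiment of the present application, the calculation module calculates a locality-sensitive hash value of the received data and, from that value and the locality-sensitive hash value of at least one piece of saved data, calculates a similarity value between the data and the at least one piece of data; the determining module then determines whether to save the data according to that similarity value. Data can thus be removed according to how similar or different it is, with identical data removed first, which maximizes the diversity of the data that remains after rate limiting.
Additional aspects and advantages of the present application will be set forth in part in the description that follows, and in part will become apparent from that description or be learned through practice of the application.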
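Brief Description of the Drawings
The above and/or additional aspects and advantages of the present application will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flowchart of one embodiment of the data rate-limiting method of the present application;
FIG. 2 is a flowchart of another embodiment of the data rate-limiting method of the present application;
FIG. 3 is a schematic structural diagram of one embodiment of the data rate-limiting device of the present application;
FIG. 4 is a schematic structural diagram of another embodiment of the data rate-limiting device of the present application.
Detailed Description
Embodiments of the present application are described in detail below. Examples of the embodiments are shown in the drawings, where the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the application, and are not to be construed as limiting it. On the contrary, the embodiments of the present application include all changes, modifications and equivalents falling within the spirit and scope of the appended claims.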
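FIG. 1 is a flowchart of one embodiment of the data rate-limiting method of the present application. The method of this embodiment may be implemented by a data rate-limiting device placed between an upstream server and a downstream server. Specifically, the device may be integrated into the upstream server or the downstream server to rate-limit the data that the upstream server sends to the downstream server. Alternatively, the device may be deployed in, or as, a standalone server located between the upstream server and the downstream server to perform the same function.
As shown in FIG. 1, the data rate-limiting method may include:
Step 101: Calculate a Locality Sensitive Hashing (LSH) value of the received data.
Specifically, the received data is the data sent by the upstream server. After receiving it, the data rate-limiting device rate-limits it and then sends it to the downstream server.
Step 102: Calculate a similarity value between the data and at least one piece of saved data according to the LSH value of the data and the LSH value of the at least one piece of saved data.
The at least one piece of saved data may be data already saved in a cache, where the cache is allocated in the data rate-limiting device or in the server that contains the device.
Specifically, the similarity value may be obtained by first calculating a difference value between the LSH value of the data and the LSH value of the at least one piece of saved data, and then calculating the similarity value from that difference value.
The data rate-limiting device may calculate the similarity value from the difference value according to equation (1).
[Equation (1): Figure PCTCN2016081216-appb-000001, image not reproduced in the source]
Here, D_i is the difference value between the LSH value of the data and the LSH value of the at least one piece of saved data; S_i is the similarity value between the data and the at least one piece of saved data; i is an integer, and i ≥ 1.
In this embodiment, the difference value may be the Hamming distance (HD) between the LSH value of the data and the LSH value of the at least one piece of saved data.
Step 103: Determine whether to save the data according to the similarity value.
Specifically, the data rate-limiting device calculates a pass probability for the data from the maximum of the similarity values and a predetermined sampling rate. If the pass probability is greater than or equal to a preset threshold, the data is saved; if the pass probability is less than the preset threshold, the data is not saved. The preset threshold may be set at implementation time according to implementation requirements and/or system performance; this embodiment does not limit its value. For example, the preset threshold may be 50%.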
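The threshold test in step 103 can be sketched in a few lines. This is only an illustration: the function name `should_save` is hypothetical, the 50% default is the example threshold from the text, and the probabilities in the usage lines are the ones the worked example later reports for data No. 2, No. 3 and No. 5.

```python
def should_save(pass_probability: float, threshold: float = 0.5) -> bool:
    """Step 103 decision: save when the pass probability reaches the threshold."""
    # The comparison is inclusive, so a probability equal to the threshold passes.
    return pass_probability >= threshold


print(should_save(0.0583))  # data No. 2: below the threshold
print(should_save(0.556))   # data No. 3: above the threshold
print(should_save(0.50))    # data No. 5: exactly at the threshold
```

Because the comparison is "greater than or equal to", an item whose pass probability lands exactly on the threshold is kept.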
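Specifically, saving the data may mean storing it in the cache. After the data is saved, the data rate-limiting device may further send the data saved in the cache to the downstream server, so that the data sent by the upstream server reaches the downstream server only after rate limiting.
The data rate-limiting device may calculate the pass probability from the maximum of the similarity values and the predetermined sampling rate according to equation (2).
[Equation (2): Figure PCTCN2016081216-appb-000002, image not reproduced in the source]
Here, P is the pass probability of the data; L is the predetermined sampling rate, for example 75%; S_i is the similarity value between the data and the at least one piece of saved data, where i is an integer and i ≥ 1; Max(S_i) is the maximum of the similarity values.
In the above embodiment, the data rate-limiting device calculates the LSH value of the received data, then calculates a similarity value between the data and at least one piece of saved data from the two LSH values, and finally determines whether to save the data according to the similarity value. Data can thus be removed according to how similar or different it is, with identical data removed first, which maximizes the diversity of the data that remains after rate limiting.
The data rate-limiting method is described below using transaction data from an e-commerce platform as an example. Suppose a system must sample transaction data in real time for inspection while preserving as much diversity in the sample as possible. The predetermined sampling rate is 75%, that is, rate limiting must remove 25% of the traffic.
Suppose that, in order of serial number, the transaction data is as shown in Table 1.
Table 1
[Table 1: Figures PCTCN2016081216-appb-000003 and PCTCN2016081216-appb-000004, images not reproduced in the source]
As Table 1 shows, data No. 1 and data No. 4 are exactly the same. With the 8 pieces of data in Table 1 and a predetermined sampling rate of 75% (that is, 25% removed), the 2 pieces of data that should ideally be removed are the ones that differ least from the rest: No. 4 (no difference from No. 1) and No. 2 (differs from No. 1 only in purchase quantity), so that data No. 1, No. 3, No. 5, No. 6, No. 7 and No. 8 are retained.
By using LSH, this application keeps the sampled data as diverse as possible and preserves sufficient differences between items. This addresses the loss of data diversity under random rate limiting, and also the limitation of the hash scheme, which can only tell whether data is identical and not whether it is similar, because an ordinary hash does not preserve the degree of difference in the original content.
There are many ways to compute an LSH, such as Jaccard, SimHash or MinHash. This application takes a 64-bit SimHash implementation as an example; the SimHash values for the data in Table 1 may be as shown in Table 2 (each 0/1 is one bit, and one SimHash value can be stored in 64 bits).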
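The application does not spell out its SimHash implementation, so the following is only a generic sketch of an unweighted 64-bit SimHash. The function name `simhash64`, the use of MD5 as the per-token hash, and the whitespace tokenization and sample record in the usage line are all assumptions made for illustration.

```python
import hashlib
from typing import Iterable


def simhash64(tokens: Iterable[str]) -> int:
    """Unweighted 64-bit SimHash of a collection of string tokens."""
    counts = [0] * 64
    for token in tokens:
        # Take the first 8 bytes of an MD5 digest as a 64-bit token hash (assumption).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for bit in range(64):
            # Each token votes +1 for a set bit and -1 for a clear bit.
            counts[bit] += 1 if (token_hash >> bit) & 1 else -1
    value = 0
    for bit in range(64):
        if counts[bit] > 0:  # a bit is set when the votes for it are positive
            value |= 1 << bit
    return value


# Hypothetical transaction record, tokenized on whitespace.
record = "buyer=A item=X quantity=1 price=100"
print(format(simhash64(record.split()), "064b"))
```

Records that share most of their tokens tend to agree on most bit votes, which is why similar inputs end up a small Hamming distance apart.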
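Table 2
[Table 2: Figure PCTCN2016081216-appb-000005, image not reproduced in the source]
FIG. 2 is a flowchart of another embodiment of the data rate-limiting method of the present application. As shown in FIG. 2, the method may include:
Step 201: Allocate cache space.
The cache space is allocated in the data rate-limiting device, or in the server that contains the device, and caches the LSH values of the N most recent pieces of data sent by the upstream server. N can be configured according to actual conditions. The recommended value is the full volume of data within 5 minutes; if that exceeds 1024, N is capped at 1024 so that memory use stays within a few kilobytes.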
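A bounded cache of this kind can be sketched with a fixed-length queue. This is a minimal illustration under assumptions: the names `MAX_CACHE_ENTRIES` and `make_lsh_cache` are hypothetical, and evicting the oldest entry when the cache is full is an assumed policy that the text does not state explicitly.

```python
from collections import deque

MAX_CACHE_ENTRIES = 1024  # upper limit stated in the text


def make_lsh_cache(n: int) -> deque:
    """Bounded cache for the N most recent LSH values (oldest entry evicted first)."""
    return deque(maxlen=min(n, MAX_CACHE_ENTRIES))


cache = make_lsh_cache(3)
for lsh_value in (0b0001, 0b0010, 0b0100, 0b1000):
    cache.append(lsh_value)

print(list(cache))  # prints [2, 4, 8]: the first value was evicted
```

At the cap, 1024 values of 64 bits each amount to 8 KiB of raw hash data, which is consistent with the "few kilobytes" figure; a Python container adds its own overhead on top of that.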
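In this embodiment, given the small amount of example data, N = 3 may be assumed. The data rate-limiting device calculates and caches LSH values for the traffic data in order of serial number. After data No. 1 flows into the device, the cache is as shown in Table 3.
Table 3
Cache
1010111101001111111100101101100011010100100100110011011101000010
Step 202: When new traffic data arrives, calculate the LSH value of the data to be cached, and calculate the difference value between that LSH value and the LSH value of at least one piece of data in the cache.
In this embodiment, the difference value is expressed as the Hamming distance (HD).
After data No. 2 flows in, the data rate-limiting device calculates its LSH value, which may be as shown in Table 4.
Table 4
1010111101001111111101101101100011010100100100110011011101000010
The device then calculates the difference value against the LSH value of data No. 1 in the cache. The HD is the number of corresponding bit positions at which two LSH values differ: the LSH value of data No. 2 is compared bit by bit with that of data No. 1, and the HD equals the number of differing bits. Preferably, the HD can be computed quickly with an exclusive-or (XOR) operation.
In this embodiment, the comparison between the LSH value of data No. 2 and the LSH value of data No. 1 may be as shown in Table 5.
Table 5
[Table 5: Figure PCTCN2016081216-appb-000006, image not reproduced in the source]
As Table 5 shows, the LSH value of data No. 2 differs from that of data No. 1 in only 1 bit, so HD = 1.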
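The XOR shortcut can be sketched as follows. This is a minimal illustration rather than the application's own implementation: the names `hamming_distance` and `parse_bits` are hypothetical, and the two constants are the 64-bit values from Table 3 and Table 4, written in byte groups for readability.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two equal-width hash values differ."""
    # XOR leaves a 1 exactly where the inputs disagree; count those 1 bits.
    return bin(a ^ b).count("1")


def parse_bits(text: str) -> int:
    """Parse a binary string, ignoring the spaces used to group bytes."""
    return int(text.replace(" ", ""), 2)


# LSH value of data No. 1 (Table 3) and of data No. 2 (Table 4).
LSH_NO1 = parse_bits("10101111 01001111 11110010 11011000 11010100 10010011 00110111 01000010")
LSH_NO2 = parse_bits("10101111 01001111 11110110 11011000 11010100 10010011 00110111 01000010")

print(hamming_distance(LSH_NO1, LSH_NO2))  # prints 1, matching HD = 1
```

Storing each hash as a single integer keeps the comparison to one XOR and one bit count per cached entry.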
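Step 203: Calculate the similarity value between the data to be cached and the at least one piece of data in the cache according to the difference value.
The larger the HD, the lower the similarity value. The mapping between HD and similarity is not fixed across scenarios. In the 64-bit SimHash scenario, testing showed that when HD = 1 the accuracy of judging two items to be similar is close to 85%, whereas when HD = 10 it is below 30%.
The data rate-limiting device can then calculate the similarity value according to equation (1).
According to equation (1), the similarity between the LSH value of data No. 2 and the LSH value of data No. 1 is S = 0.93.
Step 204: Calculate the pass probability of the data to be cached according to the maximum of the similarity values and the predetermined sampling rate.
The pass probability may be calculated according to equation (2); for data No. 2, the device obtains a pass probability of 5.83%.
Step 205: Determine whether the pass probability is greater than or equal to the preset threshold. If so, perform step 206; if the pass probability is less than the preset threshold, perform step 207.
The preset threshold may be set at implementation time according to implementation requirements and/or system performance; this embodiment does not limit its value. In this embodiment, a preset threshold of 50% is used as an example.
Step 206: Store the data to be cached in the cache; this round of the process ends.
Step 207: Do not store the data to be cached in the cache; this round of the process ends.
Since the pass probability of data No. 2 is 5.83%, far below 50%, data No. 2 is not stored in the cache and this round of the process exits.
Steps 202 to 207 may then be repeated to rate-limit data No. 3 to No. 8. Since data No. 2 was not stored in the cache, the cache is as shown in Table 6.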
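Table 6
Cache after No. 2
1010111101001111111100101101100011010100100100110011011101000010
When data No. 3 enters the device, the HD between its LSH value and that of data No. 1 is 10, which gives a pass probability of 55.6%. This is greater than 50%, so data No. 3 is stored in the cache, which is then as shown in Table 7.
Table 7
Cache after No. 3
1010111101001111111100101101100011010100100100110011011101000010
1011111111000110111110101000100011010000100100100011011101100010
When data No. 4 enters the device, the HD between its LSH value and that of data No. 1 is 0, and the HD between its LSH value and that of data No. 3 is 10. The resulting pass probability is 0, so data No. 4 is not stored, and the cache remains as shown in Table 7.
When data No. 5 enters the device, the HD between its LSH value and that of data No. 1 is 9, and the HD between its LSH value and that of data No. 3 is 11. The resulting pass probability is 50%, so data No. 5 is stored in the cache, which is then as shown in Table 8.
Table 8
Cache after No. 5
1010111101001111111100101101100011010100100100110011011101000010
1011111111000110111110101000100011010000100100100011011101100010
1010111111000110111010100101100011000100101100100011001101100010
Data No. 6, No. 7 and No. 8 are processed in the same way and are not described further here.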
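The walk-through for data No. 1 to No. 5 can be replayed with a short sketch. Equations (1) and (2) are not reproduced in this text, so the code does not implement them: it looks up the pass probabilities that the walk-through itself reports, keyed by the smallest Hamming distance to any cached value (the smallest distance corresponds to the largest similarity). The names `replay`, `STREAM` and `REPORTED_PASS_PROBABILITY` are hypothetical, and storing data No. 1 unconditionally when the cache is empty is an assumption based on Table 3.

```python
from collections import deque


def hamming_distance(a: int, b: int) -> int:
    """Count differing bits via XOR."""
    return bin(a ^ b).count("1")


def parse_bits(text: str) -> int:
    return int(text.replace(" ", ""), 2)


# LSH values from Tables 3, 4, 7 and 8; data No. 4 is identical to data No. 1.
H1 = parse_bits("10101111 01001111 11110010 11011000 11010100 10010011 00110111 01000010")
H2 = parse_bits("10101111 01001111 11110110 11011000 11010100 10010011 00110111 01000010")
H3 = parse_bits("10111111 11000110 11111010 10001000 11010000 10010010 00110111 01100010")
H5 = parse_bits("10101111 11000110 11101010 01011000 11000100 10110010 00110011 01100010")
STREAM = [(1, H1), (2, H2), (3, H3), (4, H1), (5, H5)]

# Pass probabilities reported in the text for each smallest Hamming distance.
# This is a lookup of published numbers, not equation (2).
REPORTED_PASS_PROBABILITY = {0: 0.0, 1: 0.0583, 9: 0.50, 10: 0.556}


def replay(stream, threshold=0.5, capacity=3):
    """Return the serial numbers that end up stored in the cache."""
    cache = deque(maxlen=capacity)
    kept = []
    for number, lsh_value in stream:
        if cache:
            smallest = min(hamming_distance(lsh_value, c) for c in cache)
            if REPORTED_PASS_PROBABILITY[smallest] < threshold:
                continue  # step 207: not stored
        cache.append(lsh_value)  # step 206 (or the very first item)
        kept.append(number)
    return kept


print(replay(STREAM))  # prints [1, 3, 5]
```

The result matches Table 8: the exact duplicate (No. 4) and the near-duplicate (No. 2) are the items that get dropped.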
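In the above method, the data rate-limiting device rate-limits data according to differences in similarity and preferentially removes identical data, thereby maximizing the diversity of the data that remains after rate limiting.
FIG. 3 is a schematic structural diagram of one embodiment of the data rate-limiting device of the present application. The device of this embodiment can implement the process of the embodiment shown in FIG. 1. As shown in FIG. 3, the device may include a calculation module 31 and a determining module 32.
The calculation module 31 is configured to calculate the LSH value of the received data and to calculate a similarity value between the data and at least one piece of saved data according to the two LSH values. Specifically, the calculation module 31 calculates a difference value between the LSH value of the data and the LSH value of the at least one piece of saved data, and calculates the similarity value from that difference value. The difference value may be the Hamming distance between the two LSH values; specifically, the calculation module 31 may calculate the similarity value according to equation (1).
The determining module 32 is configured to determine whether to save the data according to the similarity value calculated by the calculation module 31.
The data rate-limiting device may be placed between the upstream server and the downstream server. Specifically, it may be integrated into the upstream server or the downstream server to rate-limit the data that the upstream server sends to the downstream server. Alternatively, it may be deployed in, or as, a standalone server located between the upstream server and the downstream server to perform the same function. The received data is the data sent by the upstream server; after receiving it, the device rate-limits it and then sends it to the downstream server.
The at least one piece of saved data may be data already saved in a cache, where the cache is allocated in the data rate-limiting device or in the server that contains the device.
In the above embodiment, the calculation module 31 calculates the LSH value of the received data and calculates a similarity value between the data and at least one piece of saved data from the two LSH values; the determining module 32 then determines whether to save the data according to that similarity value. Data can thus be removed according to how similar or different it is, with identical data removed first, which maximizes the diversity of the data that remains after rate limiting.
FIG. 4 is a schematic structural diagram of another embodiment of the data rate-limiting device of the present application. Compared with the device shown in FIG. 3, the difference is that in the device shown in FIG. 4 the determining module 32 may include a probability calculation sub-module 321 and a storing sub-module 322.
The probability calculation sub-module 321 is configured to calculate the pass probability of the data according to the maximum of the similarity values calculated by the calculation module 31 and the predetermined sampling rate; specifically, it may calculate the pass probability according to equation (2).
The storing sub-module 322 is configured to save the data when the pass probability calculated by the probability calculation sub-module 321 is greater than or equal to the preset threshold. The preset threshold may be set at implementation time according to implementation requirements and/or system performance; this embodiment does not limit its value. For example, the preset threshold may be 50%.
Specifically, saving the data may mean that the storing sub-module 322 stores it in the cache. After the data is saved, the data rate-limiting device may further send the data saved in the cache to the downstream server, so that the data sent by the upstream server reaches the downstream server only after rate limiting.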
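One way to picture the module split of FIG. 3 and FIG. 4 is the class sketch below. It is only an illustration of the division of responsibilities: the class and method names are hypothetical, and because equations (1) and (2) are not reproduced in this text, both are injected as caller-supplied functions rather than implemented. Saving the first item when nothing is cached yet is an assumption.

```python
from typing import Callable, List, Sequence


class CalculationModule:
    """Counterpart of calculation module 31: LSH value plus similarity values."""

    def __init__(self, lsh: Callable[[str], int], similarity: Callable[[int], float]):
        self.lsh = lsh                # caller-supplied hash, e.g. a 64-bit SimHash
        self.similarity = similarity  # caller-supplied stand-in for equation (1)

    def hash_of(self, data: str) -> int:
        return self.lsh(data)

    def similarities(self, value: int, cached: Sequence[int]) -> List[float]:
        # Hamming distance via XOR, then the supplied distance-to-similarity mapping.
        return [self.similarity(bin(value ^ c).count("1")) for c in cached]


class DeterminingModule:
    """Counterpart of determining module 32 with sub-modules 321 and 322."""

    def __init__(self, pass_probability: Callable[[float, float], float],
                 sampling_rate: float = 0.75, threshold: float = 0.5):
        self.pass_probability = pass_probability  # stand-in for equation (2)
        self.sampling_rate = sampling_rate
        self.threshold = threshold
        self.cache: List[int] = []

    def probability(self, similarities: Sequence[float]) -> float:
        # Sub-module 321: uses the maximum similarity and the sampling rate.
        return self.pass_probability(max(similarities), self.sampling_rate)

    def store_if_passing(self, value: int, similarities: Sequence[float]) -> bool:
        # Sub-module 322: save when the pass probability reaches the threshold.
        if not similarities or self.probability(similarities) >= self.threshold:
            self.cache.append(value)
            return True
        return False
```

Injecting the two formulas keeps the sketch honest about what the text specifies (the flow and the threshold test) and what it leaves to the figures.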
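The above data rate-limiting device rate-limits data according to differences in similarity and preferentially removes identical data, thereby maximizing the diversity of the data that remains after rate limiting.
It should be noted that, in the description of the present application, the terms "first", "second" and the like are used for descriptive purposes only and are not to be understood as indicating or implying relative importance. In addition, in the description of the present application, "multiple" means two or more unless otherwise stated.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment or portion of code that includes one or more executable instructions for implementing the steps of a particular logical function or process. The scope of the preferred embodiments of the application includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in reverse order depending on the functionality involved, as will be understood by those skilled in the art to which the embodiments of the present application pertain.
It should be understood that portions of the application can be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they can be implemented by any one or a combination of the following techniques well known in the art: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit with suitable combinational logic gates, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), and so on.
Those of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, the functional modules in the embodiments of the present application may be integrated into one processing module, may each exist physically on their own, or two or more modules may be integrated into one module. The integrated modules can be implemented in the form of hardware or in the form of software functional modules. If implemented in the form of software functional modules and sold or used as stand-alone products, the integrated modules may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.
In this specification, a description that refers to the terms "one embodiment", "some embodiments", "an example", "a specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present application. Such expressions do not necessarily refer to the same embodiment or example, and the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the present application have been shown and described above, it is to be understood that they are exemplary and are not to be construed as limiting the application; those of ordinary skill in the art may change, modify, replace and vary the above embodiments within the scope of the application.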

Claims (8)

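  1. A data rate-limiting method, comprising:
    calculating a locality-sensitive hash value of received data;
    calculating, according to the locality-sensitive hash value of the data and the locality-sensitive hash value of at least one piece of saved data, a similarity value between the data and the at least one piece of data; and
    determining, according to the similarity value, whether to save the data.
  2. The method according to claim 1, wherein calculating, according to the locality-sensitive hash value of the data and the locality-sensitive hash value of the at least one piece of saved data, the similarity value between the data and the at least one piece of data comprises:
    calculating a difference value between the locality-sensitive hash value of the data and the locality-sensitive hash value of the at least one piece of data; and
    calculating the similarity value between the data and the at least one piece of data according to the difference value.
  3. The method according to claim 1 or 2, wherein determining, according to the similarity value, whether to save the data comprises:
    calculating a pass probability of the data according to the maximum of the similarity values and a predetermined sampling rate; and
    saving the data if the pass probability is greater than or equal to a preset threshold.
  4. The method according to claim 2, wherein the difference value between the locality-sensitive hash value of the data and the locality-sensitive hash value of the at least one piece of data comprises the Hamming distance between the locality-sensitive hash value of the data and the locality-sensitive hash value of the at least one piece of data.
  5. A data rate-limiting apparatus, comprising:
    a calculation module, configured to calculate a locality-sensitive hash value of received data and to calculate, according to the locality-sensitive hash value of the data and the locality-sensitive hash value of at least one piece of saved data, a similarity value between the data and the at least one piece of data; and
    a determination module, configured to determine, according to the similarity value calculated by the calculation module, whether to save the data.
  6. The apparatus according to claim 5, wherein
    the calculation module is specifically configured to calculate a difference value between the locality-sensitive hash value of the data and the locality-sensitive hash value of the at least one piece of saved data, and to calculate the similarity value between the data and the at least one piece of data according to the difference value.
  7. The apparatus according to claim 5 or 6, wherein the determination module comprises:
    a probability calculation submodule, configured to calculate a pass probability of the data according to the maximum of the similarity values calculated by the calculation module and a predetermined sampling rate; and
    a storing submodule, configured to save the data when the pass probability calculated by the probability calculation submodule is greater than or equal to a preset threshold.
  8. The apparatus according to claim 6, wherein
    the difference value calculated by the calculation module comprises the Hamming distance between the locality-sensitive hash value of the data and the locality-sensitive hash value of the at least one piece of data.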
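The steps recited in claims 1 to 4 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: SimHash is used here as one possible locality-sensitive hash, the similarity is assumed to be one minus the normalized Hamming distance, and the `pass_probability` formula is a hypothetical choice that merely combines the maximum similarity with the sampling rate as the claims require. All function names are illustrative.

```python
import hashlib

HASH_BITS = 64  # assumed fingerprint width


def simhash(text, bits=HASH_BITS):
    """SimHash over whitespace tokens (one possible locality-sensitive hash)."""
    weights = [0] * bits
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def similarity(a, b, bits=HASH_BITS):
    """Assumed mapping from difference value to similarity value in [0, 1]."""
    return 1.0 - hamming_distance(a, b) / bits


def pass_probability(max_similarity, sample_rate):
    """Hypothetical formula: identical data passes at the sampling rate,
    completely dissimilar data passes with probability 1."""
    return sample_rate + (1.0 - sample_rate) * (1.0 - max_similarity)


def should_keep(text, saved_hashes, sample_rate, threshold):
    """Return (keep, fingerprint) for one received piece of data."""
    h = simhash(text)
    max_sim = max((similarity(h, s) for s in saved_hashes), default=0.0)
    keep = pass_probability(max_sim, sample_rate) >= threshold
    return keep, h
```

With a sampling rate of 0.1 and a threshold of 0.5, the first piece of data is kept because nothing has been saved yet, while an exact repeat of it yields a pass probability of 0.1 and is dropped.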
PCT/CN2016/081216 2015-05-15 2016-05-06 Data rate-limiting method and apparatus WO2016184316A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510250007.3A CN106302202B (zh) 2015-05-15 2015-05-15 Data rate-limiting method and apparatus
CN201510250007.3 2015-05-15

Publications (1)

Publication Number Publication Date
WO2016184316A1 (zh) 2016-11-24

Family

ID=57319444

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/081216 WO2016184316A1 (zh) 2015-05-15 2016-05-06 Data rate-limiting method and apparatus

Country Status (2)

Country Link
CN (1) CN106302202B (zh)
WO (1) WO2016184316A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
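CN101158967A (zh) * 2007-11-16 2008-04-09 北京交通大学 Fast audio advertisement recognition method based on hierarchical matching
CN102722554A (zh) * 2012-05-28 2012-10-10 中国人民解放军信息工程大学 Method for reducing the randomness of locality-sensitive hashing
CN102929891A (zh) * 2011-08-11 2013-02-13 阿里巴巴集团控股有限公司 Method and apparatus for processing text
EP2685404A2 (en) * 2012-07-10 2014-01-15 Facebook, Inc. Method and system for determining image similarity
CN103530812A (zh) * 2013-07-25 2014-01-22 国家电网公司 Quantitative analysis method for power grid state similarity based on locality-sensitive hashing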

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8050251B2 (en) * 2009-04-10 2011-11-01 Barracuda Networks, Inc. VPN optimization by defragmentation and deduplication apparatus and method
CN102622366B (zh) * 2011-01-28 2014-07-30 阿里巴巴集团控股有限公司 Method and apparatus for identifying similar images
CN102323958A (zh) * 2011-10-27 2012-01-18 上海文广互动电视有限公司 Data deduplication method
CN103916421B (zh) * 2012-12-31 2017-08-25 ***通信集团公司 Cloud storage data service apparatus, data transmission ***, server and method
US9690711B2 (en) * 2013-03-13 2017-06-27 International Business Machines Corporation Scheduler training for multi-module byte caching
CN103258005B (zh) * 2013-04-12 2017-02-08 百度在线网络技术(北京)有限公司 Method and apparatus for processing search results
CN103559259A (zh) * 2013-11-04 2014-02-05 同济大学 Method for eliminating near-duplicate web pages based on a cloud platform
CN103744964A (zh) * 2014-01-06 2014-04-23 同济大学 Web page classification method based on a locality-sensitive hash function
CN103984753B (zh) * 2014-05-28 2018-02-09 北京京东尚科信息技术有限公司 Method and apparatus for extracting deduplication feature values for a web crawler

Also Published As

Publication number Publication date
CN106302202B (zh) 2020-07-28
CN106302202A (zh) 2017-01-04

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 16795803; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 16795803; Country of ref document: EP; Kind code of ref document: A1)