CN116775969A - Data stream processing method, device, electronic equipment and computer readable medium


Info

Publication number
CN116775969A
Authority
CN
China
Prior art keywords
request data
time window
condition
data
window
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310745597.1A
Other languages
Chinese (zh)
Inventor
汪忠祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongdun Network Technology Co ltd
Original Assignee
Tongdun Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongdun Network Technology Co ltd filed Critical Tongdun Network Technology Co ltd
Priority to CN202310745597.1A
Publication of CN116775969A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/906 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/901 - Indexing; Data structures therefor; Storage structures

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a data stream processing method and apparatus, an electronic device and a computer readable medium, and relates to the field of computer technology. Request data of a current batch are determined based on a time window, the time window being dynamically adjusted based on the classification result of the request data; an index identifier corresponding to each piece of request data is determined, each index identifier pointing to one record; the request data are classified based on the index identifiers to obtain request data classes, and when a request data class meets the writing condition, it is written into the corresponding record according to its index identifier. The method accumulates the request data into batches based on a dynamic time window before writing, so that multiple pieces of request data are written together, which reduces IO consumption in data stream reading and writing, simplifies database configuration requirements and lowers service operation cost; moreover, the time window is dynamically adjusted so that its size automatically adapts to the batching effect, which better matches the dynamic data read-write requirements of the service and flexibly improves storage performance.

Description

Data stream processing method, device, electronic equipment and computer readable medium
Technical Field
The present disclosure relates to the field of computer technology, and in particular, to a data stream processing method, apparatus, electronic device, and computer readable medium.
Background
Aerospike (AS for short) is a distributed, scalable key-value database; its storage model is organized as "namespace - set - record - bin (database field)".
When Aerospike is used for data storage, the core data structures include record, bin and value. The hierarchy may be [record - (bin1-value1, bin2-value2, ..., binN-valueN)], or [record - bin - (key1-value1, key2-value2, ..., keyN-valueN)]. When reading and writing data, the former structure allows bin-level data to be added or deleted directly without reading the record first; the latter structure uses only 1 bin and, since it avoids the limit on the number of bins, can store multiple "keyN-valueN" entries, but every read or write requires reading out the whole record, merging in the new "keyN+1-valueN+1", and writing the record back as a whole.
Therefore, with the former data structure the number of bins is limited and it is difficult to meet large-scale data service requirements; with the latter data structure, having to read and write the whole record each time leads to extremely large IO consumption, requiring costly database configurations to provide the corresponding IO performance and network bandwidth, or the support of a large-scale cluster.
Disclosure of Invention
The disclosure aims to provide a data stream processing method and apparatus, an electronic device and a computer readable medium, which can reduce IO consumption in data stream reading and writing, thereby improving system performance and reducing service operation cost.
According to a first aspect of the present disclosure, there is provided a data stream processing method, which may include: determining request data of the current batch based on a time window, wherein the time window is dynamically adjusted and obtained based on a classification result of the request data; determining index identifiers corresponding to each request data, wherein each index identifier points to one record; classifying the request data based on the index identifier to obtain a request data class; and under the condition that the request data class meets the writing condition, writing the request data class into the corresponding record according to the index identification.
Optionally, the step of dynamically adjusting the time window based on the classification result of the request data includes: expanding the time window when the request data class of the current batch meets the writing condition.
Optionally, the step of dynamically adjusting the time window based on the classification result of the request data includes: acquiring the request data of a previous batch and classifying the acquired request data based on the index identifier, wherein the previous batch is adjacent to the current batch; determining a change state parameter of the request data class of the current batch relative to the request data class of the previous batch; and dynamically adjusting the time window based on the change state parameter.
Optionally, dynamically adjusting the time window based on the change state parameter includes: expanding the time window under the condition that the change state parameter meets the expansion condition; in case the change state parameter does not meet the expansion condition, the time window is kept unchanged.
Optionally, expanding the time window in the case that the change state parameter meets the expansion condition includes: expanding the time window under the condition that the change state parameter meets the expansion condition and the time window is smaller than the maximum window; and under the condition that the change state parameter meets the expansion condition and the time window is equal to the maximum window, keeping the time window unchanged.
Optionally, the step of dynamically adjusting the time window based on the classification result of the request data further includes: acquiring a persistence state parameter of the time window when the time window is equal to the maximum window, wherein the persistence state parameter includes at least one of window duration and window batch count; and reducing the time window to a minimum window when the persistence state parameter is greater than or equal to a persistence state threshold.
Optionally, classifying the request data based on the index identifier, and after obtaining the request data class, further includes: expanding a time window under the condition that the request data class does not accord with the writing condition; and determining the request data of the current batch based on the expanded time window, and cycling until the request data class under the time window meets the writing condition.
According to a second aspect of the present disclosure there is provided a data stream processing apparatus, the apparatus may comprise: the data acquisition module is used for determining the request data of the current batch based on a time window, and the time window is dynamically adjusted and acquired based on the classification result of the request data; the data index module is used for determining index identifiers corresponding to each request data, and each index identifier points to one record; the data classification module is used for classifying the request data based on the index identifier to obtain a request data class; and the data writing module is used for writing the request data class into the corresponding record according to the index identifier under the condition that the request data class meets the writing condition.
Optionally, the apparatus further comprises a window adjustment module for expanding the time window when the request data class of the current batch meets the writing condition.
Optionally, the window adjustment module is further configured to acquire the request data of a previous batch, which is adjacent to the current batch, and classify the acquired request data based on the index identifier; determine a change state parameter of the request data class of the current batch relative to the request data class of the previous batch; and dynamically adjust the time window based on the change state parameter.
Optionally, the window adjustment module is specifically configured to expand the time window when the change state parameter meets an expansion condition; in case the change state parameter does not meet the expansion condition, the time window is kept unchanged.
Optionally, the window adjustment module is specifically configured to expand the time window when the change state parameter meets the expansion condition and the time window is smaller than the maximum window; and under the condition that the change state parameter meets the expansion condition and the time window is equal to the maximum window, keeping the time window unchanged.
Optionally, the window adjustment module is further configured to acquire a persistence state parameter of the time window when the time window is equal to the maximum window, wherein the persistence state parameter includes at least one of window duration and window batch count; and reduce the time window to a minimum window when the persistence state parameter is greater than or equal to a persistence state threshold.
Optionally, the window adjustment module is further configured to expand the time window when the request data class does not meet the writing condition; and determining the request data of the current batch based on the expanded time window, and cycling until the request data class under the time window meets the writing condition.
According to a third aspect of the present disclosure there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the data stream processing method of the first aspect described above.
According to a fourth aspect of the present disclosure, there is provided an electronic device comprising:
a processor; and
a memory for storing a computer program of the processor;
wherein the processor is configured to implement the data stream processing method of the first aspect described above via execution of a computer program.
According to a fifth aspect of the present disclosure, there is provided a computer program product which, when run on an electronic device, causes the electronic device to perform the steps of the data stream processing method of the first aspect.
The data stream processing method provided by the disclosure determines the request data of the current batch based on a time window, the time window being dynamically adjusted based on the classification result of the request data; determines the index identifier corresponding to each piece of request data, each index identifier pointing to one record; and further classifies the request data based on the index identifiers to obtain request data classes, writing a request data class into the corresponding record according to the index identifier when it meets the writing condition. Before writing, the method accumulates the request data into batches based on a dynamic time window and classifies them by the index identifiers corresponding to the records, so that multiple pieces of request data are written together; this reduces the number of data read-write operations and the IO consumption of data stream reading and writing, simplifies the database configuration requirements, avoids the need for large-scale cluster support, and reduces service operation cost. Moreover, the time window can be dynamically adjusted according to the classification result of the request data, so that its size automatically adapts to the batching effect, which matches the dynamic data read-write requirements of the service and flexibly improves storage performance.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.
Fig. 1 is a first flowchart of a data stream processing method according to an embodiment of the present disclosure.
Fig. 2 is a second flowchart of a data stream processing method according to an embodiment of the present disclosure.
Fig. 3 is a schematic operation flow diagram of a data stream processing method in an embodiment of the disclosure.
Fig. 4 is a schematic structural diagram of a data stream processing device according to an embodiment of the present disclosure.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present disclosure. However, those skilled in the art will recognize that the aspects of the present disclosure may be practiced with one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
It should be noted that all data obtained in the present disclosure, including the request data, are accessed, collected, stored and used for subsequent analysis only after the user or the party to whom the data belong has been explicitly informed of the collection content, data usage, processing method and other relevant information and has given consent and authorization; means for accessing, correcting and deleting the data, and for withdrawing consent and authorization, may be provided to the user or the party to whom the data belong.
In current Aerospike data stores, a typical data hierarchy may be as follows:
3123yh1qaz23wqaswq2we3wqedsaqwe3{
DATA1{
"ipaddress precursor: shaanxi, phone:17676555418"
}
DATA2{
"ipaddress precursor: zhejiang, phone:17676222418"
}
}
Wherein "3123yh1qaz, 23wqaswq2we3wqedsaqwe3" is a key of record available from (hash (phone: 17676555418); "DATA1" is the name of the bin; "ipaddress precursor: shaanxi, phone:17676555418" is a specific value, and so on.
On this basis, when data are read and written, the corresponding bin under the record can be operated on directly without reading the whole record. For example, if 500 bins already exist under a record and one more bin needs to be added, the new bin can be added directly, without reading out the 500 bins of the whole record. However, this storage structure is limited by the number of bins: before Aerospike version 5.0 the limit is 32767 bins, and from version 5.0 onward it is 65535 bins, so the data-carrying capacity of the system is low and it is difficult to meet the growing data processing requirements of the business.
Alternatively, the hierarchical structure of typical data may also be as follows:
3123yh1qaz23wqaswq2we3wqedsaqwe3{
DATA{
1872653872927{
"ipaddress precursor: shaanxi, phone:17676555418"
}
1872653872928{
"ipaddress precursor: heilongjiang, phone:17676534318"
}
1872653872929{
"ipaddress precursor: hebei, phone:17676534318"
}
}
}
Wherein "3123yh1qaz, 23wqaswq2we3wqedsaqwe3" is a key of record available from (hash (phone: 17676555418); "DATA" is the name of the bin; "1872653872927" is a key for a particular value; "ipaddress precursor: shaanxi, phone:17676555418" is a specific value, and so on. It should be noted that each pair of value and its corresponding key under the bin may be regarded as a whole data processing, so that each piece of data under the bin is internally designed to be a key-value (hereinafter referred to as k-v) structure, so that multiple pieces of data may be stored based on a single bin.
On this basis, the number of bins under the record is kept small and the internal structure of each piece of data in the bin is standardized, so that multiple pieces of data are stored under one bin. Data storage is then no longer constrained by the bin-count limit but only by the overall record size: before Aerospike version 4.2 the upper limit on record size is 1 MB, and from version 4.2 onward it is 8 MB, so any number of entries can be stored in the bin as long as the size limit is not exceeded. In practice, a single node of one namespace can hold 4,294,967,296 records and each record can store 8 MB of data, so this storage structure can meet the data storage requirements of business scenarios.
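For orientation only, the two layouts can be modeled as plain Python data (the field names and values are copied from the examples above; this is not Aerospike client code):

# Layout 1: one value per bin; bins can be written independently,
# but the bin count per record is capped (32767 / 65535 depending on version).
record_multi_bin = {
    "DATA1": "ip attribution: Shaanxi, phone:17676555418",
    "DATA2": "ip attribution: Zhejiang, phone:17676222418",
}

# Layout 2: a single bin ("DATA") holding a map of k-v entries; capacity is
# bounded only by the overall record size (1 MB / 8 MB depending on version),
# but every update reads and rewrites the whole record.
record_single_bin = {
    "DATA": {
        "1872653872927": "ip attribution: Shaanxi, phone:17676555418",
        "1872653872928": "ip attribution: Heilongjiang, phone:17676534318",
    }
}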
However, with this data structure the object actually operated on in each read or write is a specific k-v inside the bin, while in practice only the bin level can be addressed directly. To handle such data structures flexibly, the UDF (User-Defined Function) mechanism supported by Aerospike can be used, for example by writing a Lua script that performs the UDF operation. The minimum unit of each UDF read or write is the record: the whole record must be read out, the data processed, and the record written back. For example, if a record has 1 bin and 500 k-v entries under that bin, adding 1 k-v to the bin requires reading out the whole record, merging in the k-v to be added, and writing the whole record back.
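The read-merge-write cycle just described can be sketched as follows (a minimal sketch in Python pseudologic rather than an actual Lua UDF; the in-memory dict stands in for the Aerospike record store):

store = {}   # in-memory stand-in for the record store; the real store is Aerospike

def read_record(key):
    # the whole record (every k-v under its single bin) comes out on each read
    return store.get(key, {"DATA": {}})

def write_record(key, record):
    # the whole record is written back in one piece
    store[key] = record

def add_kv(key, new_k, new_v):
    record = read_record(key)       # e.g. 500 existing k-v entries are read out
    record["DATA"][new_k] = new_v   # only one new entry is merged in memory
    write_record(key, record)       # yet the whole record is rewritten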
In the above read-write operation, suppose the record has 1 bin, the bin holds 1024 k-v entries, and each k-v is 2 KB. When a user-side request wants to write one 2 KB k-v, the 1024 x 2 KB of data under the record is read from the Aerospike disk into the Aerospike memory, the 2 KB k-v to be written is merged in memory by the UDF operation, and then the 1024 x 2 KB + 2 KB record is written back from memory to disk. In this process, roughly 2 MB is read and 2 MB is written in order to persist 2 KB of data. On this basis, a mere 1024 TPS (transactions per second) of writes in a high-concurrency scenario would already produce about 4 GB of IO per second.
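The figures in this example can be checked with a few lines of arithmetic (the 2 KB entry size and 1024 existing entries are the assumptions stated above):

entry_kb = 2                          # size of one k-v entry, as assumed above
entries = 1024                        # entries already stored under the bin
read_kb = entries * entry_kb          # about 2 MB read from disk
write_kb = read_kb + entry_kb         # about 2 MB written back
io_per_write_kb = read_kb + write_kb
print(io_per_write_kb / 1024, "MB of IO to persist 2 KB")                     # about 4 MB
print(io_per_write_kb * 1024 / 1024 ** 2, "GB of IO per second at 1024 TPS")  # about 4 GB/s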
Therefore, IO amplification is severe when reading and writing under this data storage structure, and in high-TPS business scenarios it seriously affects system performance, so the database configuration or cluster scale has to be increased further to meet business requirements, at high operating cost.
In the embodiments of the disclosure, a batching step is introduced before data are read and written: the request data of each batch are determined based on a dynamically adjusted time window, and the request data are classified by the index identifiers corresponding to the records and then written in a consolidated manner, which reduces the number of writes, lowers the network resource requirements, and reduces the operation cost. The details are described below with reference to the drawings.
Fig. 1 is a first flowchart of a data stream processing method according to an embodiment of the present disclosure. As shown in Fig. 1, the method may include the following steps 101 to 104.
Step 101, determining the request data of the current batch based on a time window, wherein the time window is dynamically adjusted and obtained based on the classification result of the request data.
In the embodiment of the disclosure, when the application memory receives users' request data while the service is running, the received request data can be classified and batched before the read-write operation. Different request data correspond to different request times, and the request data of each batch can be determined by a dynamically adjusted time window. For example, if data collection starts at 10:00 and the time window is 10 minutes, the request data whose request time falls between 10:00 and 10:10 can be taken as the first batch, the request data between 10:10 and 10:20 as the second batch, and so on. Further, while the request data received between 10:00 and 10:10 are being processed, the first batch is the current batch; while the request data received between 10:10 and 10:20 are being processed, the second batch is the current batch and the first batch is the previous batch, and so on.
In the embodiment of the disclosure, the time window can be dynamically adjusted based on the classification result of the request data. In general, the longer the time window, the fewer writes are needed in total; however, the application memory is limited and service data have certain real-time requirements, while a short time window may fail to classify the request data effectively. The time window is therefore dynamically adjusted based on the classification result: it is expanded when the classification result still has room for improvement and shortened when service quality would be affected, so that it flexibly adapts to the actual environment of data stream processing.
For example, 10 pieces of request data on the user side are received as follows:
{ phone: 176800987, address: shaan
Western, ip 127.12.13.1, blackbox 12, ffwer3223f93jf8sss8f22fsfs, time 10:01:01}
{ phone: 176800985, address: shaan
Western, ip 127.12.13.2, blackbox:22ffwer3223f93jf8sss8f12fsfs, time:10:01:02}
{ phone: 176800983, address: shaan
Western, ip 127.12.13.4, blackbox: c2ffwer3223f93jf8sss8fbyfs, time:10:01:04}
{ phone: 176800987, address: shaan
Western, ip 127.12.13.9, blackbox 42ffwer3223f93jf8sss8f32fsfs, time 10:01:08}
{ phone: 176800983, address: shaan
Western, ip 127.12.13.7, blackbox g2ffwer3223f93jf8sss8f20fsfs, time:10:01:10}
{ phone: 176800987, address: shaan
Western, ip 127.12.23.6, blackbox: k2ffwer3223f93jf8sss8fhhfsfs, time:10:02:01}
{ phone: 176809681, address: shaan
Western, ip 127.12.43.6, blackbox: k2ffwer3223f93jf8sss8fhhfsfs, time:10:02:06}
{ phone: 176809681, address: shaan
Western, ip 117.13.13.6, blackbox: k2ffwer3223f93jf8sss8fhhfsfs, time:10:03:01}
{ phone: 176809681, address: shaan
Western, ip 147.12.13.6, blackbox: k2ffwer3223f93jf8sss8fhhfsfs, time:10:03:02}
{ phone: 176800987, address: shaan
Western, ip 187.12.13.6, blackbox: k2ffwer3223f93jf8sss8fhhfsfs, time:10:03:05}
In the case that the time window is 1 min, the request data of the first batch include "time:10:01:01", "time:10:01:02", "time:10:01:04", "time:10:01:08", "time:10:01:10"; the request data of the second batch include "time:10:02:01", "time:10:02:06"; the request data of the third batch include "time:10:03:01", "time:10:03:02", "time:10:03:05".
In the case that the time window is 2 min, the request data of the first batch include "time:10:01:01", "time:10:01:02", "time:10:01:04", "time:10:01:08", "time:10:01:10", "time:10:02:01", "time:10:02:06"; the request data of the second batch include "time:10:03:01", "time:10:03:02", "time:10:03:05".
In the case of a time window of 3 minutes, the request data of the first batch includes all 10 pieces of the above request data.
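For illustration, the batching of these example requests for a given window length can be sketched as follows (plain Python; the window start of 10:01:00 is an assumption chosen so that the resulting batches match the lists above):

from datetime import datetime, timedelta

request_times = [  # request time of each of the 10 example requests above
    "10:01:01", "10:01:02", "10:01:04", "10:01:08", "10:01:10",
    "10:02:01", "10:02:06", "10:03:01", "10:03:02", "10:03:05",
]

def batch_by_window(times, start="10:01:00", window_minutes=1):
    fmt = "%H:%M:%S"
    start_t = datetime.strptime(start, fmt)
    width = timedelta(minutes=window_minutes)
    batches = {}
    for t in times:
        idx = int((datetime.strptime(t, fmt) - start_t) / width)
        batches.setdefault(idx, []).append(t)
    return [batches[i] for i in sorted(batches)]

print(batch_by_window(request_times, window_minutes=1))   # 3 batches of 5 / 2 / 3 requests
print(batch_by_window(request_times, window_minutes=3))   # 1 batch with all 10 requests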
Step 102, determining index identifiers corresponding to each piece of request data, wherein each index identifier points to one record.
In the embodiment of the disclosure, the index identifier is a key that Aerospike can recognize as pointing to the corresponding record, or other data used to generate that key. For example, when Aerospike uses the hash value hash(phone number) of a phone number as the key of a record, the phone number in the request data can be used as the index identifier of that request data.
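A minimal sketch of this mapping, assuming the phone number is the index identifier and using a generic digest that merely stands in for whatever key derivation the database actually applies:

import hashlib

def index_identifier(request):
    # the field that decides which record the request belongs to
    return request["phone"]

def record_key(identifier):
    # illustrative only: a stable digest standing in for hash(phone number);
    # the actual key derivation used by the database is not specified here
    return hashlib.sha1(identifier.encode()).hexdigest()

req = {"phone": "17676555418", "ip": "127.12.13.1"}   # hypothetical request
print(record_key(index_identifier(req)))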
For example, for the 10 pieces of request data, the mobile phone number corresponding to each piece of request data may be determined in the request data of each batch.
Taking the time window of 3min as an example, it may be determined that the index identifier corresponding to each piece of request data is as follows:
phone:17629000987
phone:17629000985
phone:17629000983
phone:17629000987
phone:17629000983
phone:17629000987
phone:17629020981
phone:17629020981
phone:17629020981
phone:17629000987
Step 103, classifying the request data based on the index identifier to obtain a request data class.
In the embodiment of the disclosure, the index identifier may point to one record, that is, the request data with the same index identifier needs to be written into the same record, while the request data with different index identifiers needs to be written into different records, which do not affect each other. Thus, the request data may be classified in each batch based on the index identification, resulting in a corresponding request data class. The number of request data classes should be smaller than the total number of request data so that the number of reads of records at writing can be reduced.
For example, after classifying the request data by mobile phone number, the request data class "phone:17629020981" contains 3 pieces of request data, "phone:17629000983" contains 2 pieces, "phone:17629000985" contains 1 piece, and "phone:17629000987" contains 4 pieces.
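Step 103 amounts to a group-by over the batch; a sketch (the phone values are hypothetical placeholders):

from collections import defaultdict

def classify(batch):
    # group the batch's requests by their index identifier (here, the phone number)
    classes = defaultdict(list)
    for request in batch:
        classes[request["phone"]].append(request)
    return classes

batch = [{"phone": "p1"}, {"phone": "p2"}, {"phone": "p1"}, {"phone": "p3"}]
print({k: len(v) for k, v in classify(batch).items()})   # {'p1': 2, 'p2': 1, 'p3': 1}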
Step 104, writing the request data class into the corresponding record according to the index identifier when the request data class meets the writing condition.
On this basis, each request data class can be written into the record corresponding to its index identifier. Since a class may contain more than one piece of request data, request data that would otherwise require multiple writes are written in a single operation, which reduces the number of times the whole record is read.
For example, after the 10 pieces of request data are divided into 4 request data classes by mobile phone number, the number of read-write operations drops from 10 to 4. With conventional read-write operations the IO is 10 x (2 MB + (2 MB + 2 KB)) ≈ 40 MB, whereas with the method provided in the embodiment of the disclosure the IO is (2 MB + (2 MB + 4 x 2 KB)) + (2 MB + (2 MB + 3 x 2 KB)) + (2 MB + (2 MB + 2 x 2 KB)) + (2 MB + (2 MB + 1 x 2 KB)) ≈ 16 MB, so IO consumption is significantly reduced.
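Writing one class at a time therefore means one read-modify-write per record instead of one per request; a self-contained sketch (the identifiers and payloads are hypothetical, and the dict again stands in for the record store):

store = {}   # in-memory stand-in for the record store

def write_class(identifier, requests):
    record = store.get(identifier, {"DATA": {}})   # one whole-record read per class
    for req in requests:                           # merge every request of the class
        record["DATA"][req["seq"]] = req["payload"]
    store[identifier] = record                     # one whole-record write per class

classes = {
    "17629000987": [{"seq": 1, "payload": "a"}, {"seq": 2, "payload": "b"}],
    "17629000985": [{"seq": 3, "payload": "c"}],
}
for identifier, requests in classes.items():
    write_class(identifier, requests)              # 2 record writes instead of 3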
Fig. 2 is a second flowchart of a data stream processing method according to an embodiment of the present disclosure, in which the time window is dynamically adjusted based on the classification result of the request data, as shown in Fig. 2:
Step 201, determining request data of the current batch based on a time window, wherein the time window is dynamically adjusted and obtained based on a classification result of the request data.
In the embodiment of the disclosure, step 201 may correspond to the related description of step 101, and is not repeated here.
Step 202, determining index identifiers corresponding to each request data, wherein each index identifier points to one record.
In the embodiment of the disclosure, step 202 may correspond to the related description of step 102, and is not repeated here.
Step 203, classifying the request data based on the index identifier to obtain a request data class.
In the embodiment of the disclosure, step 203 may correspond to the related description of step 103, and is not repeated here.
Step 204, writing the request data class into the corresponding record according to the index identifier when the request data class meets the writing condition.
In the embodiment of the disclosure, step 204 may correspond to the related description of step 104, and is not repeated here.
In an alternative method embodiment of the present disclosure, the step of dynamically adjusting the time window based on the classification result of the request data may include the following step 205:
Step 205, expanding the time window when the request data class of the current batch meets the writing condition.
In the embodiment of the disclosure, the writing condition may be a condition indicating that the obtained request data classes meet the batching requirement, and the batching requirement may concern the number of request data processed, the number of request data classes obtained, the number of request data contained in each class, and the like. When the request data classes of the current batch meet the writing condition, the time window can be considered to at least partially satisfy the batching requirement, and the time window can then be further expanded to test whether the classification effect still has room for improvement.
In the embodiment of the disclosure, the time window may be expanded by the same amount each time, for example by a fixed increment such as 30 s or 1 min added to the existing time window; alternatively, the expansion amount may decrease with the number and magnitude of expansions already performed, for example 2 min for the first expansion, 1 min for the second and 30 s for the third. This is not specifically limited in the embodiments of the present disclosure.
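Either policy is a one-liner; for example (the increments shown are only illustrative):

def next_expansion_fixed(step_s=30):
    # always grow the window by the same fixed amount (e.g. 30 s)
    return step_s

def next_expansion_decaying(expansions_done, steps_s=(120, 60, 30)):
    # 2 min for the first expansion, 1 min for the second, 30 s afterwards
    return steps_s[min(expansions_done, len(steps_s) - 1)]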
In an alternative method embodiment of the present disclosure, the step of dynamically adjusting the time window based on the classification result of the request data may include the following steps 206 to 208:
Step 206, acquiring the request data of the previous batch and classifying the acquired request data based on the index identifier, wherein the previous batch is adjacent to the current batch;
step 207, determining a change state parameter of the request data class of the current batch relative to the request data class of the previous batch;
step 208, dynamically adjusting the time window based on the change state parameter.
In the embodiment of the disclosure, the change state parameter represents how the request data classes change between two adjacent batches, and may be, for example, the number of request data classes or the number of request data in each class. By comparing the request data classes obtained in two adjacent batches, the change state parameter is determined and the time window is dynamically adjusted based on it: the time window can be further expanded when the change state parameter indicates that the classification effect of batching has improved, and reduced when it indicates that the classification effect has degraded, so that a suitable time window is obtained.
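One possible (purely illustrative) change state parameter is the average class size, compared between adjacent batches:

def avg_class_size(classes):
    # classes: mapping from index identifier to the requests in that class
    total = sum(len(v) for v in classes.values())
    return total / len(classes) if classes else 0.0

def change_state(current_classes, previous_classes):
    # positive value: the batching effect improved relative to the previous batch
    return avg_class_size(current_classes) - avg_class_size(previous_classes)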
In an alternative method embodiment of the present disclosure, step 208 may include the following steps S11 to S12.
Step S11, when the change state parameter meets the expansion condition, expanding the time window.
Step S12, when the change state parameter does not meet the expansion condition, keeping the time window unchanged.
In the embodiment of the disclosure, the expansion condition is a condition indicating that the classification effect of batching has improved, for example that the request data classes contain more request data, so that the reduction in the number of read-write operations is larger. Therefore, when the change state parameter meets the expansion condition, it indicates that expanding the time window between the two adjacent batches improved the batching effect and that there is further room for improvement, so the time window can be expanded again; conversely, when the change state parameter does not meet the expansion condition, it indicates that expanding the time window had no effect, a limited effect, or a negative effect on the batching effect and that there is no further room for improvement, so the existing time window can be kept unchanged.
In an alternative method embodiment of the present disclosure, step S11 may include the following steps S111 to S112.
Step S111, expanding the time window when the change state parameter meets the expansion condition and the time window is smaller than the maximum window.
Step S112, when the change state parameter meets the expansion condition and the time window is equal to the maximum window, keeping the time window unchanged.
In the embodiment of the disclosure, the maximum window may be determined based on the application memory: if the batching time exceeds the maximum window, the amount of request data held in the application memory may affect its normal operation. The maximum window is therefore related to the size of the application memory and the volume of request data, and can be adjusted according to the actual situation. During the dynamic adjustment, the time window should not exceed the maximum window, so as to avoid potential application memory problems. Accordingly, on the premise that the change state parameter meets the expansion condition, the time window is expanded when it is smaller than the maximum window, and the expansion amount can be adjusted so that the expanded window does not exceed the maximum window; when the change state parameter meets the expansion condition but the time window is already equal to the maximum window, the time window is kept unchanged, which preserves the better batching effect while avoiding potential application memory problems.
In an alternative method embodiment of the present disclosure, the step of dynamically adjusting the time window based on the classification result of the request data may include the following steps 209 to 210:
Step 209, acquiring a persistence state parameter of the time window when the time window is equal to the maximum window, wherein the persistence state parameter includes at least one of window duration and window batch count.
Step 210, reducing the time window to the minimum window when the persistence state parameter is greater than or equal to the persistence state threshold.
In the embodiment of the disclosure, when the time window has reached the maximum window, a persistence state parameter of the time window can be obtained, which represents how long batching has been carried out continuously at this window; it may include the window duration, i.e., the accumulated time over which batching has been performed, and/or the window batch count, i.e., the accumulated number of batches processed. The persistence state threshold is a regression threshold for the time window: when the persistence state parameter reaches it, the time window that has reached the maximum window is reduced directly back to the minimum window. Depending on the persistence state parameter used, the persistence state threshold may be a window duration threshold, a window batch count threshold, or the like. Reducing the time window in this way requires no step-by-step comparison of batching effects, which lowers the processing complexity of the system. After the time window has been reduced, the window adjustment described above brings it back to a suitable state within a few batches, mitigating the impact that the direct reduction might otherwise cause.
In an alternative method embodiment of the present disclosure, the step of dynamically adjusting the time window based on the classification result of the request data may include the following steps 211 to 212:
step 211, when the request data class does not meet the writing condition, the time window is enlarged.
Step 212, determining the request data of the current batch based on the expanded time window, and looping until the request data classes under the time window meet the writing condition.
In the embodiment of the disclosure, the aggregation effect of request data in a typical service is usually not pronounced. When the initial time window is small, batching may fail: no request data class containing two or more pieces of request data can be separated out, so compared with writing each piece of request data directly, the number of read-write operations is not reduced, or only to a limited extent. In that case the obtained request data classes are considered not to meet the writing condition, so the time window is expanded and the current batch is collected again, looping until the request data classes meet the writing condition.
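Steps 211 to 212 can be sketched as a loop that keeps widening the window for the same batch until the classes are worth writing (collect, classify, the writing condition and the cap at max_window_s are illustrative stand-ins; the cap is added here only so the sketch is guaranteed to terminate):

def batch_until_writable(collect, classify, window_s, max_window_s, step_s=30):
    while True:
        classes = classify(collect(window_s))       # re-collect the current batch
        multi = sum(1 for reqs in classes.values() if len(reqs) >= 2)
        if multi > 0 or window_s >= max_window_s:   # illustrative writing condition
            return classes, window_s
        window_s += step_s                          # expand the window and retry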
In the embodiment of the disclosure, after the time window has been dynamically adjusted, each subsequent batch can be collected based on the adjusted time window, and one or more of the above steps of dynamically adjusting the time window based on the classification result of the request data can be performed after each batch, so that the time window is kept in a suitable state while batching is performed repeatedly.
It should be noted that in the foregoing steps 211 to 212 the time window is adjusted within the same batch so that the request data classes meet the writing condition and an effective batch can be formed, whereas steps 205, 206 to 208 and 209 to 210 adjust the time window across multiple batches to improve the batching effect or keep it in a suitable state.
Fig. 3 is a schematic operation flowchart of a data stream processing method according to an embodiment of the disclosure, in which the time window is expanded by 30 s at a time, the minimum window is 30 s, and the window duration threshold is 30 min. As shown in Fig. 3, the flow includes steps 301 to 309.
Step 301, determining request data of the current batch based on a time window, wherein the time window is dynamically adjusted and obtained based on a classification result of the request data.
Step 302, determining index identifiers corresponding to each request data, wherein each index identifier points to one record.
Step 303, classifying the request data based on the index identifier to obtain a request data class.
Step 304, writing the request data class into the corresponding record according to the index identifier when the request data class meets the writing condition.
Step 305, determining whether the time window is equal to the maximum window; if not, proceeding to steps 306 to 307; if so, proceeding to step 309.
Step 306, acquiring the request data of the previous batch, which is adjacent to the current batch, and classifying the acquired request data based on the index identifier.
Step 307, determining the change state parameter of the request data class of the current batch relative to the request data class of the previous batch; if the change state parameter indicates that the classification effect of the request data classes has improved between the two adjacent batches, performing step 308.
Step 308, expanding the time window by 30 s if the time window is smaller than the maximum window, and returning to steps 301 to 304.
Step 309, reducing the time window to 30 s when the window duration reaches 30 min, and returning to steps 301 to 304.
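The flow of Fig. 3 can be put together as a sketch (collect, classify, write and the improvement test stand in for steps 301 to 307; the 30 s step, 30 s minimum and 30 min duration threshold are the values given above, while the 300 s maximum window is an assumed value, since the flow does not state one):

def run(collect, classify, write, improved, batches,
        window_s=30, max_window_s=300, min_window_s=30, duration_limit_s=1800):
    time_at_max = 0.0
    prev = None
    for _ in range(batches):
        classes = classify(collect(window_s))               # steps 301 to 303
        write(classes)                                      # step 304
        if window_s < max_window_s:                         # step 305: not at the maximum yet
            if prev is not None and improved(classes, prev):
                window_s = min(window_s + 30, max_window_s) # steps 306 to 308: expand by 30 s
            time_at_max = 0.0
        else:                                               # step 305: already at the maximum
            time_at_max += window_s
            if time_at_max >= duration_limit_s:             # step 309: 30 min at the maximum
                window_s, time_at_max = min_window_s, 0.0   # shrink back to the 30 s minimum
        prev = classes
    return window_s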
The data stream processing method provided by the disclosure determines the request data of the current batch based on a time window, the time window being dynamically adjusted based on the classification result of the request data; determines the index identifier corresponding to each piece of request data, each index identifier pointing to one record; and further classifies the request data based on the index identifiers to obtain request data classes, writing a request data class into the corresponding record according to the index identifier when it meets the writing condition. In this method, the request data are accumulated into batches based on a dynamic time window before writing and classified by the index identifiers corresponding to the records, so that multiple pieces of request data are written together; this reduces the number of data read-write operations and the IO consumption of data stream reading and writing, simplifies the database configuration requirements, avoids the need for large-scale cluster support, and reduces service operation cost. Moreover, the time window can be dynamically adjusted according to the classification result of the request data, so that its size automatically adapts to the batching effect, which meets the dynamic data read-write requirements of the service and flexibly improves storage performance.
Fig. 4 shows a data stream processing apparatus 400 according to an embodiment of the disclosure. As shown in fig. 4, the apparatus may include: the data acquisition module 401, configured to determine the request data of the current batch based on a time window, where the time window is dynamically adjusted based on the classification result of the request data; a data index module 402, configured to determine the index identifier corresponding to each piece of request data, where each index identifier points to a record; a data classification module 403, configured to classify the request data based on the index identifier to obtain a request data class; and a data writing module 404, configured to write the request data class into the corresponding record according to the index identifier if the request data class meets the writing condition.
In an alternative embodiment of the apparatus of the present disclosure, the apparatus further includes a window adjustment module for expanding the time window when the request data class of the current batch meets the writing condition.
In an optional apparatus embodiment of the disclosure, the window adjustment module is further configured to acquire the request data of the previous batch, which is adjacent to the current batch, and classify the acquired request data based on the index identifier; determine a change state parameter of the request data class of the current batch relative to the request data class of the previous batch; and dynamically adjust the time window based on the change state parameter.
In an optional device embodiment of the disclosure, the window adjustment module is specifically configured to expand the time window when the change state parameter meets an expansion condition; in case the change state parameter does not meet the expansion condition, the time window is kept unchanged.
In an optional embodiment of the present disclosure, a window adjustment module is specifically configured to expand a time window when the change state parameter meets an expansion condition and the time window is smaller than a maximum window; and under the condition that the change state parameter meets the expansion condition and the time window is equal to the maximum window, keeping the time window unchanged.
In an optional device embodiment of the disclosure, the window adjustment module is further configured to acquire a persistence state parameter of the time window when the time window is equal to the maximum window, wherein the persistence state parameter includes at least one of window duration and window batch count; and reduce the time window to the minimum window when the persistence state parameter is greater than or equal to the persistence state threshold.
In an optional device embodiment of the disclosure, the window adjustment module is further configured to expand the time window if the request data class does not meet the writing condition; and determining the request data of the current batch based on the expanded time window, and cycling until the request data class under the time window meets the writing condition.
The data stream processing apparatus provided by the disclosure determines the request data of the current batch based on a time window, the time window being dynamically adjusted based on the classification result of the request data; determines the index identifier corresponding to each piece of request data, each index identifier pointing to one record; and further classifies the request data based on the index identifiers to obtain request data classes, writing a request data class into the corresponding record according to the index identifier when it meets the writing condition. In this apparatus, the request data are accumulated into batches based on a dynamic time window before writing and classified by the index identifiers corresponding to the records, so that multiple pieces of request data are written together; this reduces the number of data read-write operations and the IO consumption of data stream reading and writing, simplifies the database configuration requirements, avoids the need for large-scale cluster support, and reduces service operation cost. Moreover, the time window can be dynamically adjusted according to the classification result of the request data, so that its size automatically adapts to the batching effect, which meets the dynamic data read-write requirements of the service and flexibly improves storage performance.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Furthermore, although the steps of the methods in the present disclosure are depicted in a particular order in the drawings, this does not require or imply that the steps must be performed in that particular order or that all illustrated steps be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a mobile terminal, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
Those skilled in the art will appreciate that the various aspects of the present disclosure may be implemented as a system, method, or program product. Accordingly, various aspects of the disclosure may be embodied in the following forms, namely: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit," a "module," or a "system."
An electronic device 500 according to such an embodiment of the present disclosure is described below with reference to fig. 5. The electronic device 500 shown in fig. 5 is merely an example and should not be construed to limit the functionality and scope of use of embodiments of the present disclosure in any way.
As shown in fig. 5, the electronic device 500 is embodied in the form of a general purpose computing device. The components of electronic device 500 may include, but are not limited to: the at least one processing unit 510, the at least one memory unit 520, and a bus 530 connecting the various system components, including the memory unit 520 and the processing unit 510.
Wherein the storage unit stores program code that is executable by the processing unit 510 such that the processing unit 510 performs steps according to various exemplary embodiments of the present disclosure described in the above section of the present description of the exemplary method.
The storage unit 520 may include readable media in the form of volatile storage units, such as Random Access Memory (RAM) 5201 and/or cache memory unit 5202, and may further include Read Only Memory (ROM) 5203.
The storage unit 520 may also include a program/utility 5204 having a set (at least one) of program modules 5205, such program modules 5205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 530 may be one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 500 may also communicate with one or more external devices (e.g., keyboard, pointing device, Bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 500, and/or any device (e.g., router, modem, etc.) that enables the electronic device 500 to communicate with one or more other computing devices. Such communication may be through the display unit 540 and an input/output (I/O) interface 550 connected to the display unit 540. Also, electronic device 500 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 560. As shown, network adapter 560 communicates with other modules of electronic device 500 over bus 530. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 500, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a terminal device, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, a computer-readable storage medium having stored thereon a program product capable of implementing the method described above in the present specification is also provided. In some possible implementations, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the disclosure as described in the "exemplary methods" section of this specification, when the program product is run on the terminal device.
In an embodiment of the present disclosure, a program product for implementing the above method is also provided. It may employ a portable compact disc read-only memory (CD-ROM), comprise program code, and run on a terminal device such as a personal computer. However, the program product of the present disclosure is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which readable program code is embodied. Such a propagated data signal may take any of a variety of forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium, other than a readable storage medium, that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, it may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or it may be connected to an external computing device (for example, via the Internet using an Internet service provider).
Furthermore, the above-described figures are only schematic illustrations of processes included in the method according to the exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.

Claims (10)

1. A method of data stream processing, the method comprising:
determining request data of a current batch based on a time window, wherein the time window is obtained through dynamic adjustment based on a classification result of the request data;
determining index identifiers corresponding to the request data, wherein each index identifier points to one record;
classifying the request data based on the index identifier to obtain a request data class;
and under the condition that the request data class accords with the writing condition, writing the request data class into the corresponding record according to the index identifier.
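By way of illustration only, the following is a minimal Python sketch of the flow recited in claim 1: collect one batch within a time window, resolve index identifiers, classify, and bulk-write each class when the writing condition holds. The request layout (a dict carrying a hypothetical "index_id" field), the queue-based source, the WRITE_THRESHOLD value, and the storage.write interface are assumptions made for the sketch and are not taken from the disclosure.

```python
import queue
import time
from collections import defaultdict

WRITE_THRESHOLD = 100  # hypothetical writing condition: minimum requests per class


def batch_by_window(source: "queue.Queue", window_seconds: float) -> list:
    """Collect the request data arriving within one time window (the current batch)."""
    deadline = time.monotonic() + window_seconds
    batch = []
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(source.get(timeout=remaining))
        except queue.Empty:
            break
    return batch


def classify_by_index(batch: list) -> dict:
    """Group request data by index identifier; each identifier points to one record."""
    classes = defaultdict(list)
    for request in batch:
        classes[request["index_id"]].append(request)
    return classes


def process_batch(source, window_seconds, storage) -> dict:
    """One pass: time window -> index identifiers -> request data classes -> conditional write."""
    classes = classify_by_index(batch_by_window(source, window_seconds))
    for index_id, request_class in classes.items():
        if len(request_class) >= WRITE_THRESHOLD:   # writing condition met
            storage.write(index_id, request_class)  # one bulk write into the record
    return classes
```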
2. The method of claim 1, wherein the step of dynamically adjusting the time window based on the classification result of the request data comprises:
expanding the time window under the condition that the request data class of the current batch accords with the writing condition.
3. The method of claim 1, wherein the step of dynamically adjusting the time window based on the classification result of the request data comprises:
acquiring request data of a previous batch, wherein the previous batch is adjacent to the current batch, and classifying the acquired request data based on the index identifiers;
determining a change state parameter of the request data class of the current batch relative to the request data class of the previous batch;
and dynamically adjusting the time window based on the change state parameter.
4. The method of claim 3, wherein dynamically adjusting the time window based on the change state parameter comprises:
expanding the time window under the condition that the change state parameter meets an expansion condition;
and under the condition that the change state parameter does not meet the expansion condition, keeping the time window unchanged.
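By way of illustration only, one possible reading of claims 3 and 4 is sketched below, assuming the change state parameter is the relative growth in total class size between the adjacent previous batch and the current batch; the growth threshold and the multiplicative expansion step are hypothetical choices for the sketch, not values from the disclosure.

```python
def change_state(prev_classes: dict, curr_classes: dict) -> float:
    """Hypothetical change state parameter: relative growth of the total class size
    in the current batch versus the adjacent previous batch."""
    prev_total = sum(len(v) for v in prev_classes.values()) or 1
    curr_total = sum(len(v) for v in curr_classes.values())
    return (curr_total - prev_total) / prev_total


def adjust_window(window: float, growth: float,
                  expansion_threshold: float = 0.2, step: float = 1.5) -> float:
    """Expand the time window when the expansion condition is met; otherwise keep it unchanged."""
    if growth >= expansion_threshold:  # expansion condition (assumed form)
        return window * step
    return window                      # condition not met: time window unchanged
```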
5. The method of claim 4, wherein expanding the time window if the change state parameter meets an expansion condition comprises:
expanding the time window under the condition that the change state parameter meets an expansion condition and the time window is smaller than a maximum window;
and under the condition that the change state parameter meets an expansion condition and the time window is equal to a maximum window, keeping the time window unchanged.
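A compact way to express the cap recited in claim 5, again as an illustrative sketch (the multiplicative expansion step is an assumption, not part of the claim):

```python
def expand_with_cap(window: float, max_window: float, step: float = 1.5) -> float:
    """Expand the time window but never beyond the maximum window."""
    if window >= max_window:
        return max_window  # already at the cap: keep the time window unchanged
    return min(window * step, max_window)
```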
6. The method of claim 1, wherein the step of dynamically adjusting the time window based on the classification result of the request data further comprises:
acquiring a duration state parameter of the time window under the condition that the time window is equal to a maximum window, wherein the duration state parameter comprises at least one of window duration time and window batch number;
and reducing the time window to a minimum window under the condition that the continuous state parameter is larger than or equal to a continuous state threshold value.
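Claim 6 shrinks a window that has stayed at the maximum for too long. The sketch below tracks both forms of the duration state parameter named in the claim (window duration time and window batch number) and resets to the minimum window once either crosses a threshold; the concrete thresholds are hypothetical.

```python
import time


class DurationState:
    """Track how long the time window has remained at the maximum window."""

    def __init__(self):
        self.pinned_since = None   # start of the period pinned at the maximum window
        self.pinned_batches = 0    # number of batches processed while pinned

    def update(self, window: float, max_window: float) -> None:
        """Refresh the duration state after each batch."""
        if window >= max_window:
            if self.pinned_since is None:
                self.pinned_since = time.monotonic()
            self.pinned_batches += 1
        else:
            self.pinned_since = None
            self.pinned_batches = 0

    def shrink_if_stale(self, window: float, min_window: float,
                        max_seconds: float = 60.0, max_batches: int = 10) -> float:
        """Reduce to the minimum window once either duration threshold is reached."""
        duration = 0.0 if self.pinned_since is None else time.monotonic() - self.pinned_since
        if duration >= max_seconds or self.pinned_batches >= max_batches:
            self.pinned_since, self.pinned_batches = None, 0
            return min_window
        return window
```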
7. The method of any one of claims 1 to 6, wherein, after classifying the request data based on the index identifier to obtain the request data class, the method further comprises:
expanding the time window under the condition that the request data class does not accord with the writing condition;
and determining the request data of the current batch based on the expanded time window, and repeating the above steps until the request data class under the time window accords with the writing condition.
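An illustrative loop for claim 7, reusing the helpers from the sketch after claim 1; the max_window bound added here is a safety guard of the sketch, not a requirement of the claim, which simply cycles until the writing condition is met.

```python
def batch_until_writable(source, window, max_window, storage, step: float = 1.5):
    """Expand the time window and re-batch until at least one request data class is writable."""
    while True:
        classes = classify_by_index(batch_by_window(source, window))
        writable = {k: v for k, v in classes.items() if len(v) >= WRITE_THRESHOLD}
        if writable or window >= max_window:        # stop once the writing condition holds
            for index_id, request_class in writable.items():
                storage.write(index_id, request_class)
            return window, classes
        window = min(window * step, max_window)     # expand and take the next batch
```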
8. A data stream processing apparatus, the apparatus comprising:
the data acquisition module is used for determining the request data of the current batch based on a time window, wherein the time window is obtained through dynamic adjustment based on the classification result of the request data;
the data index module is used for determining index identifiers corresponding to the request data, and each index identifier points to one record;
the data classification module is used for classifying the request data based on the index identifier to obtain a request data class;
and the data writing module is used for writing the request data class into the corresponding record according to the index identifier under the condition that the request data class meets the writing condition.
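The apparatus of claim 8 can be pictured as four cooperating components; the composition below is purely structural, and every method name on the injected modules is a hypothetical placeholder rather than an interface defined by the disclosure.

```python
class DataStreamProcessor:
    """Illustrative composition of the four modules recited in claim 8."""

    def __init__(self, acquisition, indexer, classifier, writer):
        self.acquisition = acquisition  # data acquisition module: batches by time window
        self.indexer = indexer          # data index module: resolves index identifiers
        self.classifier = classifier    # data classification module: groups by identifier
        self.writer = writer            # data writing module: conditional bulk write

    def run_once(self):
        batch = self.acquisition.current_batch()
        indexed = [(self.indexer.index_id(request), request) for request in batch]
        classes = self.classifier.classify(indexed)
        self.writer.write_if_ready(classes)
```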
9. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the data stream processing method of any one of claims 1 to 7 via execution of the executable instructions.
10. A computer readable medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements the data stream processing method according to any one of claims 1 to 7.
CN202310745597.1A 2023-06-21 2023-06-21 Data stream processing method, device, electronic equipment and computer readable medium Pending CN116775969A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310745597.1A CN116775969A (en) 2023-06-21 2023-06-21 Data stream processing method, device, electronic equipment and computer readable medium

Publications (1)

Publication Number Publication Date
CN116775969A true CN116775969A (en) 2023-09-19

Family

ID=87992607

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310745597.1A Pending CN116775969A (en) 2023-06-21 2023-06-21 Data stream processing method, device, electronic equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN116775969A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination