WO2017092582A1 - A data processing method and apparatus - Google Patents

A data processing method and apparatus

Info

Publication number
WO2017092582A1
Authority
WO
WIPO (PCT)
Prior art keywords
message
metadata
data processing
metadata message
processing framework
Prior art date
Application number
PCT/CN2016/106580
Other languages
English (en)
French (fr)
Inventor
冯粮城
李俊良
强琦
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司 filed Critical 阿里巴巴集团控股有限公司
Publication of WO2017092582A1 publication Critical patent/WO2017092582A1/zh

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Definitions

  • the present invention relates to the field of data processing, and in particular, to a data processing method and apparatus.
  • MapReduce is a programming model for parallel computing on large data sets.
  • The Map function is mainly used to process data according to the requirements.
  • The requirements can be keys, values, or key/value pairs.
  • The data processed by the Map function is transmitted to the Reduce function through an intermediate transmission process (Shuffle), and the Reduce function combines the processed data according to the requirements to obtain a data processing result that meets the requirements.
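The Map → Shuffle → Reduce pipeline described above can be sketched in Python; this is a generic word-count illustration of the programming model, not the patent's implementation:

```python
from collections import defaultdict

def map_fn(record):
    # Emit (key, value) pairs from one input record (word counting here).
    for word in record.split():
        yield (word, 1)

def shuffle(mapped_pairs):
    # Group values by key, as the intermediate Shuffle step does.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    # Merge the grouped values for one key into a final result.
    return key, sum(values)

records = ["a b a", "b c"]
pairs = [p for r in records for p in map_fn(r)]
result = dict(reduce_fn(k, vs) for k, vs in shuffle(pairs).items())
# result == {"a": 2, "b": 2, "c": 1}
```

In a real deployment each of these stages runs on many machines in parallel; here they run in one process only to show the data flow.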
  • MapReduce runs primarily on the Hadoop platform (a distributed system infrastructure) and is mainly used for offline batch computing of massive amounts of data. That is to say, MapReduce must pre-plan the data set to be processed, for example, which data is processed by which Map function, before the data set can be batch processed.
  • This results in MapReduce being mainly used to process offline data, and it is difficult for it to quickly process data generated in real time. It can be seen that traditional MapReduce cannot effectively solve the problem of rapidly processing the massive data generated in real time in the network.
  • the present invention provides a data processing method and apparatus, which implements rapid processing of massive data generated in real time in a network.
  • a data processing method is applied to a stream data processing framework, where the stream data processing framework includes a plurality of map modules and a plurality of reduce modules, and the method includes:
  • the streaming data processing framework obtains a service requirement, and the service requirement includes at least one key value;
  • The stream data processing framework sequentially fetches a plurality of metadata messages from a message queue for storing metadata messages, where the message queue is determined by the service requirement and the metadata messages correspond one-to-one with the service data.
  • the metadata message includes a storage location of the corresponding service data, and the metadata message is sequentially added to the message queue according to the generation time sequence of the corresponding service data;
  • the stream data processing framework distributes the captured metadata message to the plurality of map modules, so that the plurality of map modules acquire corresponding service data according to the storage location in the received metadata message; And the plurality of map modules respectively process the acquired service data according to the at least one key value;
  • The flow data processing framework distributes the service data processed by the plurality of map modules to the plurality of reduce modules, so that the plurality of reduce modules perform merge processing on the processed service data according to the at least one key value to obtain a merge processing result;
  • the stream data processing framework outputs the merge processing result.
  • it also includes:
  • The stream data processing framework records a processing status of the plurality of metadata messages in the stream data processing framework, the processing status including a success status and a failure status: if the service data corresponding to a metadata message is output as a merge processing result, the processing status of that metadata message is the success status; if the service data corresponding to a metadata message cannot be output as a merge processing result within a predetermined time, the processing status of that metadata message is the failure status;
  • The stream data processing framework updates the location information in a location register according to the processing status, the location information in the location register being the location at which the stream data processing framework next captures metadata messages in the message queue. If any of the plurality of metadata messages has the failure status, the location information in the location register is the location information of a first metadata message, where the first metadata message is the failure-status metadata message closest to the head of the message queue. If none of the plurality of metadata messages has the failure status, the location information in the location register is the location information of a second metadata message, where the second metadata message is the metadata message of the plurality closest to the tail of the message queue.
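The location-register update rule above can be sketched as follows. The `(position, status)` tuples are a hypothetical representation of one fetched batch, with positions increasing toward the queue tail; the source does not specify a data layout:

```python
def register_position(statuses):
    """Given [(queue_position, status), ...] for one fetched batch,
    return the position to record in the location register."""
    failed = [pos for pos, s in statuses if s == "failure"]
    if failed:
        # A failure exists: record the failed message closest to the
        # queue head, so processing can resume from it.
        return min(failed)
    # All succeeded: record the message closest to the queue tail.
    return max(pos for pos, _ in statuses)

print(register_position([(3, "success"), (4, "failure"), (5, "failure")]))  # 4
print(register_position([(6, "success"), (7, "success")]))                  # 7
```

Whether the next fetch starts at or just after the recorded position is an implementation detail the source leaves open.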
  • the method further includes:
  • The stream data processing framework detects whether the buffer still holds metadata messages, where the buffer saves metadata messages captured by the stream data processing framework from the message queue that have not yet been distributed to the map modules;
  • If so, the streaming data processing framework distributes the cached metadata messages from the buffer to the plurality of map modules.
  • the method further includes:
  • The stream data processing framework calculates the consumed capability in real time, the consumed capability being the processing capability consumed by the stream data processing framework in processing the metadata messages distributed to the plurality of map modules;
  • the flow data processing framework determines whether a difference between the total processing capability of the flow data processing framework and the consumed capability meets a preset threshold.
  • the stream data processing framework distributes the captured metadata message to the plurality of map modules.
  • the flow data processing framework distributes the service data processed by the multiple map modules to the multiple reduce modules, including:
  • The stream data processing framework partially merges the processed service data before distributing it to the reduce modules, so as to compress the data volume of the processed service data.
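A minimal sketch of this partial merge (a combiner, in MapReduce terms); the threshold condition is taken from the corresponding apparatus claim, and summing integer values is an illustrative assumption:

```python
from collections import defaultdict

def maybe_combine(pairs, preset_threshold):
    """Partially merge map output on the map side so fewer records
    cross the network, but only when the batch exceeds the preset
    transmission amount."""
    if len(pairs) <= preset_threshold:
        return pairs
    merged = defaultdict(int)
    for key, value in pairs:
        merged[key] += value
    return list(merged.items())

pairs = [("u1", 1), ("u1", 1), ("u2", 1), ("u1", 1)]
combined = maybe_combine(pairs, preset_threshold=3)
# 4 records shrink to 2 before being sent to the reduce modules
```

The merge must use the same operation the reduce modules apply, so combining early does not change the final result.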
  • a data processing apparatus is applied to a stream data processing framework, the stream data processing framework comprising a plurality of map modules and a plurality of reduce modules, the apparatus comprising:
  • An obtaining unit configured to obtain a business requirement, where the business requirement includes at least one key value
  • a fetching unit configured to sequentially capture a plurality of metadata messages from a message queue for storing metadata messages, where the message queue is determined by the service requirement and the metadata messages correspond one-to-one with the service data.
  • the metadata message includes a storage location of the corresponding service data, and the metadata message is sequentially added to the message queue according to the generation time sequence of the corresponding service data;
  • a first distribution unit configured to distribute the captured metadata message to the plurality of map modules, so that the plurality of map modules acquire corresponding service data according to the storage location in the received metadata message;
  • the plurality of map modules respectively process the acquired service data according to the at least one key value;
  • a second distribution unit configured to distribute the service data processed by the plurality of map modules to the plurality of reduce modules, so that the plurality of reduce modules perform merge processing on the processed service data according to the at least one key value to obtain a merge processing result;
  • it also includes:
  • a recording unit configured to record a processing status of the plurality of metadata messages in the stream data processing framework, the processing status including a success status and a failure status: if the service data corresponding to a metadata message is output as a merge processing result, the processing status of that metadata message is the success status; if the service data corresponding to a metadata message cannot be output as a merge processing result within a predetermined time, the processing status of that metadata message is the failure status;
  • an update unit configured to update the location information in the location register according to the processing status, the location information in the location register being the location at which the stream data processing framework next captures metadata messages in the message queue. If any of the plurality of metadata messages has the failure status, the location information in the location register is the location information of a first metadata message, where the first metadata message is the failure-status metadata message closest to the head of the message queue. If none of the plurality of metadata messages has the failure status, the location information in the location register is the location information of a second metadata message, where the second metadata message is the metadata message of the plurality closest to the tail of the message queue.
  • it also includes:
  • a detecting unit configured to, before the fetching unit is triggered, detect whether the buffer still holds metadata messages, where the buffer saves metadata messages captured by the stream data processing framework from the message queue that have not yet been distributed to the map modules;
  • the first distribution unit is further configured to distribute the cached metadata messages from the buffer to the plurality of map modules.
  • it also includes:
  • a calculating unit configured to calculate the consumed capability in real time before the first distribution unit is triggered, where the consumed capability is the processing capability consumed by the streaming data processing framework in processing the metadata messages distributed to the plurality of map modules;
  • a determining unit configured to determine whether the difference between the total processing capability of the stream data processing framework and the consumed capability meets a preset threshold; if so, the first distribution unit is triggered.
  • the second distribution unit is further configured to, if the processed data volume is greater than a preset transmission amount, partially merge the processed service data before distributing it to the reduce modules, so as to compress the data volume of the processed service data.
  • In the embodiments of the present invention, the traditional MapReduce function is implemented in the stream data processing framework through the map modules and reduce modules by using a stream data processing framework designed for real-time data processing. When a service requirement is received, the framework captures, from the message queue determined by that requirement, metadata messages whose size is much smaller than the corresponding service data, and distributes them to the map modules. Each map module obtains the corresponding service data through the storage location in the metadata message and processes it according to the service requirement. The stream data processing framework then distributes the service data processed by the plurality of map modules to the plurality of reduce modules, so that the plurality of reduce modules merge the processed service data according to the at least one key value to obtain a merge processing result, which is then output. By capturing the metadata messages in the message queue in real time and exploiting the real-time processing characteristics of the stream data processing framework, the data to be processed no longer needs to be pre-planned in the traditional way, so a stream data processing framework implementing the MapReduce function can quickly process massive amounts of data generated in real time in the network.
  • FIG. 1 is a flowchart of a data processing method according to an embodiment of the present invention;
  • FIG. 2 is a flowchart of a method for determining the distribution of metadata messages according to an embodiment of the present invention;
  • FIG. 3 is a structural diagram of a data processing apparatus according to an embodiment of the present invention.
  • MapReduce is mainly used for offline batch calculation of massive data. That is to say, before processing, the data set to be processed needs to be pre-planned; for data generated in real time, MapReduce must first collect data of a certain size and plan it before it can be processed. It can be seen that traditional MapReduce cannot effectively solve the problem of rapidly processing the massive data generated in real time in the network.
  • For example, a seller needs to obtain the correspondence between a user (u) and user collections (g1, g2, g3, ...) by performing data processing on data in the Internet.
  • The processing system needs to handle thousands of advertisement rules set by a large number of sellers every day, which requires that real-time data generated on the Internet be processed quickly, preferably while a user is still browsing a related webpage,
  • so that the seller can display the advertisements associated with this related page to the user.
  • MapReduce cannot effectively solve the problem that it needs to quickly process massive amounts of data generated in real time in the network.
  • an embodiment of the present invention provides a data processing method and apparatus.
  • In the embodiments, the traditional MapReduce function is implemented in a stream data processing framework through the map modules and reduce modules by using a stream data processing framework designed for real-time data processing.
  • When a service requirement is received, metadata messages whose size is much smaller than the corresponding service data are captured from the determined message queue and distributed to the map modules; each map module obtains the corresponding service data through the storage location in the metadata message and processes it according to the service requirement.
  • The stream data processing framework distributes the service data processed by the plurality of map modules to the plurality of reduce modules, so that the plurality of reduce modules merge the processed service data according to the at least one key value to obtain a merge processing result, which is then output.
  • By capturing the metadata messages in the message queue in real time and using the real-time processing characteristics of the stream data processing framework, there is no need to pre-plan the data to be processed in the traditional way before processing, and a stream data processing framework implementing the MapReduce function can quickly process massive amounts of data generated in real time in the network.
  • Although MapReduce, in most scenarios, can effectively process massive amounts of network data and obtain accurate processing results for business needs, it falls short in scenarios that require timely, fast processing of data generated in real time. Since MapReduce needs to pre-plan the data to be processed, and the overhead of using MapReduce for data processing is large, real-time generated data is generally first collected up to a certain scale (for example, one equivalent to MapReduce's processing amount, to save computing costs) and then pre-planned and processed. That is to say, although a processing result that meets the demand can be obtained, the timeliness is poor, or the processing result cannot be given in time.
  • After discovering this hidden technical defect, the inventor made up for the lack of timeliness of the traditional processing approach: the inventor selected a stream data processing framework designed for processing stream data, and combined the processing characteristics of the stream data processing framework with those of MapReduce, so that the framework unites the real-time data processing capability of stream processing with MapReduce's ease of processing massive data. Based on the stream data processing framework, the MapReduce function is implemented on the framework's modules, thus realizing a real-time MapReduce computation method.
  • the stream data processing framework described in the embodiments of the present invention can be understood as a computer program for processing stream data or a computer service for processing stream data.
  • the streaming data processing framework can be installed and deployed in one server, in multiple servers, or in a server cluster.
  • the server implements the processing of the stream data by running the stream data processing framework deployed in itself.
  • the present invention does not limit the specific type of the stream data processing framework, and may be, for example, Spark Streaming (a computing framework for real-time computing) or Storm (a processing framework for processing big data in real time).
  • Storm can be used in "stream processing" to process data in real time.
  • Storm's modules mainly include Spout (similar to data source) modules and Bolt (data processing module) modules.
  • The Spout module can be used to capture metadata messages from the message queue and to distribute the captured metadata messages; the function of the map function is implemented on one part of the Bolt modules, and the function of the reduce function on another part.
  • the Bolt module that implements the function of the map function can be identified as a map-bolt module
  • the Bolt module that implements the function of the reduce function can be identified as a reduce-bolt module.
  • the metadata message distributed by the Spout module is received by the map-bolt module, and the service data to be processed is obtained thereby. Between the map-bolt module and the reduce-bolt module, the processed data can still be sent from the map-bolt module to the reduce-bolt module using the Shuffle method.
  • the reduce-bolt module implements the merge processing of the processed data. Storm outputs the result of the merge process, which enables rapid processing of massive amounts of data generated in real time in the network.
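The Spout → map-bolt → reduce-bolt flow can be simulated with plain Python classes. This is only a sketch of the data flow, not Storm's actual (Java) API; the word-count logic, the `storage` dictionary, and the message layout are illustrative assumptions:

```python
class Spout:
    """Feeds metadata messages into the topology (stands in for Storm's Spout)."""
    def __init__(self, message_queue):
        self.queue = list(message_queue)
    def next_tuple(self):
        return self.queue.pop(0) if self.queue else None

class MapBolt:
    """map-bolt: fetches the business data named by a metadata message, then maps it."""
    def execute(self, meta_message, storage):
        data = storage[meta_message["location"]]
        return [(word, 1) for word in data.split()]

class ReduceBolt:
    """reduce-bolt: merges mapped pairs by key."""
    def __init__(self):
        self.totals = {}
    def execute(self, pairs):
        for key, value in pairs:
            self.totals[key] = self.totals.get(key, 0) + value

# Hypothetical storage and metadata messages.
storage = {"loc1": "a b", "loc2": "a"}
spout = Spout([{"location": "loc1"}, {"location": "loc2"}])
map_bolt, reduce_bolt = MapBolt(), ReduceBolt()
message = spout.next_tuple()
while message is not None:
    reduce_bolt.execute(map_bolt.execute(message, storage))
    message = spout.next_tuple()
# reduce_bolt.totals == {"a": 2, "b": 1}
```

Note the key design point the patent relies on: the Spout moves only small metadata messages; the bulky business data is pulled by the map-bolt from storage on demand.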
  • the result of the merge processing output by the stream data processing framework may be further processed according to different scene requirements, for example, may be maintained in a persistent memory or may be added to a data queue or the like.
  • FIG. 1 is a flowchart of a data processing method according to an embodiment of the present invention, which is applied to a stream data processing framework, where the stream data processing framework includes a plurality of map modules and a plurality of reduce modules, and the method includes:
  • the stream data processing framework obtains a service requirement, where the service requirement includes at least one key value.
  • the business requirements here can be understood as the seller's advertising requirements
  • the key values can include key, value or key/value pair.
  • the key value may be a specific crowd rule, a specific crowd rule set, or a specific crowd rule set and a correspondence between users.
  • The flow data processing framework sequentially captures a plurality of metadata messages from a message queue for storing metadata messages, where the message queue is determined by the service requirement and the metadata messages correspond one-to-one with the service data.
  • the metadata message includes a storage location of the corresponding service data, and the metadata message is sequentially added to the message queue according to the generation time sequence of the corresponding service data.
  • the message queue here can be determined according to business requirements, and different service requirements can determine different message queues.
  • The message queue mainly contains metadata messages: when a piece of service data is generated in the network system, a metadata message for identifying that service data is generated correspondingly, and the data volume of the metadata message is generally much smaller than that of the corresponding service data.
  • The metadata message may include the storage location of the corresponding service data, for example, which location of a server the service data is stored in, or which location of a distributed file system.
  • the metadata message may further include a data type corresponding to the service data, and is used to identify a data category, a data feature, and the like of the data corresponding to the service data.
  • the stream data processing framework may sequentially fetch according to the order of the metadata messages in the message queue.
  • The order of the message queue is generally such that newly enqueued metadata messages are arranged at the tail of the queue.
  • the number of metadata messages captured by the stream data processing framework at a time may be preset, and may generally be set to a value greater than the processing capability of the stream data processing framework.
  • the stream data processing framework may divide the captured metadata message into multiple batches to the map module, thereby reducing the frequency of capturing the metadata message by the stream data processing framework to the message queue. .
  • To this end, an embodiment of the present invention provides a caching idea: metadata messages that have been captured but not yet distributed can be cached in a buffer.
  • a process of determining whether there is a metadata message in the cache may be added.
  • Before the stream data processing framework sequentially retrieves metadata messages from the message queue, the method also includes:
  • The stream data processing framework detects whether the buffer still holds metadata messages, where the buffer saves metadata messages that were captured by the stream data processing framework from the message queue and have not yet been distributed to the map modules;
  • If so, the streaming data processing framework distributes the cached metadata messages from the buffer to the plurality of map modules.
  • That is, when the stream data processing framework distributes metadata messages, the metadata messages stored in the buffer are distributed preferentially, so that the service data can be processed as far as possible in the order of the message queue, ensuring the sequence.
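The buffer-first policy can be sketched as follows; the message names and batch size are hypothetical, and the lists stand in for the buffer and the message queue:

```python
def fetch_batch(buffer, message_queue, batch_size):
    """Build the next batch to distribute, draining previously fetched
    but undistributed metadata messages from the buffer first, so that
    message-queue order is preserved as far as possible."""
    batch = []
    while buffer and len(batch) < batch_size:
        batch.append(buffer.pop(0))
    while message_queue and len(batch) < batch_size:
        batch.append(message_queue.pop(0))
    return batch

buf, queue = ["m1"], ["m2", "m3"]
batch = fetch_batch(buf, queue, batch_size=2)
# batch == ["m1", "m2"]; "m3" stays in the queue for the next round
```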
  • To this end, an embodiment of the present invention provides a method for determining the next fetch location by recording the processing status of metadata messages.
  • The stream data processing framework records a processing status of the plurality of metadata messages in the stream data processing framework, the processing status including a success status and a failure status: if the service data corresponding to a metadata message is output as a merge processing result, the processing status of that metadata message is the success status; if the service data corresponding to a metadata message cannot be output as a merge processing result within a predetermined time, the processing status of that metadata message is the failure status.
  • The stream data processing framework updates the location information in the location register according to the processing status, the location information in the location register being the location at which the stream data processing framework next captures metadata messages in the message queue. If any of the plurality of metadata messages has the failure status, the location information in the location register is the location information of a first metadata message, where the first metadata message is the failure-status metadata message closest to the head of the message queue. If none of the plurality of metadata messages has the failure status, the location information in the location register is the location information of a second metadata message, where the second metadata message is the metadata message of the plurality closest to the tail of the message queue.
  • the location register can be a persistent storage unit, for example, a similar storage service can be provided using Zookeeper (a distributed, open source application coordination service).
  • For a metadata message, from the beginning of its distribution to the map module (for example, S103) until the service data corresponding to it is merge-processed and output by the stream data processing framework (S105): if any step is unsuccessful, the processing status of the metadata message can be understood to be the failure status.
  • For example, the map module that receives the metadata message may fail to obtain the corresponding service data according to the storage location carried in the metadata message, the map module may fail to process the service data corresponding to the metadata message, or the stream data processing framework may fail to distribute the service data processed by the map module (the service data corresponding to the metadata message) to the reduce module, and so on.
  • The failure status may also cover processing timeouts, for example, a processing step in the above process that does not respond for a long time.
  • The stream data processing framework can confirm whether the processing status of a metadata message is the success status or the failure status by receiving an acknowledgement message (abbreviation: ACK) or a failure message (English: Fail) for that metadata message: receiving an acknowledgement message indicates success, and receiving a failure message indicates failure.
  • If a metadata message remains in the failure status after multiple attempts, the metadata message can be discarded to improve processing efficiency; no further processing is performed on it, and the location information of other metadata messages continues to be updated in the location register.
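The bounded-retry-then-discard behavior can be sketched like this; `attempt_fn` and the attempt limit are hypothetical stand-ins for the framework's ACK/Fail signaling:

```python
def process_with_retry(message, attempt_fn, max_attempts=3):
    """Retry a metadata message a bounded number of times; if it still
    fails, discard it so one bad message cannot stall the pipeline.
    attempt_fn returns True to model an ACK and False to model a Fail."""
    for _ in range(max_attempts):
        if attempt_fn(message):
            return "success"
    return "discarded"

result = process_with_retry("m2", lambda m: False)
# result == "discarded" after 3 failed attempts
```

Discarding is a deliberate trade of completeness for throughput; the location register then simply keeps advancing past the bad message.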
  • the stream data processing framework distributes the captured metadata message to the plurality of map modules, so that the plurality of map modules acquire corresponding service data according to the storage location in the received metadata message;
  • the plurality of map modules respectively process the acquired service data according to the at least one key value.
  • In a specific distribution, the stream data processing framework may evenly distribute a batch of metadata messages to the map modules, for example, a batch of 10 metadata messages.
  • If the stream data processing framework works with only 5 map modules, each map module can be allotted 2 metadata messages.
  • The benefit of even distribution is that each map module can complete its processing task in a similar amount of time, ready to receive the stream data processing framework's next distribution of metadata messages.
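The even distribution in the 10-messages-to-5-modules example can be sketched as a round-robin deal; the source does not specify the assignment scheme, so round-robin is an assumption:

```python
def distribute_evenly(messages, num_map_modules):
    """Deal a batch of metadata messages round-robin across the map
    modules so each module receives a near-equal share."""
    shares = [[] for _ in range(num_map_modules)]
    for i, message in enumerate(messages):
        shares[i % num_map_modules].append(message)
    return shares

shares = distribute_evenly(list(range(10)), 5)
# each of the 5 map modules receives 2 of the 10 messages
```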
  • The service data can be stored in a distributed file system, and the map module can obtain the corresponding service data from the corresponding storage location in the distributed file system according to the storage location in the obtained metadata message.
  • the map module determines the data relationship related to the key value from the business data according to the key value in the business requirement.
  • For example, if the key value of the business requirement is the correspondence between a user (u1) and a user set (g1, g2, g3, ...), then through processing the map module can determine u1->g1, u1->g2, ... and other data relationships related to the key value from the service data.
  • The stream data processing framework distributes the service data processed by the plurality of map modules to the plurality of reduce modules, so that the reduce modules perform merge processing on the processed service data according to the at least one key value to obtain a merge processing result.
  • the flow data processing framework distributes the service data processed by the plurality of map modules to the plurality of reduce modules, similar to the Shuffle process in MapReduce.
  • the processed service data with the same key value can be sent to the same reduce module for processing.
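Routing same-key data to the same reduce module is typically done by hashing the key; the source does not name the routing function, so hash partitioning is an assumption, and the user/group pairs are illustrative:

```python
def partition_by_key(processed_pairs, num_reduce_modules):
    """Route (key, value) pairs so that equal keys always reach the
    same reduce module, mirroring the Shuffle step."""
    partitions = [[] for _ in range(num_reduce_modules)]
    for key, value in processed_pairs:
        partitions[hash(key) % num_reduce_modules].append((key, value))
    return partitions

pairs = [("u1", "g1"), ("u2", "g1"), ("u1", "g2")]
parts = partition_by_key(pairs, 3)
# both "u1" pairs land in the same partition, so one reduce module
# sees all data for that key
```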
  • the stream data processing framework distributes the service data processed by the plurality of map modules to the plurality of reduce modules, and further includes:
  • If the processed data volume is greater than a preset transmission amount, the stream data processing framework partially merges the processed service data before distributing it to the reduce modules, so as to compress the data volume of the processed service data.
  • The merge processing result of the multiple reduce modules is equivalent to the processing result for the service requirement; the stream data processing framework may output the merge processing result, and the system may then perform subsequent processing on the output merge processing result according to different scenarios and service requirements, for example, storing the merge processing result in a persistent memory.
  • it can be seen that, by using a stream data processing framework aimed at real-time data processing, the functions of traditional MapReduce are implemented in the stream data processing framework through the map modules and reduce modules. When a service requirement is received, metadata messages, whose size is far smaller than that of the service data, are grabbed from the determined message queue and distributed to the map modules; each map module obtains the corresponding service data through the storage location in the metadata message and processes it according to the service requirement. The stream data processing framework distributes the service data processed by the plurality of map modules to the plurality of reduce modules, so that the plurality of reduce modules merge the processed service data according to the at least one key value to obtain a merge processing result, which is then output. By grabbing the metadata messages in the message queue in real time and exploiting the real-time processing characteristics of the stream data processing framework, the pre-planning of the data to be processed required by the traditional approach is no longer needed, and the stream data processing framework implementing the MapReduce functions can quickly process the massive data generated in real time in the network.
  • during processing, the stream data processing framework may determine the timing of distributing metadata messages to the plurality of map modules by analyzing its remaining processing capacity.
  • optionally, FIG. 2 is a method flowchart of a metadata message distribution determination method provided by an embodiment of the present invention; before the stream data processing framework distributes the grabbed metadata messages to the plurality of map modules, the method further includes:
  • S201: The stream data processing framework calculates the consumed capacity in real time, where the consumed capacity is the processing capacity consumed by the stream data processing framework to process the metadata messages distributed to the plurality of map modules;
  • S202: The stream data processing framework determines whether the difference between its total processing capacity and the consumed capacity meets a preset threshold; if yes, S203 is executed.
  • S203: The stream data processing framework distributes the grabbed metadata messages to the plurality of map modules.
  • the distribution rate of the stream data processing framework needs to be controlled carefully: if it is too fast, the number of messages entering the subsequent map modules may surge in a short period of time, so that the system resources consumed by the map modules and reduce modules to process the corresponding service data exceed the capacity of the stream data processing framework or of the system in which it resides; if it is too slow, too many metadata messages accumulate in the message queue and the real-time performance of processing decreases.
  • to solve this problem, the stream data processing framework calculates the consumed capacity in real time.
  • the consumed capacity can be understood as the capacity currently consumed by the stream data processing framework to perform data processing, and the total processing capacity of the stream data processing framework can be understood as the capacity consumed when the framework operates at its maximum processing capability.
  • the consumed capacity and the total processing capacity can be measured in terms of resources such as memory, CPU, and network cards.
  • if the difference does not meet the preset threshold, that is, the remaining processing capacity of the stream data processing framework is currently insufficient for the metadata messages to be distributed, the framework may wait. Once the stream data processing framework finishes processing the current batch of metadata messages, it releases the corresponding processing capacity; after the difference meets the preset threshold, the stream data processing framework may again distribute the grabbed metadata messages to the plurality of map modules.
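The gating rule above can be sketched in a few lines. This is an illustrative reading of the threshold check (the capacity unit is an assumption; the text only requires that consumed and total capacity be comparable quantities):

```python
def may_distribute(total_capacity: float,
                   consumed_capacity: float,
                   preset_threshold: float) -> bool:
    """Distribute the next batch only when the remaining capacity
    (total minus consumed) meets the preset threshold; otherwise the
    framework waits for capacity to be released."""
    remaining = total_capacity - consumed_capacity
    return remaining >= preset_threshold

# With 100 units total and 60 consumed, 40 remain: a batch needing
# at most 30 units may be distributed; if 80 are consumed, it waits.
ok = may_distribute(100, 60, 30)
blocked = may_distribute(100, 80, 30)
```

In practice the caller would loop: wait while `may_distribute` is false, distribute once it becomes true after the current batch releases its capacity.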
  • FIG. 3 is a structural diagram of a data processing apparatus according to an embodiment of the present invention, applied to a stream data processing framework, where the stream data processing framework includes a plurality of map modules and a plurality of reduce modules, and the apparatus includes:
  • the obtaining unit 301 is configured to obtain a service requirement, where the service requirement includes at least one key value.
  • the grabbing unit 302 is configured to sequentially grab a plurality of metadata messages from a message queue used for storing metadata messages, where the message queue is determined by the service requirement, and the metadata messages are in one-to-one correspondence with service messages.
  • each metadata message includes the storage location of the corresponding service data, and the metadata messages are added to the message queue in order of the generation time of the corresponding service data.
  • in the case where the stream data processing framework grabs too many metadata messages at one time to distribute them all at once, an embodiment of the present invention provides a caching idea: the metadata messages that have been grabbed but not yet distributed can be cached in a buffer.
  • optionally, the apparatus further includes:
  • a detecting unit configured to detect, before the grabbing unit 302 is triggered, whether the buffer still holds metadata messages, where the buffer holds the metadata messages that the stream data processing framework has grabbed from the message queue and that have not yet been distributed to the map modules.
  • if yes, the first distribution unit 303 is triggered; if no, the grabbing unit 302 is triggered.
  • the first distribution unit 303 is further configured to distribute the metadata messages stored in the buffer to the plurality of map modules.
  • that is, when the stream data processing framework distributes metadata messages, the metadata messages stored in the buffer are distributed preferentially. In this way, the service data can be processed, as far as possible, in the order of the message queue, ensuring the time sequence.
  • to guarantee that no service data is missed, an embodiment of the present invention provides a way of determining the next grab position by recording the processing status of metadata messages.
  • optionally, the apparatus further includes:
  • a recording unit configured to record the processing status of the plurality of metadata messages in the stream data processing framework, where the processing status includes a success status and a failure status; if the service data corresponding to a metadata message is output as a merge processing result, the processing status of that metadata message is the success status; if the service data corresponding to a metadata message fails to be output as a merge processing result within a predetermined time, the processing status of that metadata message is the failure status;
  • an updating unit configured to update the location information in a location register according to the processing status, where the location information in the location register is the location, in the message queue, at which the stream data processing framework will grab metadata messages next time; if there is a metadata message with the failure status among the plurality of metadata messages, the location information in the location register is the location information of a first metadata message, the first metadata message being the failure-status metadata message that is closer to the queue head of the message queue; if there is no metadata message with the failure status among the plurality of metadata messages, the location information in the location register is the location information of a second metadata message, the second metadata message being the metadata message, among the plurality of metadata messages, that is closer to the queue tail of the message queue.
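The location-register update rule above reduces to a small computation. A sketch under an assumed representation (each grabbed message described by its queue position and a success flag, positions increasing from head to tail):

```python
def next_fetch_position(statuses):
    """Given [(queue_position, succeeded), ...] for one grabbed batch,
    return the position to record in the location register: the failed
    message closest to the queue head if any failed, otherwise the
    message closest to the queue tail."""
    failed = [pos for pos, ok in statuses if not ok]
    if failed:
        return min(failed)                      # re-grab from first failure
    return max(pos for pos, _ in statuses)      # batch done, advance to tail

# Messages at positions 4 and 6 failed: next grab restarts at 4,
# so no service data between 4 and 6 is skipped.
restart = next_fetch_position([(3, True), (4, False), (5, True), (6, False)])
advance = next_fetch_position([(3, True), (4, True)])
```

Rewinding to the earliest failure means some successful messages may be grabbed again, which trades a little duplicate work for the guarantee that no service data is missed.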
  • a first distribution unit 303 configured to distribute the grabbed metadata messages to the plurality of map modules, so that the plurality of map modules acquire the corresponding service data according to the storage locations in the received metadata messages, and process the acquired service data according to the at least one key value.
  • a second distribution unit 304 configured to distribute the service data processed by the plurality of map modules to the plurality of reduce modules, so that the plurality of reduce modules merge the processed service data according to the at least one key value to obtain a merge processing result.
  • the second distribution unit is further configured to, if the data volume of a piece of processed service data is greater than a preset transmission amount, partially merge that processed service data before distributing it to the reduce module, so as to compress its data volume.
  • the output unit 305 is configured to output the merge processing result.
  • it can be seen that, by using a stream data processing framework aimed at real-time data processing, the functions of traditional MapReduce are implemented in the stream data processing framework through the map modules and reduce modules. When a service requirement is received, metadata messages, whose size is far smaller than that of the service data, are grabbed from the determined message queue and distributed to the map modules; each map module obtains the corresponding service data through the storage location in the metadata message and processes it according to the service requirement. The stream data processing framework distributes the service data processed by the plurality of map modules to the plurality of reduce modules, so that the plurality of reduce modules merge the processed service data according to the at least one key value to obtain a merge processing result, which is then output. By grabbing the metadata messages in the message queue in real time and exploiting the real-time processing characteristics of the stream data processing framework, the pre-planning of the data to be processed required by the traditional approach is no longer needed, and the stream data processing framework implementing the MapReduce functions can quickly process the massive data generated in real time in the network.
  • during processing, the stream data processing framework may determine the timing of distributing metadata messages to the plurality of map modules by analyzing its remaining processing capacity.
  • optionally, the apparatus further includes:
  • a calculating unit configured to calculate the consumed capacity in real time before the first distribution unit 303 is triggered, where the consumed capacity is the processing capacity consumed by the stream data processing framework to process the metadata messages distributed to the plurality of map modules;
  • a determining unit configured to determine whether the difference between the total processing capacity of the stream data processing framework and the consumed capacity meets a preset threshold;
  • if yes, the first distribution unit is triggered.
  • if the difference does not meet the preset threshold, that is, the remaining processing capacity of the stream data processing framework is currently insufficient for the metadata messages to be distributed, the framework may wait. Once the stream data processing framework finishes processing the current batch of metadata messages, it releases the corresponding processing capacity; after the difference meets the preset threshold, the stream data processing framework may again distribute the grabbed metadata messages to the plurality of map modules.


Abstract

Embodiments of the present invention disclose a data processing method and apparatus, applied to a stream data processing framework. The stream data processing framework obtains a service requirement, sequentially grabs a plurality of metadata messages from a message queue used for storing metadata messages, and distributes the grabbed metadata messages to a plurality of map modules, so that the map modules obtain the corresponding service data according to the storage locations in the received metadata messages and process the obtained service data accordingly. The service data processed by the plurality of map modules is distributed to a plurality of reduce modules, so that the reduce modules merge the processed service data, and the merge processing result is output. By grabbing the metadata messages in the message queue in real time and exploiting the real-time processing characteristics of the stream data processing framework, a stream data processing framework implementing the MapReduce functions is used to process the massive data generated in real time in the network, with no pre-planning needed before processing.

Description

Data processing method and apparatus
This application claims priority to Chinese Patent Application No. 201510867351.7, filed on December 1, 2015 and entitled "Data processing method and apparatus", the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
The present invention relates to the field of data processing, and in particular to a data processing method and apparatus.
BACKGROUND
With the development of the Internet, the number of network users has grown and users rely more heavily on the network. Along with users' operations on the network, such as browsing news and shopping, massive data is generated on the network in real time. Such massive data is of great value to Internet enterprises; for example, by analyzing the data it can be determined which types of network resources a user is more interested in, or which users have recently browsed a given network resource. It can be seen that how to quickly and effectively process the massive data generated in real time in the network is a problem that urgently needs to be solved.
The traditional way of processing massive data is to use MapReduce, a programming model for parallel computation on large-scale data sets. MapReduce includes a map function and a reduce function. The map function is mainly used to process data according to a requirement, where the requirement may take the form of a key, a value, or a key/value pair. The data processed by the map function is transmitted to the reduce function through an intermediate transfer process (shuffle), and the reduce function merges the processed data according to the requirement to obtain a data processing result that satisfies the requirement.
However, MapReduce mainly runs on the Hadoop platform (a distributed system infrastructure) and is mainly used for offline batch computation on massive data. That is, MapReduce can batch-process a data set only after pre-planning it, for example after determining which data is processed by which map function. Because pre-planning is required before processing, MapReduce is mainly used for offline data and can hardly process data generated in real time quickly. Therefore, traditional MapReduce cannot effectively solve the current problem of quickly processing the massive data generated in real time in the network.
SUMMARY
To solve the above technical problem, the present invention provides a data processing method and apparatus that achieve fast processing of massive data generated in real time in a network.
The embodiments of the present invention disclose the following technical solutions:
A data processing method, applied to a stream data processing framework, the stream data processing framework including a plurality of map modules and a plurality of reduce modules, the method including:
the stream data processing framework obtains a service requirement, the service requirement including at least one key value;
the stream data processing framework sequentially grabs a plurality of metadata messages from a message queue used for storing metadata messages, where the message queue is determined by the service requirement, the metadata messages are in one-to-one correspondence with service messages, each metadata message includes the storage location of the corresponding service data, and the metadata messages are added to the message queue in order of the generation time of the corresponding service data;
the stream data processing framework distributes the grabbed metadata messages to the plurality of map modules, so that the map modules obtain the corresponding service data according to the storage locations in the received metadata messages, and process the obtained service data according to the at least one key value;
the stream data processing framework distributes the service data processed by the plurality of map modules to the plurality of reduce modules, so that the reduce modules merge the processed service data according to the at least one key value to obtain a merge processing result;
the stream data processing framework outputs the merge processing result.
Optionally, the method further includes:
the stream data processing framework records the processing status of the plurality of metadata messages in the stream data processing framework, where the processing status includes a success status and a failure status; if the service data corresponding to a metadata message is output as a merge processing result, the processing status of that metadata message is the success status; if the service data corresponding to a metadata message fails to be output as a merge processing result within a predetermined time, the processing status of that metadata message is the failure status;
the stream data processing framework updates the location information in a location register according to the processing status, where the location information in the location register is the location, in the message queue, at which the stream data processing framework will grab metadata messages next time; if there is a metadata message with the failure status among the plurality of metadata messages, the location information in the location register is the location information of a first metadata message, the first metadata message being the failure-status metadata message that is closer to the queue head of the message queue; if there is no metadata message with the failure status among the plurality of metadata messages, the location information in the location register is the location information of a second metadata message, the second metadata message being the metadata message, among the plurality of metadata messages, that is closer to the queue tail of the message queue.
Optionally, before the stream data processing framework again sequentially grabs metadata messages from the message queue, the method further includes:
the stream data processing framework detects whether the buffer still holds metadata messages, where the buffer holds the metadata messages that the stream data processing framework has grabbed from the message queue and that have not yet been distributed to the map modules;
if yes, the stream data processing framework distributes the metadata messages from the buffer to the plurality of map modules;
if no, the stream data processing framework again sequentially grabs metadata messages from the message queue.
Optionally, before the stream data processing framework distributes the grabbed metadata messages to the plurality of map modules, the method further includes:
the stream data processing framework calculates the consumed capacity in real time, where the consumed capacity is the processing capacity consumed by the stream data processing framework to process the metadata messages distributed to the plurality of map modules;
the stream data processing framework determines whether the difference between the total processing capacity of the stream data processing framework and the consumed capacity meets a preset threshold,
and if yes, the stream data processing framework distributes the grabbed metadata messages to the plurality of map modules.
Optionally, the stream data processing framework distributing the service data processed by the plurality of map modules to the plurality of reduce modules includes:
if the data volume of a piece of processed service data is greater than a preset transmission amount, the stream data processing framework partially merges that processed service data before distributing it to the reduce module, so as to compress its data volume.
A data processing apparatus, applied to a stream data processing framework, the stream data processing framework including a plurality of map modules and a plurality of reduce modules, the apparatus including:
an obtaining unit configured to obtain a service requirement, the service requirement including at least one key value;
a grabbing unit configured to sequentially grab a plurality of metadata messages from a message queue used for storing metadata messages, where the message queue is determined by the service requirement, the metadata messages are in one-to-one correspondence with service messages, each metadata message includes the storage location of the corresponding service data, and the metadata messages are added to the message queue in order of the generation time of the corresponding service data;
a first distribution unit configured to distribute the grabbed metadata messages to the plurality of map modules, so that the map modules obtain the corresponding service data according to the storage locations in the received metadata messages, and process the obtained service data according to the at least one key value;
a second distribution unit configured to distribute the service data processed by the plurality of map modules to the plurality of reduce modules, so that the reduce modules merge the processed service data according to the at least one key value to obtain a merge processing result;
an output unit configured to output the merge processing result.
Optionally, the apparatus further includes:
a recording unit configured to record the processing status of the plurality of metadata messages in the stream data processing framework, where the processing status includes a success status and a failure status; if the service data corresponding to a metadata message is output as a merge processing result, the processing status of that metadata message is the success status; if the service data corresponding to a metadata message fails to be output as a merge processing result within a predetermined time, the processing status of that metadata message is the failure status;
an updating unit configured to update the location information in a location register according to the processing status, where the location information in the location register is the location, in the message queue, at which the stream data processing framework will grab metadata messages next time; if there is a metadata message with the failure status among the plurality of metadata messages, the location information in the location register is the location information of a first metadata message, the first metadata message being the failure-status metadata message that is closer to the queue head of the message queue; if there is no metadata message with the failure status among the plurality of metadata messages, the location information in the location register is the location information of a second metadata message, the second metadata message being the metadata message, among the plurality of metadata messages, that is closer to the queue tail of the message queue.
Optionally, the apparatus further includes:
a detecting unit configured to detect, before the grabbing unit is triggered, whether the buffer still holds metadata messages, where the buffer holds the metadata messages that the stream data processing framework has grabbed from the message queue and that have not yet been distributed to the map modules;
if yes, the first distribution unit is triggered; if no, the grabbing unit is triggered;
the first distribution unit is further configured to distribute the metadata messages from the buffer to the plurality of map modules.
Optionally, the apparatus further includes:
a calculating unit configured to calculate the consumed capacity in real time before the first distribution unit is triggered, where the consumed capacity is the processing capacity consumed by the stream data processing framework to process the metadata messages distributed to the plurality of map modules;
a determining unit configured to determine whether the difference between the total processing capacity of the stream data processing framework and the consumed capacity meets a preset threshold,
and if yes, the first distribution unit is triggered.
Optionally, the second distribution unit is further configured to, if the data volume of a piece of processed service data is greater than a preset transmission amount, partially merge that processed service data before distributing it to the reduce module, so as to compress its data volume.
It can be seen from the above technical solutions that, by using a stream data processing framework aimed at real-time data processing, the functions of traditional MapReduce are implemented in the stream data processing framework through the map modules and reduce modules. When a service requirement is received, metadata messages, whose size is far smaller than that of the service data, are grabbed from the determined message queue and distributed to the map modules; a map module obtains the corresponding service data through the storage location in the metadata message and processes it according to the service requirement. The stream data processing framework distributes the service data processed by the plurality of map modules to the plurality of reduce modules, so that the plurality of reduce modules merge the processed service data according to the at least one key value to obtain a merge processing result, which is then output. By grabbing the metadata messages in the message queue in real time and exploiting the real-time processing characteristics of the stream data processing framework, the pre-planning of the data to be processed required by the traditional approach is no longer needed before processing, and the stream data processing framework implementing the MapReduce functions can be used to quickly process the massive data generated in real time in the network.
BRIEF DESCRIPTION OF THE DRAWINGS
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a method flowchart of a data processing method provided by an embodiment of the present invention;
FIG. 2 is a method flowchart of a metadata message distribution determination method provided by an embodiment of the present invention;
FIG. 3 is a structural diagram of a data processing apparatus provided by an embodiment of the present invention.
DETAILED DESCRIPTION
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
As users act on the network, the Internet generates massive data in real time. This data can reveal users' preferences, behavior patterns, and so on, and is of great value to Internet enterprises.
The traditional way of processing massive data is to use the programming model MapReduce. However, MapReduce is mainly used for offline batch computation on massive data; that is, the data set to be processed must be pre-planned before processing. For data generated in real time, MapReduce must first collect data of a certain scale and pre-plan it before processing. Therefore, traditional MapReduce cannot effectively solve the current problem of quickly processing the massive data generated in real time in the network.
However, in specific Internet application scenarios, there are cases where the new data continuously generated by an application needs to be computed in real time and the computation result must be given quickly, for example, a seller placing advertisements, or a user adding followees on a microblog. Taking the advertisement placement scenario as an example: after customizing the crowd rules of interest, the seller hopes the advertisement can be delivered to the relevant crowd as soon as possible to bring revenue. Suppose the seller's crowd rule sets are identified by g1, g2, ..., and users are identified by u1, u2, .... The set of users (u1, u2, u3, ...) matching this crowd rule set may number in the tens of millions or even hundreds of millions. Meanwhile, to know which specific crowd rules a specific user matches, the seller needs to perform data processing on the data in the Internet to obtain the correspondence user (u) -> user set (g1, g2, g3, ...). Moreover, the processing system must handle thousands upon thousands of advertisement rules set by a large number of sellers every day, which requires the processing of the data generated in real time in the Internet to be fast, ideally so fast that while a user is browsing a relevant web page, the seller can already display to that user the advertisements related to that page. This processing effect clearly cannot be achieved by processing the data offline with traditional MapReduce. It can be seen that MapReduce cannot effectively solve the current problem of quickly processing the massive data generated in real time in the network.
To this end, the embodiments of the present invention provide a data processing method and apparatus. By using a stream data processing framework aimed at real-time data processing, the functions of traditional MapReduce are implemented in the stream data processing framework through map modules and reduce modules. When a service requirement is received, metadata messages, whose size is far smaller than that of the service data, are grabbed from the determined message queue and distributed to the map modules; a map module obtains the corresponding service data through the storage location in the metadata message and processes it according to the service requirement. The stream data processing framework distributes the service data processed by the plurality of map modules to the plurality of reduce modules, so that the plurality of reduce modules merge the processed service data according to the at least one key value to obtain a merge processing result, which is then output. By grabbing the metadata messages in the message queue in real time and exploiting the real-time processing characteristics of the stream data processing framework, the pre-planning of the data to be processed required by the traditional approach is no longer needed before processing, and the stream data processing framework implementing the MapReduce functions can be used to quickly process the massive data generated in real time in the network.
Before describing the embodiments of the present invention, the process of discovering and solving the above technical problem needs to be explained. The inventors carefully analyzed this technical problem and identified the characteristics of the application scenarios that give rise to it, namely, the need to process data generated in real time, in real time.
Although using MapReduce can effectively process massive network data and obtain accurate processing results for service requirements in most scenarios, for scenarios that require timely and fast processing of data generated in real time, since MapReduce needs to pre-plan the data to be processed, and since the overhead of one MapReduce run is considerable, the usual practice is to first collect such real-time data up to a certain scale (for example, comparable to the processing volume of one MapReduce run, to save computation cost), and then pre-plan and process the collected data. That is, although results meeting the requirement can be obtained, the timeliness is poor; in other words, the results cannot be given in time. This may cause the advertisement push to the user to fail, or the user may already have left the relevant page by the time of pushing, so the best advertisement-push effect cannot be achieved. After discovering this hidden technical defect, in order to remedy the lack of timeliness of the traditional processing method, the inventors purposefully selected a stream data processing framework for processing stream data and organically combined the processing characteristics of the stream data processing framework with those of MapReduce, that is, combined the real-time data processing capability of the stream data processing framework with MapReduce's suitability for processing massive data. Based on the stream data processing framework, the function processing capabilities of MapReduce are implemented on the functional modules of the stream data processing framework, thereby realizing a real-time MapReduce computation method.
The stream data processing framework described in the embodiments of the present invention can be understood as a computer program for processing stream data, or a computer service for processing stream data. The stream data processing framework can be installed and deployed on one server, on multiple servers, or on a server cluster; a server processes stream data by running the stream data processing framework deployed on it. The present invention does not limit the specific type of the stream data processing framework; for example, it may be Spark Streaming (a computing framework for real-time computation) or Storm (a processing framework for real-time processing of big data).
Taking Storm as an example to explain the combination with the functions of MapReduce: Storm can be used for "stream processing", processing data in real time. Storm's modules mainly include the Spout module (similar to a data source) and the Bolt module (a data processing module). According to the characteristics of the modules, the Spout module can be used to grab metadata messages from the message queue and to distribute the grabbed metadata messages, while the function of the map function is implemented on some Bolt modules and the function of the reduce function on other Bolt modules. A Bolt module implementing the map function can be identified as a map-bolt module, and a Bolt module implementing the reduce function can be identified as a reduce-bolt module.
The map-bolt module receives the metadata messages distributed by the Spout module and uses them to obtain the service data to be processed. Between the map-bolt module and the reduce-bolt module, the shuffle method can still be used to send the processed data from the map-bolt module to the reduce-bolt module. The reduce-bolt module merges the processed data, and Storm outputs the result of the merge processing, thereby achieving fast processing of the massive data generated in real time in the network.
For other stream data processing frameworks, such as Spark Streaming, the functions of the map function and the reduce function can likewise be implemented on the corresponding modules in Spark Streaming. They are not exhaustively enumerated here.
The merge processing result output by the stream data processing framework can be further processed according to different scenario requirements; for example, it can be saved in persistent storage or added to a data queue.
FIG. 1 is a method flowchart of a data processing method provided by an embodiment of the present invention, applied to a stream data processing framework, the stream data processing framework including a plurality of map modules and a plurality of reduce modules, the method including:
S101: The stream data processing framework obtains a service requirement, the service requirement including at least one key value.
For example, the service requirement here can be understood as a seller's advertisement placement requirement, and the key value may include a key, a value, or a key/value pair. Taking a seller placing advertisements as an example, the key value may be a specific crowd rule, a specific crowd rule set, or a correspondence between a specific crowd rule set and users.
S102: The stream data processing framework sequentially grabs a plurality of metadata messages from a message queue used for storing metadata messages, where the message queue is determined by the service requirement, the metadata messages are in one-to-one correspondence with service messages, each metadata message includes the storage location of the corresponding service data, and the metadata messages are added to the message queue in order of the generation time of the corresponding service data.
For example, the message queue here may be determined according to the service requirement, and different service requirements may determine different message queues. The message queue mainly contains metadata messages. When a piece of service data is generated in the network system, a metadata message identifying that service data is correspondingly generated; the data volume of this metadata message is generally far smaller than that of the corresponding service data. The metadata message may include the storage location of the corresponding service data, for example identifying in which location of which server the service data is stored, or in which location of a distributed file system it is stored. The metadata message may also include the data type of the corresponding service data, identifying the data category, data characteristics, and so on of the corresponding service data.
When grabbing metadata messages, the stream data processing framework may grab them sequentially according to their order in the message queue. The message queue is generally ordered such that newly enqueued metadata messages are placed at the tail of the queue. The number of metadata messages grabbed at one time can be preset, and is generally set to a value larger than the processing capacity of the stream data processing framework. The stream data processing framework can distribute the metadata messages grabbed at one time to the map modules in multiple batches, thereby reducing the frequency with which it grabs metadata messages from the message queue.
In the case where the stream data processing framework grabs too many metadata messages at one time to distribute them all at once, an embodiment of the present invention provides a caching idea: the metadata messages that have been grabbed but not yet distributed can be cached in a buffer. Correspondingly, before the stream data processing framework again grabs metadata messages from the message queue, a step of determining whether there are metadata messages in the cache can be added.
Optionally, before the stream data processing framework again sequentially grabs metadata messages from the message queue, the method further includes:
the stream data processing framework detects whether the buffer still holds metadata messages, where the buffer holds the metadata messages that the stream data processing framework has grabbed from the message queue and that have not yet been distributed to the map modules;
if yes, the stream data processing framework distributes the metadata messages from the buffer to the plurality of map modules;
if no, the stream data processing framework again sequentially grabs metadata messages from the message queue.
That is, when the stream data processing framework distributes metadata messages, the metadata messages stored in the buffer are distributed preferentially. In this way, the service data can be processed, as far as possible, in the order of the message queue, ensuring the time sequence.
It should be noted that, since the stream data processing framework generally needs to grab metadata messages from the message queue multiple times to complete the data processing for one service requirement, in order to ensure the accuracy of the data processing result, it must be guaranteed that no service data is missed. It is therefore essential to accurately determine the grab position in the message queue for the next grab of metadata messages. To this end, an embodiment of the present invention provides a way of determining the next grab position by recording the processing status of metadata messages.
Optionally, the stream data processing framework records the processing status of the plurality of metadata messages in the stream data processing framework, where the processing status includes a success status and a failure status; if the service data corresponding to a metadata message is output as a merge processing result, the processing status of that metadata message is the success status; if the service data corresponding to a metadata message fails to be output as a merge processing result within a predetermined time, the processing status of that metadata message is the failure status.
The stream data processing framework updates the location information in a location register according to the processing status, where the location information in the location register is the location, in the message queue, at which the stream data processing framework will grab metadata messages next time; if there is a metadata message with the failure status among the plurality of metadata messages, the location information in the location register is the location information of a first metadata message, the first metadata message being the failure-status metadata message that is closer to the queue head of the message queue; if there is no metadata message with the failure status among the plurality of metadata messages, the location information in the location register is the location information of a second metadata message, the second metadata message being the metadata message, among the plurality of metadata messages, that is closer to the queue tail of the message queue.
For example, the location register may be a persistent storage unit; for instance, Zookeeper (a distributed, open-source application coordination service) can provide a similar storage service. With persistent storage, even if the stream data processing framework restarts or loses power during data processing, when it starts again it can continue grabbing metadata messages from the position in the message queue indicated by the location information stored in the location register.
In the period from when a metadata message starts to be distributed to a map module (for example, S103) until the service data corresponding to that metadata message is processed and output by the stream data processing framework (S105), if any step is unsuccessful, the processing status of that metadata message can be regarded as the failure status. For example, the map module that received the metadata message may fail to obtain the corresponding service data according to the storage location it carries; the map module may fail in processing the service data corresponding to the metadata message; or the stream data processing framework may fail to distribute the service data processed by the map module (the service data corresponding to this metadata message) to a reduce module, and so on. It should be noted that, besides explicit failures (for example, receiving a message confirming the processing failure), processing timeouts are also included, for example a long period with no response in one of the processing steps above.
The stream data processing framework can confirm whether the processing status of a metadata message is the success status (an acknowledgement message received) or the failure status (a failure message received) by receiving acknowledgement messages (abbreviation: ACK) and failure messages (Fail) for the metadata message.
It should be noted that, if the processing status of a metadata message remains the failure status after multiple attempts, in order to improve processing efficiency, this metadata message can be discarded and no longer processed, while the location information of other metadata messages continues to be updated in the location register.
S103: The stream data processing framework distributes the grabbed metadata messages to the plurality of map modules, so that the map modules obtain the corresponding service data according to the storage locations in the received metadata messages, and process the obtained service data according to the at least one key value.
For example, when distributing metadata messages, in order to keep the processing balanced, the stream data processing framework can distribute a batch of metadata messages roughly evenly among the map modules; for example, for a batch of 10 metadata messages with only 5 map modules working, each map module can be assigned 2 metadata messages. The advantage of even distribution is that the map modules can complete their processing tasks in roughly the same time, ready to receive the next distribution of metadata messages from the stream data processing framework.
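The even distribution described above (for example, 10 messages over 5 map modules, 2 each) can be sketched with a simple round-robin assignment. This is an illustrative scheme, not the patent's mandated one:

```python
def distribute_evenly(messages, num_map_modules):
    """Spread a batch of metadata messages roughly evenly across the
    map modules, round-robin, so all modules finish at about the same
    time and are ready for the next batch together."""
    assignments = [[] for _ in range(num_map_modules)]
    for i, msg in enumerate(messages):
        assignments[i % num_map_modules].append(msg)
    return assignments

# 10 metadata messages over 5 map modules -> 2 messages per module.
batches = distribute_evenly([f"m{i}" for i in range(10)], 5)
```

When the batch size is not a multiple of the module count, round-robin still keeps the per-module load within one message of even.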
The service data may be stored in a distributed file system, and the map module can obtain the corresponding service data from the corresponding storage location in the distributed file system according to the storage location in the obtained metadata message. After obtaining the service data, the map module determines, from the service data, the data relationships related to the key value in the service requirement. For example, if the key value of the service requirement is the correspondence user (u1) -> user set (g1, g2, g3...), the map module can, through processing, determine from the service data the data relationships related to the key value, such as u1->g1, u1->g2, and so on.
S104: The stream data processing framework distributes the service data processed by the plurality of map modules to the plurality of reduce modules, so that the reduce modules merge the processed service data according to the at least one key value to obtain a merge processing result.
For example, the process in which the stream data processing framework distributes the service data processed by the plurality of map modules to the plurality of reduce modules is similar to the Shuffle process in MapReduce, where the processed service data with the same key value can be sent to the same reduce module for merge processing.
It should be noted that, in some cases, the data volume of the service data processed by a map module will be large; if it is transmitted directly to the reduce module without treatment, it will put considerable pressure on the transmission link and may even cause problems such as transmission errors. To this end, optionally, the stream data processing framework distributing the service data processed by the plurality of map modules to the plurality of reduce modules further includes:
if the data volume of a piece of processed service data is greater than a preset transmission amount, the stream data processing framework partially merges that processed service data before distributing it to the reduce module, so as to compress its data volume.
S105: The stream data processing framework outputs the merge processing result.
For example, the merge processing result of the multiple reduce modules is equivalent to the processing result for the service requirement; the stream data processing framework may output the merge processing result, and the system then performs subsequent processing on the output merge processing result according to different scenarios and service requirements, for example storing the merge processing result in persistent storage.
It can be seen that, by using a stream data processing framework aimed at real-time data processing, the functions of traditional MapReduce are implemented in the stream data processing framework through the map modules and reduce modules. When a service requirement is received, metadata messages, whose size is far smaller than that of the service data, are grabbed from the determined message queue and distributed to the map modules; a map module obtains the corresponding service data through the storage location in the metadata message and processes it according to the service requirement. The stream data processing framework distributes the service data processed by the plurality of map modules to the plurality of reduce modules, so that the plurality of reduce modules merge the processed service data according to the at least one key value to obtain a merge processing result, which is then output. By grabbing the metadata messages in the message queue in real time and exploiting the real-time processing characteristics of the stream data processing framework, the pre-planning of the data to be processed required by the traditional approach is no longer needed before processing, and the stream data processing framework implementing the MapReduce functions can be used to quickly process the massive data generated in real time in the network.
During processing, the stream data processing framework may determine the timing of distributing metadata messages to the plurality of map modules by analyzing its remaining processing capacity.
Optionally, on the basis of the embodiment corresponding to FIG. 1, FIG. 2 is a method flowchart of a metadata message distribution determination method provided by an embodiment of the present invention; before the stream data processing framework distributes the grabbed metadata messages to the plurality of map modules, the method further includes:
S201: The stream data processing framework calculates the consumed capacity in real time, where the consumed capacity is the processing capacity consumed by the stream data processing framework to process the metadata messages distributed to the plurality of map modules;
S202: The stream data processing framework determines whether the difference between the total processing capacity of the stream data processing framework and the consumed capacity meets a preset threshold; if yes, S203 is executed.
S203: The stream data processing framework distributes the grabbed metadata messages to the plurality of map modules.
For example, the distribution rate of the stream data processing framework needs to be controlled carefully: if it is too fast, the number of messages entering the subsequent map modules may surge in a short period of time, so that the system resources consumed by the map modules and reduce modules to process the corresponding service data exceed the capacity of the stream data processing framework or of the system in which it resides; if it is too slow, too many metadata messages accumulate in the message queue and the real-time performance of processing decreases. To solve this problem, the stream data processing framework calculates the consumed capacity in real time. The consumed capacity can be understood as the capacity currently consumed by the stream data processing framework to perform data processing, and the total processing capacity of the stream data processing framework can be understood as the capacity consumed when the framework operates at its maximum processing capability. The consumed capacity and the total processing capacity can be measured in terms of resources such as memory, CPU, and network cards.
Suppose the consumed capacity is Rcurr and the total processing capacity is Rmax; then the remaining capacity of the stream data processing framework, that is, the difference between its total processing capacity and the consumed capacity, is Rleft = Rmax - Rcurr. If the capacity required by the metadata messages that the stream data processing framework needs to distribute (that is, the preset threshold) is greater than Rleft, distributing the metadata messages at this point may cause processing delays in the stream data processing framework and reduce processing efficiency. Therefore, continuing to distribute metadata messages to the plurality of map modules only when the difference meets the preset threshold makes more effective use of system resources and improves the efficiency of data processing.
It should be noted that, if the difference does not meet the preset threshold, that is, the remaining processing capacity of the stream data processing framework is currently insufficient to process the metadata messages to be distributed, the framework may wait. Once the stream data processing framework finishes processing the current batch of metadata messages, it releases the corresponding processing capacity; after the difference meets the preset threshold, the stream data processing framework may again distribute the grabbed metadata messages to the plurality of map modules.
FIG. 3 is a structural diagram of a data processing apparatus provided by an embodiment of the present invention, applied to a stream data processing framework, the stream data processing framework including a plurality of map modules and a plurality of reduce modules, the apparatus including:
an obtaining unit 301 configured to obtain a service requirement, the service requirement including at least one key value;
a grabbing unit 302 configured to sequentially grab a plurality of metadata messages from a message queue used for storing metadata messages, where the message queue is determined by the service requirement, the metadata messages are in one-to-one correspondence with service messages, each metadata message includes the storage location of the corresponding service data, and the metadata messages are added to the message queue in order of the generation time of the corresponding service data.
In the case where the stream data processing framework grabs too many metadata messages at one time to distribute them all at once, an embodiment of the present invention provides a caching idea: the metadata messages that have been grabbed but not yet distributed can be cached in a buffer. Optionally, the apparatus further includes:
a detecting unit configured to detect, before the grabbing unit 302 is triggered, whether the buffer still holds metadata messages, where the buffer holds the metadata messages that the stream data processing framework has grabbed from the message queue and that have not yet been distributed to the map modules.
If yes, the first distribution unit 303 is triggered; if no, the grabbing unit 302 is triggered.
The first distribution unit 303 is further configured to distribute the metadata messages from the buffer to the plurality of map modules.
That is, when the stream data processing framework distributes metadata messages, the metadata messages stored in the buffer are distributed preferentially. In this way, the service data can be processed, as far as possible, in the order of the message queue, ensuring the time sequence.
It should be noted that, since the stream data processing framework generally needs to grab metadata messages from the message queue multiple times to complete the data processing for one service requirement, in order to ensure the accuracy of the data processing result, it must be guaranteed that no service data is missed. It is therefore essential to accurately determine the grab position in the message queue for the next grab of metadata messages. To this end, an embodiment of the present invention provides a way of determining the next grab position by recording the processing status of metadata messages. Optionally, the apparatus further includes:
a recording unit configured to record the processing status of the plurality of metadata messages in the stream data processing framework, where the processing status includes a success status and a failure status; if the service data corresponding to a metadata message is output as a merge processing result, the processing status of that metadata message is the success status; if the service data corresponding to a metadata message fails to be output as a merge processing result within a predetermined time, the processing status of that metadata message is the failure status;
an updating unit configured to update the location information in a location register according to the processing status, where the location information in the location register is the location, in the message queue, at which the stream data processing framework will grab metadata messages next time; if there is a metadata message with the failure status among the plurality of metadata messages, the location information in the location register is the location information of a first metadata message, the first metadata message being the failure-status metadata message that is closer to the queue head of the message queue; if there is no metadata message with the failure status among the plurality of metadata messages, the location information in the location register is the location information of a second metadata message, the second metadata message being the metadata message, among the plurality of metadata messages, that is closer to the queue tail of the message queue.
A first distribution unit 303 is configured to distribute the grabbed metadata messages to the plurality of map modules, so that the map modules obtain the corresponding service data according to the storage locations in the received metadata messages, and process the obtained service data according to the at least one key value.
A second distribution unit 304 is configured to distribute the service data processed by the plurality of map modules to the plurality of reduce modules, so that the reduce modules merge the processed service data according to the at least one key value to obtain a merge processing result.
It should be noted that, in some cases, the data volume of the service data processed by a map module will be large; if it is transmitted directly to the reduce module without treatment, it will put considerable pressure on the transmission link and may even cause problems such as transmission errors. To this end, optionally, the second distribution unit is further configured to, if the data volume of a piece of processed service data is greater than a preset transmission amount, partially merge that processed service data before distributing it to the reduce module, so as to compress its data volume.
An output unit 305 is configured to output the merge processing result.
It can be seen that, by using a stream data processing framework aimed at real-time data processing, the functions of traditional MapReduce are implemented in the stream data processing framework through the map modules and reduce modules. When a service requirement is received, metadata messages, whose size is far smaller than that of the service data, are grabbed from the determined message queue and distributed to the map modules; a map module obtains the corresponding service data through the storage location in the metadata message and processes it according to the service requirement. The stream data processing framework distributes the service data processed by the plurality of map modules to the plurality of reduce modules, so that the plurality of reduce modules merge the processed service data according to the at least one key value to obtain a merge processing result, which is then output. By grabbing the metadata messages in the message queue in real time and exploiting the real-time processing characteristics of the stream data processing framework, the pre-planning of the data to be processed required by the traditional approach is no longer needed before processing, and the stream data processing framework implementing the MapReduce functions can be used to quickly process the massive data generated in real time in the network.
During processing, the stream data processing framework may determine the timing of distributing metadata messages to the plurality of map modules by analyzing its remaining processing capacity.
Optionally, the apparatus further includes:
a calculating unit configured to calculate the consumed capacity in real time before the first distribution unit 303 is triggered, where the consumed capacity is the processing capacity consumed by the stream data processing framework to process the metadata messages distributed to the plurality of map modules;
a determining unit configured to determine whether the difference between the total processing capacity of the stream data processing framework and the consumed capacity meets a preset threshold,
and if yes, the first distribution unit is triggered.
It should be noted that, if the difference does not meet the preset threshold, that is, the remaining processing capacity of the stream data processing framework is currently insufficient to process the metadata messages to be distributed, the framework may wait. Once the stream data processing framework finishes processing the current batch of metadata messages, it releases the corresponding processing capacity; after the difference meets the preset threshold, the stream data processing framework may again distribute the grabbed metadata messages to the plurality of map modules.
Those of ordinary skill in the art can understand that all or part of the steps of the above method embodiments can be completed by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium; when executed, the program performs the steps of the above method embodiments. The aforementioned storage medium may be at least one of the following media capable of storing program code: read-only memory (ROM), RAM, magnetic disk, optical disc, and the like.
It should be noted that the embodiments in this specification are described in a progressive manner; for identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, since the device and system embodiments are basically similar to the method embodiments, their description is relatively simple, and for relevant parts reference may be made to the description of the method embodiments. The device and system embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solutions of the embodiments. Those of ordinary skill in the art can understand and implement them without creative effort.
The above are merely preferred specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement readily conceivable by those skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

  1. A data processing method, characterized in that it is applied to a stream data processing framework, the stream data processing framework comprising a plurality of map modules and a plurality of reduce modules, the method comprising:
    the stream data processing framework obtaining a service requirement, the service requirement comprising at least one key value;
    the stream data processing framework sequentially grabbing a plurality of metadata messages from a message queue used for storing metadata messages, wherein the message queue is determined by the service requirement, the metadata messages are in one-to-one correspondence with service messages, each metadata message comprises a storage location of the corresponding service data, and the metadata messages are added to the message queue in order of the generation time of the corresponding service data;
    the stream data processing framework distributing the grabbed metadata messages to the plurality of map modules, so that the plurality of map modules obtain the corresponding service data according to the storage locations in the received metadata messages, and process the obtained service data according to the at least one key value;
    the stream data processing framework distributing the service data processed by the plurality of map modules to the plurality of reduce modules, so that the plurality of reduce modules merge the processed service data according to the at least one key value to obtain a merge processing result;
    the stream data processing framework outputting the merge processing result.
  2. The method according to claim 1, characterized by further comprising:
    the stream data processing framework recording the processing status of the plurality of metadata messages in the stream data processing framework, the processing status comprising a success status and a failure status, wherein if the service data corresponding to a metadata message is output as a merge processing result, the processing status of that metadata message is the success status, and if the service data corresponding to a metadata message fails to be output as a merge processing result within a predetermined time, the processing status of that metadata message is the failure status;
    the stream data processing framework updating location information in a location register according to the processing status, wherein the location information in the location register is the location, in the message queue, at which the stream data processing framework will grab metadata messages next time; if there is a metadata message with the failure status among the plurality of metadata messages, the location information in the location register is the location information of a first metadata message, the first metadata message being the failure-status metadata message that is closer to the queue head of the message queue; and if there is no metadata message with the failure status among the plurality of metadata messages, the location information in the location register is the location information of a second metadata message, the second metadata message being the metadata message, among the plurality of metadata messages, that is closer to the queue tail of the message queue.
  3. The method according to claim 1 or 2, characterized in that before the stream data processing framework again sequentially grabs metadata messages from the message queue, the method further comprises:
    the stream data processing framework detecting whether the buffer still holds metadata messages, wherein the buffer holds the metadata messages that the stream data processing framework has grabbed from the message queue and that have not yet been distributed to the map modules;
    if yes, the stream data processing framework distributing the metadata messages from the buffer to the plurality of map modules;
    if no, executing the step of the stream data processing framework again sequentially grabbing metadata messages from the message queue.
  4. The method according to claim 1 or 2, characterized in that before the stream data processing framework distributes the grabbed metadata messages to the plurality of map modules, the method further comprises:
    the stream data processing framework calculating the consumed capacity in real time, wherein the consumed capacity is the processing capacity consumed by the stream data processing framework to process the metadata messages distributed to the plurality of map modules;
    the stream data processing framework determining whether the difference between the total processing capacity of the stream data processing framework and the consumed capacity meets a preset threshold,
    and if yes, the stream data processing framework distributing the grabbed metadata messages to the plurality of map modules.
  5. The method according to claim 1, characterized in that the stream data processing framework distributing the service data processed by the plurality of map modules to the plurality of reduce modules comprises:
    if the data volume of a piece of processed service data is greater than a preset transmission amount, the stream data processing framework partially merging that processed service data before distributing it to the reduce module, so as to compress the data volume of that processed service data.
  6. A data processing apparatus, characterized in that it is applied to a stream data processing framework, the stream data processing framework comprising a plurality of map modules and a plurality of reduce modules, the apparatus comprising:
    an obtaining unit configured to obtain a service requirement, the service requirement comprising at least one key value;
    a grabbing unit configured to sequentially grab a plurality of metadata messages from a message queue used for storing metadata messages, wherein the message queue is determined by the service requirement, the metadata messages are in one-to-one correspondence with service messages, each metadata message comprises a storage location of the corresponding service data, and the metadata messages are added to the message queue in order of the generation time of the corresponding service data;
    a first distribution unit configured to distribute the grabbed metadata messages to the plurality of map modules, so that the plurality of map modules obtain the corresponding service data according to the storage locations in the received metadata messages, and process the obtained service data according to the at least one key value;
    a second distribution unit configured to distribute the service data processed by the plurality of map modules to the plurality of reduce modules, so that the plurality of reduce modules merge the processed service data according to the at least one key value to obtain a merge processing result;
    an output unit configured to output the merge processing result.
  7. The apparatus according to claim 6, characterized by further comprising:
    a recording unit configured to record the processing status of the plurality of metadata messages in the stream data processing framework, the processing status comprising a success status and a failure status, wherein if the service data corresponding to a metadata message is output as a merge processing result, the processing status of that metadata message is the success status, and if the service data corresponding to a metadata message fails to be output as a merge processing result within a predetermined time, the processing status of that metadata message is the failure status;
    an updating unit configured to update location information in a location register according to the processing status, wherein the location information in the location register is the location, in the message queue, at which the stream data processing framework will grab metadata messages next time; if there is a metadata message with the failure status among the plurality of metadata messages, the location information in the location register is the location information of a first metadata message, the first metadata message being the failure-status metadata message that is closer to the queue head of the message queue; and if there is no metadata message with the failure status among the plurality of metadata messages, the location information in the location register is the location information of a second metadata message, the second metadata message being the metadata message, among the plurality of metadata messages, that is closer to the queue tail of the message queue.
  8. The apparatus according to claim 6 or 7, characterized by further comprising:
    a detecting unit configured to detect, before the grabbing unit is triggered, whether the buffer still holds metadata messages, wherein the buffer holds the metadata messages that the stream data processing framework has grabbed from the message queue and that have not yet been distributed to the map modules;
    if yes, the first distribution unit is triggered; if no, the grabbing unit is triggered;
    the first distribution unit is further configured to distribute the metadata messages from the buffer to the plurality of map modules.
  9. The apparatus according to claim 6 or 7, characterized by further comprising:
    a calculating unit configured to calculate the consumed capacity in real time before the first distribution unit is triggered, wherein the consumed capacity is the processing capacity consumed by the stream data processing framework to process the metadata messages distributed to the plurality of map modules;
    a determining unit configured to determine whether the difference between the total processing capacity of the stream data processing framework and the consumed capacity meets a preset threshold,
    and if yes, the first distribution unit is triggered.
  10. The apparatus according to claim 6, characterized in that the second distribution unit is further configured to, if the data volume of a piece of processed service data is greater than a preset transmission amount, partially merge that processed service data before distributing it to the reduce module, so as to compress the data volume of that processed service data.
PCT/CN2016/106580 2015-12-01 2016-11-21 Data processing method and apparatus WO2017092582A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510867351.7 2015-12-01
CN201510867351.7A CN106815254B (zh) 2015-12-01 2015-12-01 Data processing method and apparatus

Publications (1)

Publication Number Publication Date
WO2017092582A1 true WO2017092582A1 (zh) 2017-06-08

Family

ID=58796355

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/106580 WO2017092582A1 (zh) 2015-12-01 2016-11-21 Data processing method and apparatus

Country Status (2)

Country Link
CN (1) CN106815254B (zh)
WO (1) WO2017092582A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143415A (zh) * 2019-12-26 2020-05-12 政采云有限公司 Data processing method and apparatus, and computer-readable storage medium
CN111339071A (zh) * 2020-02-21 2020-06-26 苏宁云计算有限公司 Method and apparatus for processing multi-source heterogeneous data
CN111400059A (zh) * 2020-03-09 2020-07-10 五八有限公司 Data processing method and data processing apparatus
CN111400390A (zh) * 2020-04-08 2020-07-10 上海东普信息科技有限公司 Data processing method and apparatus
CN113034178A (zh) * 2021-03-15 2021-06-25 深圳市麦谷科技有限公司 Multi-system points calculation method, apparatus, terminal device and storage medium
CN113609202A (zh) * 2021-08-11 2021-11-05 湖南快乐阳光互动娱乐传媒有限公司 Data processing method and apparatus

Families Citing this family (3)

Publication number Priority date Publication date Assignee Title
CN108009849B (zh) * 2017-11-30 2021-12-17 北京小度互娱科技有限公司 Method and apparatus for generating account status
CN107872465A (zh) * 2017-12-05 2018-04-03 全球能源互联网研究院有限公司 Distributed network security monitoring method and system
CN112667411B (zh) * 2019-10-16 2022-12-13 中移(苏州)软件技术有限公司 Data processing method and apparatus, electronic device, and computer storage medium

Citations (3)

Publication number Priority date Publication date Assignee Title
CN103530182A (zh) * 2013-10-22 2014-01-22 海南大学 Job scheduling method and apparatus
CN104903894A (zh) * 2013-01-07 2015-09-09 脸谱公司 System and method for a distributed database query engine
CN104951509A (zh) * 2015-05-25 2015-09-30 中国科学院信息工程研究所 Big data online interactive query method and system

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
CN102377824A (zh) * 2011-10-19 2012-03-14 江西省南城县网信电子有限公司 Spatial information service system based on cloud computing
CN102521232B (zh) * 2011-11-09 2014-05-07 Ut斯达康通讯有限公司 Distributed collection and processing system and method for Internet metadata
US9639575B2 (en) * 2012-03-30 2017-05-02 Khalifa University Of Science, Technology And Research Method and system for processing data queries

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
CN104903894A (zh) * 2013-01-07 2015-09-09 脸谱公司 System and method for a distributed database query engine
CN103530182A (zh) * 2013-10-22 2014-01-22 海南大学 Job scheduling method and apparatus
CN104951509A (zh) * 2015-05-25 2015-09-30 中国科学院信息工程研究所 Big data online interactive query method and system

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143415A (zh) * 2019-12-26 2020-05-12 Zhengcaiyun Co., Ltd. Data processing method, apparatus, and computer-readable storage medium
CN111143415B (zh) * 2019-12-26 2023-12-29 Zhengcaiyun Co., Ltd. Data processing method, apparatus, and computer-readable storage medium
CN111339071A (zh) * 2020-02-21 2020-06-26 Suning Cloud Computing Co., Ltd. Method and apparatus for processing multi-source heterogeneous data
CN111339071B (zh) * 2020-02-21 2022-11-18 Suning Cloud Computing Co., Ltd. Method and apparatus for processing multi-source heterogeneous data
CN111400059A (zh) * 2020-03-09 2020-07-10 Wuba Co., Ltd. Data processing method and data processing apparatus
CN111400059B (zh) * 2020-03-09 2023-11-14 Wuba Co., Ltd. Data processing method and data processing apparatus
CN111400390A (zh) * 2020-04-08 2020-07-10 Shanghai Dongpu Information Technology Co., Ltd. Data processing method and apparatus
CN111400390B (zh) * 2020-04-08 2023-11-17 Shanghai Dongpu Information Technology Co., Ltd. Data processing method and apparatus
CN113034178A (zh) * 2021-03-15 2021-06-25 Shenzhen Maigu Technology Co., Ltd. Multi-system points calculation method, apparatus, terminal device, and storage medium
CN113609202A (zh) * 2021-08-11 2021-11-05 Hunan Happy Sunshine Interactive Entertainment Media Co., Ltd. Data processing method and apparatus

Also Published As

Publication number Publication date
CN106815254B (zh) 2020-08-14
CN106815254A (zh) 2017-06-09

Similar Documents

Publication Publication Date Title
WO2017092582A1 (zh) Data processing method and apparatus
US11411825B2 (en) In intelligent autoscale of services
CN111124819B (zh) Full-link monitoring method and apparatus
CN113037823B (zh) Message passing system and method
JP6030144B2 (ja) Method and system for distributed data stream processing
US8935395B2 (en) Correlation of distributed business transactions
CN108776934B (zh) Distributed data computation method and apparatus, computer device, and readable storage medium
US11646972B2 (en) Dynamic allocation of network resources using external inputs
WO2020258290A1 (zh) Log data collection method, log data collection apparatus, storage medium, and log data collection system
CN111459986B (zh) Data computation system and method
CN109726074A (zh) Log processing method and apparatus, computer device, and storage medium
US20120072575A1 (en) Methods and computer program products for aggregating network application performance metrics by process pool
US11570078B2 (en) Collecting route-based traffic metrics in a service-oriented system
JP2012043409A (ja) Computer-implemented method, system, and computer program for processing data streams
CN111586126A (zh) Mini-program pre-download method, apparatus, device, and storage medium
US20140280610A1 (en) Identification of users for initiating information spreading in a social network
WO2017181614A1 (zh) Streaming data positioning method and apparatus, and electronic device
US20190005534A1 (en) Providing media assets to subscribers of a messaging system
US20200110826A1 (en) Efficient event correlation in a streaming environment
CN106126519A (zh) Media information display method and server
CN111639902A (zh) Kafka-based data auditing method, control apparatus, computer device, and storage medium
US10574765B2 (en) Method, device, and non-transitory computer-readable recording medium
CN110493250A (zh) Web front-end ArcGIS resource request processing method and apparatus
CN110245120B (zh) Stream computing system and log data processing method for a stream computing system
CN107480189A (zh) Multi-dimensional real-time analysis system and method

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 16869892

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry into the European phase

Ref document number: 16869892

Country of ref document: EP

Kind code of ref document: A1