WO2017092582A1 - Data processing method and apparatus - Google Patents

Data processing method and apparatus

Info

Publication number
WO2017092582A1
Authority
WO
WIPO (PCT)
Prior art keywords
message
metadata
data processing
metadata message
processing framework
Prior art date
Application number
PCT/CN2016/106580
Other languages
English (en)
Chinese (zh)
Inventor
冯粮城
李俊良
强琦
Original Assignee
阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Priority date
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Publication of WO2017092582A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • the present invention relates to the field of data processing, and in particular, to a data processing method and apparatus.
  • MapReduce is a programming model for parallel computing on large data sets.
  • the Map function is mainly used to process data according to the requirements.
  • the requirements can be keys (English: key), values (English: value), or key-value pairs (English: key/value pair).
  • the data processed by the Map function is transmitted to the Reduce function through an intermediate transmission process (English: Shuffle), and the Reduce function combines the processed data according to the requirements to obtain a data processing result that meets the requirements.
  • MapReduce runs primarily on the Hadoop platform (a distributed system infrastructure) and is mainly used for offline batch computing over massive amounts of data. That is to say, MapReduce must pre-plan the data set to be processed, for example which data is handled by which Map function, before the data set can be batch processed.
  • MapReduce is therefore mainly suited to processing offline data, and it is difficult for it to quickly process data generated in real time. It can be seen that traditional MapReduce cannot effectively solve the problem of rapidly processing the massive data generated in real time in the network.
  • the present invention provides a data processing method and apparatus that achieve rapid processing of massive data generated in real time in a network.
  • a data processing method is applied to a stream data processing framework, where the stream data processing framework includes a plurality of map modules and a plurality of reduce modules, and the method includes:
  • the streaming data processing framework obtains a service requirement, and the service requirement includes at least one key value;
  • the stream data processing framework sequentially fetches a plurality of metadata messages from a message queue for storing metadata messages, where the message queue is determined by the service requirement and the metadata messages correspond one-to-one to service data;
  • each metadata message includes the storage location of the corresponding service data, and the metadata messages are added to the message queue in the order in which the corresponding service data was generated;
  • the stream data processing framework distributes the fetched metadata messages to the plurality of map modules, so that the plurality of map modules acquire the corresponding service data according to the storage locations in the received metadata messages, and the plurality of map modules respectively process the acquired service data according to the at least one key value;
  • the stream data processing framework distributes the service data processed by the plurality of map modules to the plurality of reduce modules, so that the plurality of reduce modules merge the processed service data according to the at least one key value to obtain a merge processing result;
  • the stream data processing framework outputs the merge processing result.
  • it also includes:
  • the stream data processing framework records a processing status for the plurality of metadata messages, the processing status including a success status and a failure status: if the service data corresponding to a metadata message is output as a merge processing result, the processing status of that metadata message is the success status; if the service data corresponding to a metadata message cannot be output as a merge processing result within a predetermined time, the processing status of that metadata message is the failure status;
  • the stream data processing framework updates the location information in a location register according to the processing statuses, the location information in the location register being the location at which the stream data processing framework next fetches metadata messages from the message queue: if any of the plurality of metadata messages has the failure status, the location information in the location register is the location of the first metadata message, the first metadata message being the failed metadata message closest to the head of the message queue; if none of the plurality of metadata messages has the failure status, the location information in the location register is the location of the second metadata message, the second metadata message being the metadata message of the plurality closest to the tail of the message queue.
  • the method further includes:
  • the stream data processing framework detects whether the buffer still holds metadata messages, the buffer storing metadata messages that the stream data processing framework has fetched from the message queue but not yet distributed to the map modules;
  • if so, the stream data processing framework distributes the metadata messages from the buffer to the plurality of map modules.
  • the method further includes:
  • the stream data processing framework calculates the consumed capability in real time, the consumed capability being the processing capability the stream data processing framework spends on processing the metadata messages distributed to the plurality of map modules;
  • the stream data processing framework determines whether the difference between its total processing capability and the consumed capability meets a preset threshold;
  • if so, the stream data processing framework distributes the fetched metadata messages to the plurality of map modules.
  • the stream data processing framework distributing the service data processed by the multiple map modules to the multiple reduce modules includes:
  • if the volume of the processed service data is greater than a preset transmission amount, the stream data processing framework partially merges the processed service data before distributing it to the reduce modules, so as to compress the data volume of the processed service data.
  • a data processing apparatus is applied to a stream data processing framework, the stream data processing framework comprising a plurality of map modules and a plurality of reduce modules, the apparatus comprising:
  • an obtaining unit configured to obtain a service requirement, where the service requirement includes at least one key value;
  • a fetching unit configured to sequentially fetch a plurality of metadata messages from a message queue for storing metadata messages, where the message queue is determined by the service requirement and the metadata messages correspond one-to-one to service data;
  • each metadata message includes the storage location of the corresponding service data, and the metadata messages are added to the message queue in the order in which the corresponding service data was generated;
  • a first distribution unit configured to distribute the fetched metadata messages to the plurality of map modules, so that the plurality of map modules acquire the corresponding service data according to the storage locations in the received metadata messages, and the plurality of map modules respectively process the acquired service data according to the at least one key value;
  • a second distribution unit configured to distribute the service data processed by the plurality of map modules to the plurality of reduce modules, so that the plurality of reduce modules merge the processed service data according to the at least one key value to obtain a merge processing result;
  • it also includes:
  • a recording unit configured to record a processing status for the plurality of metadata messages in the stream data processing framework, where the processing status includes a success status and a failure status: if the service data corresponding to a metadata message is output as a merge processing result, the processing status of that metadata message is the success status; if the service data corresponding to a metadata message cannot be output as a merge processing result within a predetermined time, the processing status of that metadata message is the failure status;
  • an update unit configured to update the location information in the location register according to the processing statuses, where the location information in the location register is the location at which the stream data processing framework next fetches metadata messages from the message queue: if any of the plurality of metadata messages has the failure status, the location information in the location register is the location of the first metadata message, the first metadata message being the failed metadata message closest to the head of the message queue; if none of the plurality of metadata messages has the failure status, the location information in the location register is the location of the second metadata message, the second metadata message being the metadata message of the plurality closest to the tail of the message queue.
  • it also includes:
  • a detecting unit configured to detect, before the fetching unit is triggered, whether the buffer still holds metadata messages, where the buffer stores metadata messages that the stream data processing framework has fetched from the message queue but not yet distributed to the map modules;
  • the first distribution unit is further configured to distribute the metadata messages from the buffer to the plurality of map modules.
  • it also includes:
  • a calculating unit configured to calculate the consumed capability in real time before the first distribution unit is triggered, where the consumed capability is the processing capability the stream data processing framework spends on processing the metadata messages distributed to the plurality of map modules;
  • a determining unit configured to determine whether the difference between the total processing capability of the stream data processing framework and the consumed capability meets a preset threshold;
  • if so, the first distribution unit is triggered.
  • the second distribution unit is further configured to, if the volume of the processed service data is greater than a preset transmission amount, partially merge the processed service data before distributing it to the reduce modules, so as to compress the data volume of the processed service data.
  • by using a stream data processing framework capable of real-time data processing, the traditional MapReduce function is implemented in the stream data processing framework through the map modules and the reduce modules. When a service requirement is received, the framework fetches metadata messages, whose volume is much smaller than the service data, from the message queue determined by the service requirement and distributes them to the map modules. The map modules obtain the corresponding service data through the storage locations in the metadata messages and process it according to the service requirement. The stream data processing framework then distributes the service data processed by the plurality of map modules to the plurality of reduce modules, so that the plurality of reduce modules merge the processed service data according to the at least one key value to obtain a merge processing result, which is then output. Because the metadata messages in the message queue are fetched in real time and the real-time processing characteristics of the stream data processing framework are exploited, it is no longer necessary to pre-plan the data to be processed as in the traditional way, and the stream data processing framework implementing the MapReduce function can quickly process massive amounts of data generated in real time in the network.
  • FIG. 1 is a flowchart of a data processing method according to an embodiment of the present invention;
  • FIG. 2 is a flowchart of a method for determining the distribution of metadata messages according to an embodiment of the present invention;
  • FIG. 3 is a structural diagram of a device of a data processing apparatus according to an embodiment of the present invention.
  • MapReduce is mainly used for offline batch computation over massive data. That is to say, before processing, the data set to be processed must be pre-planned; for data generated in real time, MapReduce must first collect data of a certain size and plan it before it can be processed. It can be seen that traditional MapReduce cannot effectively solve the problem of rapidly processing the massive data generated in real time in the network.
  • the seller needs to obtain the correspondence between a user (u) and user collections (g1, g2, g3...) by processing data on the Internet;
  • the processing system must handle thousands of advertisement rules set by a large number of sellers every day, which requires that real-time data generated on the Internet be processed quickly, preferably while a user is still browsing the related webpages;
  • the seller can then display the advertisements associated with those related pages to the user.
  • MapReduce cannot effectively solve the problem that it needs to quickly process massive amounts of data generated in real time in the network.
  • an embodiment of the present invention provides a data processing method and apparatus.
  • the traditional MapReduce function is implemented in a stream data processing framework, using map modules and reduce modules, by employing a stream data processing framework capable of real-time data processing;
  • when a service requirement is received, metadata messages whose volume is much smaller than the service data are fetched from the message queue determined by the service requirement and distributed to the map modules; each map module obtains the corresponding service data through the storage location in the metadata message and processes it according to the service requirement;
  • the stream data processing framework distributes the service data processed by the plurality of map modules to the plurality of reduce modules, so that the plurality of reduce modules merge the processed service data according to the at least one key value to obtain a merge processing result, which is then output;
  • because the metadata messages in the message queue are fetched in real time and the real-time processing characteristics of the stream data processing framework are used, there is no need to pre-plan the data to be processed in the traditional way before processing, and the stream data processing framework implementing the MapReduce function can quickly process massive data generated in real time in the network.
  • although MapReduce, as used in most scenarios, can effectively process massive amounts of network data and obtain accurate processing results for service requirements, scenarios that require timely, fast processing of data generated in real time are different. Since MapReduce needs to pre-plan the data to be processed, and the overhead of using MapReduce for data processing is large, real-time generated data is generally first collected up to a certain scale (for example, a scale matched to MapReduce's processing volume, to save computing costs), and the collected data is then pre-planned and processed. That is to say, although a processing result meeting the demand can be obtained, its timeliness is poor, or the processing result cannot be given in time.
  • after discovering this hidden technical defect, the inventor set out to make up for the lack of timeliness in the traditional processing approach;
  • the inventor selected a stream data processing framework designed for processing stream data, and combined the processing characteristics of the stream data processing framework with those of MapReduce, uniting the framework's real-time processing capability with MapReduce's ease of processing massive data;
  • on the basis of the stream data processing framework, the MapReduce function is implemented on the framework's modules, thus realizing a real-time MapReduce computation method.
  • the stream data processing framework described in the embodiments of the present invention can be understood as a computer program for processing stream data or a computer service for processing stream data.
  • the streaming data processing framework can be installed and deployed in one server, in multiple servers, or in a server cluster.
  • the server implements the processing of the stream data by running the stream data processing framework deployed in itself.
  • the present invention does not limit the specific type of the stream data processing framework, and may be, for example, Spark Streaming (a computing framework for real-time computing) or Storm (a processing framework for processing big data in real time).
  • Storm can be used in "stream processing" to process data in real time.
  • Storm's modules mainly include Spout modules (similar to a data source) and Bolt modules (data processing modules).
  • the Spout module can be used to fetch metadata messages from the message queue and to distribute the fetched metadata messages; the function of the map function is implemented on one part of the Bolt modules, and the function of the reduce function on another part.
  • the Bolt module that implements the function of the map function can be identified as a map-bolt module
  • the Bolt module that implements the function of the reduce function can be identified as a reduce-bolt module.
  • the metadata messages distributed by the Spout module are received by the map-bolt modules, which thereby obtain the service data to be processed. Between the map-bolt modules and the reduce-bolt modules, the processed data can still be sent from the map-bolt modules to the reduce-bolt modules using the Shuffle method.
  • the reduce-bolt modules merge the processed data, and Storm outputs the result of the merge processing, which enables rapid processing of massive amounts of data generated in real time in the network.
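The Spout -> map-bolt -> reduce-bolt flow described above can be illustrated with a minimal, purely hypothetical in-memory sketch (this is not the Storm API or the patent's actual implementation; the `STORAGE` map stands in for a distributed file system, and all names are made up for illustration):

```python
from collections import defaultdict

# Hypothetical storage location -> service data (stands in for a DFS).
STORAGE = {
    "/data/1": [("u1", "g1"), ("u1", "g2")],
    "/data/2": [("u1", "g3"), ("u2", "g1")],
}

def spout(message_queue):
    """Fetch metadata messages in queue order (oldest first)."""
    for meta in message_queue:
        yield meta

def map_bolt(meta):
    """Load the service data named by the metadata message's storage
    location and emit (key, value) pairs."""
    return STORAGE[meta["location"]]

def reduce_bolt(pairs):
    """Merge processed records that share the same key."""
    merged = defaultdict(list)
    for key, value in pairs:
        merged[key].append(value)
    return dict(merged)

queue = [{"location": "/data/1"}, {"location": "/data/2"}]
pairs = [p for meta in spout(queue) for p in map_bolt(meta)]
result = reduce_bolt(pairs)
# result: {"u1": ["g1", "g2", "g3"], "u2": ["g1"]}
```

Note that the metadata messages carry only storage locations; the (much larger) service data is loaded only inside the map step, which is the core idea of the scheme.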
  • the result of the merge processing output by the stream data processing framework may be further processed according to different scene requirements, for example, may be maintained in a persistent memory or may be added to a data queue or the like.
  • FIG. 1 is a flowchart of a data processing method according to an embodiment of the present invention, applied to a stream data processing framework, where the stream data processing framework includes a plurality of map modules and a plurality of reduce modules, and the method includes:
  • the stream data processing framework obtains a service requirement, where the service requirement includes at least one key value.
  • the service requirement here can be understood as the seller's advertising requirement;
  • the key values can include keys, values, or key/value pairs;
  • a key value may be a specific crowd rule, a specific crowd rule set, or a correspondence between a specific crowd rule set and users.
  • the stream data processing framework sequentially fetches a plurality of metadata messages from a message queue for storing metadata messages, where the message queue is determined by the service requirement and the metadata messages correspond one-to-one to service data;
  • each metadata message includes the storage location of the corresponding service data, and the metadata messages are added to the message queue in the order in which the corresponding service data was generated.
  • the message queue here can be determined according to business requirements, and different service requirements can determine different message queues.
  • the message queue mainly contains metadata messages: whenever a piece of service data is generated in the network system, a metadata message identifying that service data is generated correspondingly, and the data volume of a metadata message is generally much smaller than that of the corresponding service data. A metadata message may include the storage location of the corresponding service data, for example, which location of which server the service data is stored in, or which location of the distributed file system.
  • a metadata message may further include the data type of the corresponding service data, used to identify the data category, data features, and the like of the corresponding service data.
  • the stream data processing framework may sequentially fetch according to the order of the metadata messages in the message queue.
  • the order of the message queue generally places newly enqueued metadata messages at the tail of the queue.
  • the number of metadata messages the stream data processing framework fetches at a time may be preset, and may generally be set to a value greater than the framework's per-round processing capability.
  • the stream data processing framework may then distribute the fetched metadata messages to the map modules in multiple batches, thereby reducing how frequently it must fetch metadata messages from the message queue.
  • the embodiment of the present invention therefore provides a caching scheme: metadata messages that have been fetched but not yet distributed can be stored in a buffer.
  • before fetching, a step of determining whether the buffer holds metadata messages may be added.
  • before the stream data processing framework sequentially fetches metadata messages from the message queue, the method also includes:
  • the stream data processing framework detects whether the buffer still holds metadata messages, where the buffer stores metadata messages that the stream data processing framework has fetched from the message queue but not yet distributed to the map modules;
  • if so, the stream data processing framework distributes the metadata messages from the buffer to the plurality of map modules;
  • that is, when distributing metadata messages, the stream data processing framework preferentially distributes the metadata messages stored in the buffer;
  • in this way the service data can be processed as much as possible in the order of the message queue, preserving the sequence.
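A minimal sketch of this buffer-first policy (purely illustrative; the function name, batch and fetch sizes are assumptions, not taken from the patent):

```python
from collections import deque

def next_batch(buffer, queue, batch_size, fetch_size):
    """Distribute from the buffer first; only when it is empty, fetch
    fetch_size messages from the message queue and keep the remainder
    buffered, reducing fetch frequency and preserving queue order."""
    if not buffer:
        for _ in range(min(fetch_size, len(queue))):
            buffer.append(queue.popleft())
    return [buffer.popleft() for _ in range(min(batch_size, len(buffer)))]

buffer, queue = deque(), deque(["m1", "m2", "m3", "m4", "m5"])
first = next_batch(buffer, queue, batch_size=2, fetch_size=4)   # ["m1", "m2"]; "m3", "m4" stay buffered
second = next_batch(buffer, queue, batch_size=2, fetch_size=4)  # ["m3", "m4"], served from the buffer
```

The second call never touches the queue, which is exactly the "preferentially distribute from the buffer" behavior described above.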
  • an embodiment of the present invention provides a method for determining the next fetch location by recording the processing state of the metadata messages.
  • the stream data processing framework records a processing status for the plurality of metadata messages, the processing status including a success status and a failure status: if the service data corresponding to a metadata message is output as a merge processing result, the processing status of that metadata message is the success status; if the service data corresponding to a metadata message cannot be output as a merge processing result within a predetermined time, the processing status of that metadata message is the failure status.
  • the stream data processing framework updates the location information in the location register according to the processing statuses, the location information in the location register being the location at which the stream data processing framework next fetches metadata messages from the message queue: if any of the plurality of metadata messages has the failure status, the location information in the location register is the location of the first metadata message, the first metadata message being the failed metadata message closest to the head of the message queue; if none of the plurality of metadata messages has the failure status, the location information in the location register is the location of the second metadata message, the second metadata message being the metadata message of the plurality closest to the tail of the message queue.
  • the location register can be a persistent storage unit, for example, a similar storage service can be provided using Zookeeper (a distributed, open source application coordination service).
  • from the moment a metadata message begins to be distributed to the map modules (for example, S103) until the service data corresponding to it is processed and output by the stream data processing framework (S105), if any step is unsuccessful, the processing status of that metadata message can be understood as the failure status.
  • for example, the map module that receives the metadata message may fail to obtain the corresponding service data from the storage location carried in the metadata message, the map module may fail to process the service data corresponding to the metadata message, or the stream data processing framework may fail to distribute the service data processed by the map module (the service data corresponding to the metadata message) to the reduce modules, and so on.
  • failure may also include processing timeouts, for example when one processing step in the above process does not respond for a long time.
  • the stream data processing framework can confirm whether the processing status of a metadata message is the success status or the failure status by receiving, for that metadata message, an acknowledgement message (abbreviation: ACK) or a failure message (English: Fail).
  • if a metadata message remains in the failure status after multiple attempts, the metadata message can be discarded to improve processing efficiency: no further processing is performed on it, and the location register continues to be updated with the location information of other metadata messages.
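The location-register update rule above can be sketched as a small hypothetical function (illustrative only; the function name is an assumption, and "the position of the message closest to the queue tail" is read here as "one past the last processed message", which is one plausible interpretation):

```python
def next_fetch_position(statuses):
    """Given the last fetched batch as (position, status) pairs ordered
    from queue head to queue tail, return the position to store in the
    location register: the failed message closest to the queue head if
    any failed, otherwise one past the message closest to the tail."""
    for position, status in statuses:
        if status == "fail":
            return position        # re-fetch starting at the first failure
    return statuses[-1][0] + 1     # all succeeded: continue past the batch

a = next_fetch_position([(10, "ok"), (11, "fail"), (12, "ok")])  # 11
b = next_fetch_position([(10, "ok"), (11, "ok")])                # 12
```

Re-fetching from the earliest failure trades some duplicate work for the guarantee that no message's service data is silently skipped.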
  • the stream data processing framework distributes the fetched metadata messages to the plurality of map modules, so that the plurality of map modules acquire the corresponding service data according to the storage locations in the received metadata messages;
  • the plurality of map modules respectively process the acquired service data according to the at least one key value.
  • the stream data processing framework may distribute a batch of metadata messages evenly across the map modules; for example, with a batch of 10 metadata messages and only 5 map modules, each map module can be assigned 2 metadata messages.
  • the benefit of even distribution is that each map module completes its processing task in a similar amount of time, ready to receive the stream data processing framework's next distribution of metadata messages.
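One simple way to realize the even split described above is round-robin assignment (a hypothetical sketch; the patent does not prescribe a particular partitioning scheme):

```python
def distribute(metadata_messages, num_map_modules):
    """Round-robin a batch across map modules so each module receives a
    near-equal share, e.g. 10 messages over 5 modules = 2 each."""
    shares = [[] for _ in range(num_map_modules)]
    for i, meta in enumerate(metadata_messages):
        shares[i % num_map_modules].append(meta)
    return shares

batch = [f"m{i}" for i in range(10)]
shares = distribute(batch, 5)
# each of the 5 map modules receives exactly 2 metadata messages
```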
  • the service data can be stored in a distributed file system, and the map module can obtain the corresponding service data from the corresponding storage location in the distributed file system according to the storage location in the obtained metadata message.
  • the map module determines, from the service data, the data relationships related to the key value according to the key value in the service requirement.
  • for example, if the key value of the service requirement is the correspondence between user (u1) and user sets (g1, g2, g3...), then through processing the map module can determine u1->g1, u1->g2, ... and other data relationships related to the key value from the service data.
  • the stream data processing framework distributes the service data processed by the plurality of map modules to the plurality of reduce modules, so that the reduce modules merge the processed service data according to the at least one key value to obtain a merge processing result.
  • the stream data processing framework distributes the service data processed by the plurality of map modules to the plurality of reduce modules in a manner similar to the Shuffle process in MapReduce;
  • for example, processed service data with the same key value can be sent to the same reduce module for processing.
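The "same key to the same reduce module" routing is commonly done by hashing the key (a generic sketch of that technique, not the patent's stated mechanism):

```python
def shuffle(pairs, num_reduce_modules):
    """Route every (key, value) pair with the same key to the same
    reduce module by hashing the key, as in MapReduce's Shuffle."""
    partitions = [[] for _ in range(num_reduce_modules)]
    for key, value in pairs:
        partitions[hash(key) % num_reduce_modules].append((key, value))
    return partitions

pairs = [("u1", "g1"), ("u2", "g4"), ("u1", "g2")]
partitions = shuffle(pairs, num_reduce_modules=3)
# all "u1" records land in one partition, all "u2" records in another
```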
  • the stream data processing framework distributing the service data processed by the plurality of map modules to the plurality of reduce modules further includes:
  • if the volume of the processed service data is greater than a preset transmission amount, the stream data processing framework partially merges the processed service data before distributing it to the reduce modules, so as to compress the data volume of the processed service data.
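This partial merge plays the role of a combiner in classic MapReduce: map output sharing a key is collapsed locally so fewer records cross the wire. A hypothetical sketch (the threshold check and names are assumptions):

```python
from collections import defaultdict

def partial_merge(pairs):
    """Combine map output locally before sending it to the reduce
    modules, shrinking the transmitted data volume."""
    merged = defaultdict(list)
    for key, value in pairs:
        merged[key].append(value)
    return [(k, vs) for k, vs in merged.items()]

pairs = [("u1", "g1"), ("u1", "g2"), ("u2", "g1")]
compressed = partial_merge(pairs)
# 3 records shrink to 2: [("u1", ["g1", "g2"]), ("u2", ["g1"])]
```

In practice such a step would only be applied when the batch exceeds the preset transmission amount, since the merge itself costs CPU time.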
  • the merge processing result of the multiple reduce modules is equivalent to the processing result for the service requirement; the stream data processing framework may output the merge processing result, and the system may then subject the output merge processing result to subsequent processing according to different scenarios and service requirements, for example storing the merge processing result in persistent storage.
  • it can be seen that the traditional MapReduce function is implemented in the stream data processing framework through the map modules and the reduce modules: when a service requirement is received, metadata messages are captured from the determined message queue.
  • the map modules obtain the corresponding service data through the storage locations in the metadata messages, and then perform corresponding processing according to the service requirement.
  • the stream data processing framework distributes the service data processed by the plurality of map modules to the plurality of reduce modules, so that the plurality of reduce modules perform merge processing on the processed service data according to the at least one key value to obtain the merge processing result, and the merge processing result is output.
  • by capturing the metadata messages in the message queue in real time and exploiting the real-time processing characteristics of the stream data processing framework, the data to be processed no longer needs to be pre-planned before processing, and the stream data processing framework implementing the MapReduce function can quickly process the massive data generated in the network in real time.
  • the stream data processing framework may determine the timing of distributing metadata messages to the plurality of map modules by analyzing its remaining processing power.
  • FIG. 2 is a flowchart of a distribution timing determination method provided by an embodiment of the present invention; before the captured metadata messages are distributed, the method further includes:
  • S201 The stream data processing framework calculates the consumed capability in real time, where the consumed capability is the processing capability used by the stream data processing framework to process metadata messages distributed to the plurality of map modules;
  • S202 The stream data processing framework determines whether the difference between the total processing capability of the stream data processing framework and the consumed capability meets a preset threshold; if yes, execute S203.
  • S203 The stream data processing framework distributes the captured metadata message to the plurality of map modules.
  • the distribution rate of the stream data processing framework needs to be carefully controlled. If it is too fast, the number of messages entering the map modules may increase rapidly in a short period of time, and the system resources consumed by the map modules and the reduce modules to process the corresponding service data may exceed the capacity of the stream data processing framework or the system in which it resides; if it is too slow, metadata messages accumulate in the message queue, and real-time performance is reduced.
  • the stream data processing framework calculates the consumed capability in real time.
  • the consumed capability can be understood as the capability currently consumed by the stream data processing framework to perform data processing, and the total processing capability can be understood as the maximum processing capability that the stream data processing framework can provide.
  • both the consumed capability and the total processing capability can be measured by resources such as memory, CPU, and network card.
  • if the difference does not satisfy the preset threshold, the stream data processing framework may wait. After the stream data processing framework finishes processing the current batch of metadata messages, it releases the corresponding processing capability; once the difference satisfies the preset threshold, the stream data processing framework may again distribute the captured metadata messages to the plurality of map modules.
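The gating logic of steps S201–S203 can be sketched as follows. This is a simplified model for illustration; in practice the capability values would be measured from resources such as memory, CPU, and network card.

```python
def can_distribute(total_capability, consumed_capability, threshold):
    # Distribute the next batch of metadata messages only when enough
    # headroom remains: total minus consumed must meet the preset
    # threshold, as in step S202.
    return total_capability - consumed_capability >= threshold

# e.g. 100 units total, 70 consumed, threshold 20: headroom is 30 -> distribute
can_distribute(100, 70, 20)  # True
can_distribute(100, 90, 20)  # False: wait until capability is released
```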
  • FIG. 3 is a structural diagram of a data processing apparatus according to an embodiment of the present disclosure; the apparatus is applied to a stream data processing framework, where the stream data processing framework includes a plurality of map modules and a plurality of reduce modules, and the apparatus includes:
  • the obtaining unit 301 is configured to obtain a service requirement, where the service requirement includes at least one key value.
  • the fetching unit 302 is configured to sequentially capture a plurality of metadata messages from a message queue for storing metadata messages, where the message queue is determined by the service requirement, and the metadata messages are in one-to-one correspondence with the service data.
  • the metadata message includes a storage location of the corresponding service data, and the metadata message is sequentially added to the message queue according to the generation time sequence of the corresponding service data.
  • the embodiment of the present invention provides a caching mechanism: metadata messages that have been captured but not yet distributed can be saved to a buffer.
  • optionally, the device further includes:
  • a detecting unit configured to: before the fetching unit 302 is triggered, detect whether the buffer still holds metadata messages, where the buffer saves metadata messages that the stream data processing framework has captured from the message queue but has not yet distributed to the map modules.
  • if yes, the first distribution unit 303 is triggered; if not, the fetching unit 302 is triggered.
  • the first distribution unit 303 is further configured to distribute the metadata messages from the buffer to the plurality of map modules.
  • when the stream data processing framework distributes metadata messages, the metadata messages stored in the buffer are distributed preferentially.
  • in this way, the business data can be processed in the order of the message queue as far as possible, ensuring the processing sequence.
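The buffer-first distribution could be sketched as follows. The helper and the fetch callback are hypothetical stand-ins for the framework's buffer and its capture from the message queue:

```python
from collections import deque

def next_batch(buffer, fetch_from_queue, batch_size):
    # Drain the local buffer of already-captured metadata messages first;
    # only capture fresh messages from the queue when the buffer is empty,
    # preserving the original message-queue order.
    if buffer:
        n = min(batch_size, len(buffer))
        return [buffer.popleft() for _ in range(n)]
    return fetch_from_queue(batch_size)

buf = deque(["m1", "m2"])
next_batch(buf, lambda n: [f"q{i}" for i in range(n)], 3)  # -> ["m1", "m2"]
next_batch(buf, lambda n: [f"q{i}" for i in range(n)], 3)  # -> ["q0", "q1", "q2"]
```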
  • an embodiment of the present invention provides a method for determining the next capture location by recording the processing states of metadata messages.
  • optionally, the device further includes:
  • a recording unit configured to record the processing statuses of the plurality of metadata messages in the stream data processing framework, where a processing status is either a success status or a failure status: if the service data corresponding to a metadata message is output as a merge processing result, the processing status of that metadata message is the success status; if the service data corresponding to a metadata message cannot be output as a merge processing result within a predetermined time, the processing status of that metadata message is the failure status;
  • an updating unit configured to update the location information in a location register according to the processing statuses, where the location information in the location register is the location from which the stream data processing framework will next capture metadata messages in the message queue. If the plurality of metadata messages include a metadata message with a failure status, the location information in the location register is the location of the first metadata message, where the first metadata message is the failure-status metadata message closest to the head of the message queue; if the plurality of metadata messages include no metadata message with a failure status, the location information in the location register is the location of the second metadata message, where the second metadata message is the metadata message of the plurality that is closest to the tail of the message queue.
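The position-register update rule can be sketched as follows. This is an illustrative reading of the rule, not the patent's implementation: statuses are (position, succeeded) pairs ordered from queue head to tail, and the all-success case is taken as resuming one position past the last processed message.

```python
def next_capture_position(statuses):
    # If any message in the batch failed, resume from the failed message
    # closest to the queue head so it is re-captured and re-processed.
    for pos, succeeded in statuses:
        if not succeeded:
            return pos
    # Otherwise resume just past the message closest to the queue tail.
    return statuses[-1][0] + 1 if statuses else 0

next_capture_position([(5, True), (6, False), (7, True)])  # -> 6
next_capture_position([(5, True), (6, True)])              # -> 7
```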
  • a first distribution unit 303 configured to distribute the captured metadata message to the plurality of map modules, so that the plurality of map modules acquire corresponding service data according to the storage location in the received metadata message;
  • the plurality of map modules respectively process the acquired service data according to the at least one key value.
  • a second distribution unit 304 configured to distribute the service data processed by the plurality of map modules to the plurality of reduce modules, so that the plurality of reduce modules perform merge processing on the processed service data according to the at least one key value to obtain a merge processing result.
  • the second distribution unit is further configured to: if the amount of processed data is greater than a preset transmission amount, partially merge the processed service data before distributing it to the reduce modules, so as to compress the amount of data of the processed service data.
  • the output unit 305 is configured to output the merge processing result.
  • it can be seen that the traditional MapReduce function is implemented in the stream data processing framework through the map modules and the reduce modules: when a service requirement is received, metadata messages are captured from the determined message queue.
  • the map modules obtain the corresponding service data through the storage locations in the metadata messages, and then perform corresponding processing according to the service requirement.
  • the stream data processing framework distributes the service data processed by the plurality of map modules to the plurality of reduce modules, so that the plurality of reduce modules perform merge processing on the processed service data according to the at least one key value to obtain the merge processing result, and the merge processing result is output.
  • by capturing the metadata messages in the message queue in real time and exploiting the real-time processing characteristics of the stream data processing framework, the data to be processed no longer needs to be pre-planned before processing, and the stream data processing framework implementing the MapReduce function can quickly process the massive data generated in the network in real time.
  • the stream data processing framework may determine the timing of distributing metadata messages to the plurality of map modules by analyzing its remaining processing power.
  • optionally, the device further includes:
  • a calculating unit configured to calculate the consumed capability in real time before the first distribution unit 303 is triggered, where the consumed capability is the processing capability used by the stream data processing framework to process metadata messages distributed to the plurality of map modules;
  • a determining unit configured to determine whether the difference between the total processing capability of the stream data processing framework and the consumed capability meets a preset threshold; if yes, the first distribution unit 303 is triggered.
  • if the difference does not satisfy the preset threshold, the stream data processing framework may wait. After the stream data processing framework finishes processing the current batch of metadata messages, it releases the corresponding processing capability; once the difference satisfies the preset threshold, the stream data processing framework may again distribute the captured metadata messages to the plurality of map modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention discloses a data processing method and apparatus, applied to a stream data processing framework. The method includes: the stream data processing framework obtains a service requirement, sequentially captures a plurality of metadata messages from a message queue for storing metadata messages, and distributes the captured metadata messages to a plurality of map modules, so that the plurality of map modules acquire the corresponding service data according to the storage locations in the received metadata messages and process the acquired service data accordingly; the stream data processing framework then distributes the service data processed by the plurality of map modules to a plurality of reduce modules, so that the plurality of reduce modules merge the processed service data and output a merge processing result. By capturing metadata messages in the message queue in real time and exploiting the real-time processing characteristics of the stream data processing framework, the stream data processing framework implementing the MapReduce function is used to process massive data generated in real time in a network, and pre-planning is no longer required before processing.
PCT/CN2016/106580 2015-12-01 2016-11-21 Procédé et appareil de traitement de données WO2017092582A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510867351.7 2015-12-01
CN201510867351.7A CN106815254B (zh) 2015-12-01 2015-12-01 一种数据处理方法和装置

Publications (1)

Publication Number Publication Date
WO2017092582A1 true WO2017092582A1 (fr) 2017-06-08

Family

ID=58796355

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/106580 WO2017092582A1 (fr) 2015-12-01 2016-11-21 Procédé et appareil de traitement de données

Country Status (2)

Country Link
CN (1) CN106815254B (fr)
WO (1) WO2017092582A1 (fr)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143415A (zh) * 2019-12-26 2020-05-12 政采云有限公司 一种数据处理方法、装置和计算机可读存储介质
CN111339071A (zh) * 2020-02-21 2020-06-26 苏宁云计算有限公司 一种多源异构数据的处理方法及装置
CN111400390A (zh) * 2020-04-08 2020-07-10 上海东普信息科技有限公司 数据处理方法及装置
CN111400059A (zh) * 2020-03-09 2020-07-10 五八有限公司 一种数据处理方法以及数据处理装置
CN113034178A (zh) * 2021-03-15 2021-06-25 深圳市麦谷科技有限公司 多***积分计算方法、装置、终端设备和存储介质
CN113609202A (zh) * 2021-08-11 2021-11-05 湖南快乐阳光互动娱乐传媒有限公司 数据处理方法及装置

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108009849B (zh) * 2017-11-30 2021-12-17 北京小度互娱科技有限公司 生成账号状态的方法以及生成账号状态的装置
CN107872465A (zh) * 2017-12-05 2018-04-03 全球能源互联网研究院有限公司 一种分布式网络安全监测方法及***
CN112667411B (zh) * 2019-10-16 2022-12-13 中移(苏州)软件技术有限公司 一种数据处理的方法、装置、电子设备和计算机存储介质
CN113360463B (zh) * 2021-04-15 2024-07-05 网宿科技股份有限公司 数据处理方法、装置、服务器和可读存储介质

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530182A (zh) * 2013-10-22 2014-01-22 海南大学 一种作业调度方法和装置
CN104903894A (zh) * 2013-01-07 2015-09-09 脸谱公司 用于分布式数据库查询引擎的***和方法
CN104951509A (zh) * 2015-05-25 2015-09-30 中国科学院信息工程研究所 一种大数据在线交互式查询方法及***

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102377824A (zh) * 2011-10-19 2012-03-14 江西省南城县网信电子有限公司 一种基于云计算的空间信息服务***
CN102521232B (zh) * 2011-11-09 2014-05-07 Ut斯达康通讯有限公司 一种互联网元数据的分布式采集处理***及方法
US9639575B2 (en) * 2012-03-30 2017-05-02 Khalifa University Of Science, Technology And Research Method and system for processing data queries

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104903894A (zh) * 2013-01-07 2015-09-09 脸谱公司 用于分布式数据库查询引擎的***和方法
CN103530182A (zh) * 2013-10-22 2014-01-22 海南大学 一种作业调度方法和装置
CN104951509A (zh) * 2015-05-25 2015-09-30 中国科学院信息工程研究所 一种大数据在线交互式查询方法及***

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143415A (zh) * 2019-12-26 2020-05-12 政采云有限公司 一种数据处理方法、装置和计算机可读存储介质
CN111143415B (zh) * 2019-12-26 2023-12-29 政采云有限公司 一种数据处理方法、装置和计算机可读存储介质
CN111339071A (zh) * 2020-02-21 2020-06-26 苏宁云计算有限公司 一种多源异构数据的处理方法及装置
CN111339071B (zh) * 2020-02-21 2022-11-18 苏宁云计算有限公司 一种多源异构数据的处理方法及装置
CN111400059A (zh) * 2020-03-09 2020-07-10 五八有限公司 一种数据处理方法以及数据处理装置
CN111400059B (zh) * 2020-03-09 2023-11-14 五八有限公司 一种数据处理方法以及数据处理装置
CN111400390A (zh) * 2020-04-08 2020-07-10 上海东普信息科技有限公司 数据处理方法及装置
CN111400390B (zh) * 2020-04-08 2023-11-17 上海东普信息科技有限公司 数据处理方法及装置
CN113034178A (zh) * 2021-03-15 2021-06-25 深圳市麦谷科技有限公司 多***积分计算方法、装置、终端设备和存储介质
CN113609202A (zh) * 2021-08-11 2021-11-05 湖南快乐阳光互动娱乐传媒有限公司 数据处理方法及装置

Also Published As

Publication number Publication date
CN106815254A (zh) 2017-06-09
CN106815254B (zh) 2020-08-14

Similar Documents

Publication Publication Date Title
WO2017092582A1 (fr) Procédé et appareil de traitement de données
US11411825B2 (en) In intelligent autoscale of services
CN111124819B (zh) 全链路监控的方法和装置
CN113037823B (zh) 消息传递***和方法
JP6030144B2 (ja) 分散データストリーム処理の方法及びシステム
US8935395B2 (en) Correlation of distributed business transactions
US11902173B2 (en) Dynamic allocation of network resources using external inputs
WO2020258290A1 (fr) Procédé de collecte de données de journal, appareil de collecte de données de journal, support d'informations et système de collecte de données de journal
KR20140072044A (ko) 다중-소스 푸시 통지를 다수의 타겟들로의 분배 기법
US20120072575A1 (en) Methods and computer program products for aggregating network application performance metrics by process pool
US11570078B2 (en) Collecting route-based traffic metrics in a service-oriented system
JP2012043409A (ja) データ・ストリームを処理するためのコンピュータ実装方法、システム及びコンピュータ・プログラム
US11113244B1 (en) Integrated data pipeline
CN111586126A (zh) 小程序预下载方法、装置、设备及存储介质
US20140280610A1 (en) Identification of users for initiating information spreading in a social network
US20180081894A1 (en) Method and apparatus for clearing data in cloud storage system
US20190005534A1 (en) Providing media assets to subscribers of a messaging system
CN106126519A (zh) 媒体信息的展示方法及服务器
CN111639902A (zh) 基于kafka的数据审核方法、控制装置及计算机设备、存储介质
CN110245120B (zh) 流式计算***及流式计算***的日志数据处理方法
CN107480189A (zh) 一种多维度实时分析***及方法
US11811894B2 (en) Reduction of data transmissions based on end-user context
US20220231980A1 (en) Enhancing a social media post with content that is relevant to the audience of the post
CN116048846A (zh) 数据传输方法、装置、设备和存储介质
Racka Apache Nifi As A Tool For Stream Processing Of Measurement Data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16869892

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16869892

Country of ref document: EP

Kind code of ref document: A1