CN111400370A - Data monitoring method and device in data circulation, storage medium and server - Google Patents

Data monitoring method and device in data circulation, storage medium and server

Info

Publication number
CN111400370A
Authority
CN
China
Prior art keywords
data
demanders
demander
query
duplication
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010153378.0A
Other languages
Chinese (zh)
Inventor
汤奇峰
周伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Data Exchange Corp
Original Assignee
Shanghai Data Exchange Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Data Exchange Corp
Priority to CN202010153378.0A
Publication of CN111400370A
Legal status: Withdrawn

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462Approximate or statistical queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/04Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Business, Economics & Management (AREA)
  • Finance (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Accounting & Taxation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Technology Law (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Quality & Reliability (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Provided are a data monitoring method and device in data circulation, a storage medium and a server. The data monitoring method comprises the following steps: receiving data from each of a plurality of demanders, the data comprising data identifiers; counting all data identifiers of the plurality of demanders to obtain a total query data volume; filtering out the repeated data identifiers among all data identifiers of the plurality of demanders to obtain deduplicated data identifiers; counting the deduplicated data identifiers to obtain a deduplicated query data volume; and judging whether the data is abnormal according to the total query data volume and the deduplicated query data volume. The technical scheme provided by the invention can monitor the data relay distribution process and helps optimize data circulation.

Description

Data monitoring method and device in data circulation, storage medium and server
Technical Field
The invention relates to the technical field of big data, in particular to a data monitoring method and device in data circulation, a storage medium and a server.
Background
With the rapid development of internet technology, industries have accumulated massive amounts of data, and the types and volumes of data that different industries require keep growing. The volume of data distributed is huge, reaching the hundreds of millions, and data circulation has become an inevitable trend.
In trading product orders for the Chinese audio and video (CAP) product on a data trading platform, a special data distribution mode exists: relay distribution. This mode arose because, as orders increased, rough statistics on the data queried by multiple demanders showed that the repetition rate of query tags of the same type is very high. By coordinating with each demander, the data transaction platform established a special distribution mode based on aggregated purchase and distribution, which is called relay distribution.
The whole distribution process comprises two stages. In the first stage, different demanders place orders with a pseudo supplier on a data learning and sharing platform (DLS for short); the pseudo supplier receives the order data and sends it to the pseudo demander, and the pseudo demander places orders with an actual supplier. This stage is called the upstream stage. In the second stage, the actual supplier processes the data and returns the processed data to the pseudo demander; the pseudo demander splits the processed data and returns it to the pseudo supplier, and the pseudo supplier transmits it to the actual demanders. This stage is called the downstream stage.
The above relay distribution flow has at least the following problems. (1) The tags that each demander needs to query are themselves heavily duplicated. Each demander's own order data contains a large amount of duplicate data when it is transmitted to the pseudo supplier, and the data attributes purchased for the same data identifier (also called a tag) are heavily duplicated. In general, within the same time interval, a supplier able to provide these data attributes returns the same data attributes to different demanders. (2) There are numerous duplicate tags across multiple demanders. The large number of duplicate tags in orders of the same category from different demanders wastes considerable transmission resources and increases time costs.
Disclosure of Invention
The technical problem solved by the invention is how to monitor the data relay distribution process so as to optimize data circulation.
To solve the foregoing technical problem, an embodiment of the present invention provides a data monitoring method in data circulation, including: receiving data from each of a plurality of demanders, the data comprising data identifiers; counting all data identifiers of the plurality of demanders to obtain a total query data volume; filtering out the repeated data identifiers among all data identifiers of the plurality of demanders to obtain deduplicated data identifiers; counting the deduplicated data identifiers to obtain a deduplicated query data volume; and judging whether the data is abnormal according to the total query data volume and the deduplicated query data volume.
Optionally, counting the deduplicated data identifiers includes: for each demander of the plurality of demanders, counting the number of the demander's deduplicated data identifiers; and calculating the data repetition rate of the demander, where the data repetition rate equals the ratio of the demander's amount of removed duplicate data to the number of data identifiers the demander queried, and the amount of removed duplicate data equals the difference between the number of data identifiers the demander queried and the number of the demander's deduplicated data identifiers.
Optionally, before filtering out the repeated data identifiers among all data identifiers of the plurality of demanders, the data monitoring method further includes: for any two demanders of the plurality of demanders, counting the data repetition quantity of the two demanders, where the data repetition quantity equals the number of data identifiers in the intersection of the two demanders' deduplicated data identifiers.
Optionally, the data monitoring method further includes: sending the deduplicated data identifiers to a supplier, and receiving the data query results returned by the supplier; counting the data query results, and calculating the data matching rate of each demander, where the data matching rate equals the ratio of the volume of the demander's data query results to the number of the demander's data identifiers, and the deduplicated data identifiers of each demander are obtained by filtering out identical data identifiers among all data identifiers of that demander; the data query results of a demander are obtained by splitting and integrating the data query results returned by the supplier.
Optionally, the data query result associated with a data identifier is subject to a preset business rule, and the data monitoring method further includes: selecting, from the data query results returned by the supplier, the data query results that do not satisfy the preset business rule to obtain selected data; and counting the volume of the selected data.
Optionally, the data monitoring method further includes: before counting all received data identifiers of the plurality of demanders, storing all data identifiers of the plurality of demanders in a data warehouse; and/or, after receiving the data query results returned by the supplier, storing the data query results in a data warehouse.
Optionally, counting the deduplicated data identifiers includes: counting the deduplicated data identifiers using the Spark compute engine.
Optionally, the data monitoring method further includes: recording each statistical result in a Cassandra database.
In order to solve the above technical problem, an embodiment of the present invention further provides a data monitoring apparatus in data circulation, including: a receiving module, configured to receive data from each of a plurality of demanders, the data comprising data identifiers; a first statistical module, configured to count all data identifiers of the plurality of demanders to obtain a total query data volume; a deduplication module, configured to filter out the repeated data identifiers among all data identifiers of the plurality of demanders to obtain deduplicated data identifiers; a second statistical module, configured to count the deduplicated data identifiers to obtain a deduplicated query data volume; and a judging module, configured to judge whether the data is abnormal according to the total query data volume and the deduplicated query data volume.
In order to solve the above technical problem, an embodiment of the present invention further provides a storage medium on which computer instructions are stored, where the steps of the above data monitoring method are performed when the computer instructions are executed.
In order to solve the foregoing technical problem, an embodiment of the present invention further provides a server, including a memory and a processor, where the memory stores computer instructions executable on the processor, and the processor performs the steps of the above data monitoring method when executing the computer instructions.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
The embodiment of the invention provides a data monitoring method in data circulation, which comprises the following steps: receiving data from each of a plurality of demanders, the data comprising data identifiers; counting all data identifiers of the plurality of demanders to obtain a total query data volume; filtering out the repeated data identifiers among all data identifiers of the plurality of demanders to obtain deduplicated data identifiers; counting the deduplicated data identifiers to obtain a deduplicated query data volume; and judging whether the data is abnormal according to the total query data volume and the deduplicated query data volume. The embodiment of the invention counts both the total number of the demanders' data identifiers and the number of data identifiers after deduplication. Comparing the statistics before and after deduplication makes it possible to quantify the proportion of repeated data identifiers provided by the demanders and to make a preliminary assessment of whether the data is abnormal. For example, when the repetition proportion of the data identifiers of a single demander is too high, the data of that demander can be judged abnormal, and the demander needs to be reminded to filter out repeated data before sending it, which reduces data transmission time. The technical scheme provided by the embodiment of the invention helps improve the quality of the data that demanders submit for query, optimize the data relay distribution process, and improve data transaction quality and data processing speed.
Further, the method also includes: sending the deduplicated data identifiers to a supplier, and receiving the data query results returned by the supplier; counting the data query results, and calculating the data matching rate of each demander, where the data matching rate equals the ratio of the volume of the demander's data query results to the number of the demander's data identifiers; the data query results of a demander are obtained by splitting and integrating the data query results returned by the supplier. This technical scheme can check the data query results fed back by the supplier, and the counted data matching rate makes it possible to evaluate the quality of the supplier's data query results, giving demanders a feasible basis for selecting a suitable supplier.
Further, the data query result associated with a data identifier is subject to a preset business rule, and the data monitoring method further includes: selecting, from the data query results returned by the supplier, the data query results that do not satisfy the preset business rule to obtain selected data; and counting the volume of the selected data. In the embodiment of the invention, a preset business rule is set and the data query results fed back by the supplier are evaluated against it, which encourages the supplier to return data query results of higher quality.
Drawings
FIG. 1 is a flow chart of a data monitoring method in data circulation according to an embodiment of the present invention;
FIG. 2 is a simplified schematic diagram of a data circulation system architecture according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a data monitoring apparatus in data circulation according to an embodiment of the present invention.
Detailed Description
As noted in the background, in the prior art there are usually a large number of duplicate tags within the data of a single demander or across different demanders, which consumes significant resources and time.
First, the tags that each demander needs to query are themselves heavily duplicated, and at present no statistical task records the repeated data, so a large amount of repeated data occupies the physical resources used for transmission.
Second, there are a large number of duplicate tags across the various demanders. The existing business processing flow deduplicates the final result as a whole, but no statistical flow performs intersection statistics on the data between every two demanders or among multiple demanders.
Third, the total amount of data sent and received by the pseudo supplier is not counted. No unified report covers the total amount of data sent by the pseudo supplier to the actual supplier and the total amount of data returned by the actual supplier. Although the data can be looked up on the front-end processor each time, the procedure is cumbersome and the figures cannot be displayed visually. Moreover, the data size of each batch cannot be seen in time, so no preparation can be made for unexpected data transmission volumes.
Fourth, the matching rate of the data returned by the actual supplier is not counted. In the existing business process, after the actual supplier returns data, the pseudo demander processes and splits the data, and the pseudo supplier finally delivers it to the actual demanders, but the actual data matching rate is not counted, so the data quality check is missing.
In addition, the quality of the data returned by the supplier may fall short of the agreed standard, and by agreement between the business parties, data that does not meet the quality standard may be excluded from the total. The pseudo demander in the current process does not compute this indicator.
To solve the foregoing technical problem, an embodiment of the present invention provides a data monitoring method in data circulation, including: receiving data from each of a plurality of demanders, the data comprising data identifiers; counting all data identifiers of the plurality of demanders to obtain a total query data volume; filtering out the repeated data identifiers among all data identifiers of the plurality of demanders to obtain deduplicated data identifiers; counting the deduplicated data identifiers to obtain a deduplicated query data volume; and judging whether the data is abnormal according to the total query data volume and the deduplicated query data volume.
The embodiment of the invention counts both the total number of the demanders' data identifiers and the number of data identifiers after deduplication. Comparing the statistics before and after deduplication makes it possible to quantify the proportion of repeated data identifiers provided by the demanders and to make a preliminary assessment of whether a demander's data is abnormal. For example, when the repetition proportion of the data identifiers of a single demander is too high, the data of that demander can be judged abnormal, and the demander needs to be reminded to filter out repeated data before sending it, which reduces data transmission time. The technical scheme provided by the embodiment of the invention helps improve the quality of the data that demanders submit for query, optimize the data relay distribution process, and improve data transaction quality and data processing speed.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
MapReduce, as used herein, is a programming model for parallel operations on large-scale datasets. Its main ideas are mapping (Map) and reduction (Reduce), and it borrows features from vector programming languages.
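To make the Map and Reduce ideas concrete, the following is a minimal single-process sketch in Python that counts how often each data identifier occurs. It only illustrates the programming model; it does not use the Hadoop MapReduce API, and the identifier values and the `reducer` helper are hypothetical.

```python
from functools import reduce
from typing import Dict, Tuple

# Hypothetical data identifiers submitted in one batch.
ids = ["id1", "id2", "id2", "id3"]

# Map: turn every identifier into a (key, 1) pair.
mapped = map(lambda identifier: (identifier, 1), ids)


def reducer(acc: Dict[str, int], pair: Tuple[str, int]) -> Dict[str, int]:
    """Reduce: fold the pairs into one count per key."""
    key, value = pair
    acc[key] = acc.get(key, 0) + value
    return acc


counts = reduce(reducer, mapped, {})
print(counts)  # {'id1': 1, 'id2': 2, 'id3': 1}
```

In a real cluster the map and reduce phases would run in parallel over partitions of the data; here both run sequentially in one process.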
Hive can map structured data files onto database tables, provide a simple Structured Query Language (SQL) query function, and convert SQL statements into MapReduce tasks for execution.
Hive also provides a series of tools for data extraction, transformation and loading (Extract, Transform, Load; ETL for short). It is a mechanism for storing, querying and analyzing large-scale data stored in the Hadoop Distributed File System.
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS relaxes some Portable Operating System Interface (POSIX) constraints to enable streaming access to file system data.
HDFS was originally developed as infrastructure for the Apache Nutch search engine project and is part of the Apache Hadoop Core project. It has much in common with existing distributed file systems, but its differences from them are also significant. HDFS is a highly fault-tolerant system suitable for deployment on inexpensive machines. HDFS provides high-throughput data access and is well suited to applications with large-scale datasets.
Spark, a compute engine, is an open-source cluster computing environment similar to Hadoop, but differences between the two make Spark perform better on some workloads.
Spark supports in-memory distributed datasets, so in addition to providing interactive queries it can also optimize iterative workloads. Spark is implemented in the Scala language and uses Scala as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, so Scala can manipulate distributed datasets as easily as local collection objects.
Cassandra is an open-source distributed non-relational (NoSQL) database system, originally developed by Facebook to store simply formatted data such as inboxes. It combines the data model of Google Bigtable with the fully distributed architecture of Amazon Dynamo.
Various exemplary embodiments of the present disclosure are described in detail below with reference to the accompanying drawings. The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods and systems according to various embodiments of the present disclosure. It should be noted that each block in the flowchart or block diagrams may represent a module, a program segment, or a portion of code, which may include one or more executable instructions for implementing the logical function specified in the respective embodiment. It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It should also be noted that the sequence numbers of the respective steps in the flowcharts do not represent a limitation on the execution order of the respective steps.
Fig. 1 is a schematic flowchart of a data monitoring method in data circulation according to an embodiment of the present invention. The data monitoring method can be executed by a data transaction platform, and the data transaction platform can be a server cluster formed by a plurality of servers. Specifically, the data monitoring method may include the steps of:
Step S101: receiving data from each of a plurality of demanders, wherein the data comprises data identifiers;
Step S102: counting all data identifiers of the plurality of demanders to obtain a total query data volume;
Step S103: filtering out the repeated data identifiers among all data identifiers of the plurality of demanders to obtain deduplicated data identifiers;
Step S104: counting the deduplicated data identifiers to obtain a deduplicated query data volume;
Step S105: judging whether the data is abnormal according to the total query data volume and the deduplicated query data volume.
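Steps S102 to S104 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the batch is held in memory as a mapping from a demander name to its list of data identifiers, and the function name `count_queries`, the demander names and the identifier values are hypothetical. In the embodiment described here the counting may instead be done with Spark over data stored in a data warehouse.

```python
from typing import Dict, List, Set, Tuple


def count_queries(batch: Dict[str, List[str]]) -> Tuple[int, int, Set[str]]:
    """Return (total query volume, deduplicated query volume, deduplicated identifiers)."""
    # S102: count every identifier submitted by every demander, duplicates included.
    total = sum(len(ids) for ids in batch.values())
    # S103: filter out repeated identifiers across the demanders as a whole.
    unique_ids: Set[str] = set()
    for ids in batch.values():
        unique_ids.update(ids)
    # S104: count the identifiers that remain after deduplication.
    return total, len(unique_ids), unique_ids


# Hypothetical batch received in step S101.
batch = {
    "demander_a": ["id1", "id2", "id2", "id3"],
    "demander_b": ["id2", "id3", "id4"],
}
total, deduped, unique_ids = count_queries(batch)
print(total, deduped)  # 7 4
```

The gap between the two counts (here 7 versus 4) is the input to the abnormality judgment of step S105.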
More specifically, in step S101, the data transaction platform may receive data of each of a plurality of demanders and store all data identifications of the plurality of demanders in a data repository. The data may include a data identification for data queries.
In step S102, the data transaction platform may count all data identifiers of the plurality of demanders to obtain the total query data volume of one batch. In general, different batches begin transmission at different times, so their statistics also start at different times.
In step S103, the data transaction platform may filter out identical data identifiers among all data identifiers of the plurality of demanders to obtain the deduplicated data identifiers, where the deduplication is performed over the plurality of demanders as a whole.
In a specific implementation, for each demander, the data transaction platform may deduplicate that demander's data identifiers to obtain the demander's deduplicated data identifiers.
Then, for any two demanders of the plurality of demanders, the data transaction platform may deduplicate the data identifiers between the two, thereby obtaining the deduplicated data identifiers between any two demanders and, finally, the deduplicated data identifiers of the plurality of demanders.
In practical application, if the statistics show that the proportion of repeated data identifiers between two demanders is too large, the data identifiers of those two demanders can be deduplicated first, which can greatly reduce the time spent on repeated processing. Further, in step S104, the data transaction platform may count the deduplicated data identifiers of the plurality of demanders to obtain the deduplicated query data volume.
In a non-limiting example, the data transaction platform may use Spark to count the deduplicated data identifiers.
In a specific implementation, the data transaction platform may further count, for each demander, the number of deduplicated data identifiers and the amount of removed duplicate data. The amount of removed duplicate data is the difference between the number of the demander's data identifiers before deduplication and the number after deduplication.
The statistics can then be recorded in the Cassandra database.
Further, the data transaction platform may calculate the data repetition rate of each demander and record the result in the Cassandra database. In a specific implementation, the data repetition rate equals the ratio of the demander's amount of removed duplicate data to the number of the demander's data identifiers before deduplication. The deduplicated data identifiers of each demander are obtained by filtering out identical data identifiers among all data identifiers of that demander.
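The per-demander statistics described above reduce to simple arithmetic. The following Python sketch computes them for one demander; the names `RepetitionStats` and `repetition_stats` are hypothetical, and returning a rate of 0.0 for an empty identifier list is an assumption made here to avoid division by zero.

```python
from typing import List, NamedTuple


class RepetitionStats(NamedTuple):
    queried: int       # number of identifiers before deduplication
    deduplicated: int  # number of identifiers after deduplication
    removed: int       # amount of removed duplicate data: queried - deduplicated
    rate: float        # data repetition rate: removed / queried


def repetition_stats(ids: List[str]) -> RepetitionStats:
    """Compute the deduplication statistics of a single demander."""
    queried = len(ids)
    deduplicated = len(set(ids))
    removed = queried - deduplicated
    # Assumption: an empty submission is reported with a rate of 0.0.
    rate = removed / queried if queried else 0.0
    return RepetitionStats(queried, deduplicated, removed, rate)


stats = repetition_stats(["id1", "id2", "id2", "id3"])
print(stats.removed, stats.rate)  # 1 0.25
```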
In a specific implementation, the data transaction platform may count the number of data repetitions of any two demanders of the plurality of demanders, where the number of data repetitions is equal to the number of data corresponding to an intersection of the deduplicated data identifiers between the any two demanders.
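As an illustration of this pairwise statistic, the sketch below deduplicates each demander's identifiers and then counts the size of the intersection for every pair of demanders. The function name `pairwise_repetition` and the sample data are hypothetical. Enumerating all pairs grows quadratically with the number of demanders, which is acceptable for a sketch but may need a different approach at scale.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def pairwise_repetition(batch: Dict[str, List[str]]) -> Dict[Tuple[str, str], int]:
    """Count the identifiers shared by every pair of demanders."""
    # Deduplicate within each demander first, as the text describes.
    dedup = {name: set(ids) for name, ids in batch.items()}
    # The data repetition quantity is the size of the intersection of the two sets.
    return {
        (a, b): len(dedup[a] & dedup[b])
        for a, b in combinations(sorted(dedup), 2)
    }


batch_pairs = {
    "demander_a": ["id1", "id2", "id2", "id3"],
    "demander_b": ["id2", "id3", "id4"],
    "demander_c": ["id5"],
}
overlaps = pairwise_repetition(batch_pairs)
print(overlaps[("demander_a", "demander_b")])  # 2
```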
In step S105, the data transaction platform may determine whether the data is abnormal according to the total query data amount and the query data amount after deduplication. In one embodiment, the determining whether the data is abnormal refers to determining whether a repetition rate of the data is greater than a first preset threshold, and when the repetition rate of the data is greater than the first preset threshold, it may be determined that the data is abnormal; or, determining whether the data is abnormal refers to determining whether the repetition rate of the data is less than a second preset threshold, and when the repetition rate of the data is less than the second preset threshold, it may be determined that the data is not abnormal. Wherein the first preset threshold may be greater than the second preset threshold.
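A minimal sketch of this two-threshold judgment follows. Several points are assumptions rather than statements of the source: the batch-level repetition rate is taken to be (total - deduplicated) / total by analogy with the per-demander definition, the threshold values 0.5 and 0.2 are arbitrary placeholders, and a rate that falls between the two thresholds is reported as "undetermined" because the text does not say how that case is handled.

```python
def judge_batch(
    total: int,
    deduped: int,
    first_threshold: float = 0.5,   # hypothetical value
    second_threshold: float = 0.2,  # hypothetical value
) -> str:
    """Classify a batch from its total and deduplicated query data volumes."""
    if total <= 0:
        raise ValueError("total query data volume must be positive")
    # Assumed definition of the batch-level repetition rate.
    rate = (total - deduped) / total
    if rate > first_threshold:
        return "abnormal"
    if rate < second_threshold:
        return "normal"
    # The source does not specify this case; it is left open here.
    return "undetermined"


print(judge_batch(10, 4))  # abnormal
print(judge_batch(10, 9))  # normal
```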
For example, a data repetition rate threshold may be preset for a single demander. When the counted data repetition rate of a demander exceeds this threshold, it may be determined that the demander's data may be abnormal. The demander may then be notified that its data may contain duplicates.
Further, the data transaction platform can also send the deduplicated data identifiers to a supplier and receive the data query results returned by the supplier. The data transaction platform may then store the data query results returned by the supplier in a data warehouse, count the data query results, and record the statistics in a Cassandra database.
In particular implementations, the supplier may be a single supplier or multiple suppliers.
Further, the data transaction platform can split and integrate the data query results according to the data identifiers of each demander to obtain the data query results associated with that demander's data identifiers.
Further, the data transaction platform may calculate a data matching rate for each demander. The data matching rate equals the ratio of the volume of the demander's data query results to the number of the demander's data identifiers. The result of this calculation can then be recorded in the Cassandra database.
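The splitting step and the matching rate can be sketched together in Python. This is a hedged illustration: the supplier's response is modelled as a mapping from data identifier to a dictionary of attributes, the function names `split_results` and `matching_rates` are hypothetical, and the denominator is taken to be the demander's distinct (deduplicated) identifiers, which is one possible reading of the definition above.

```python
from typing import Dict, List

Attributes = Dict[str, object]


def split_results(
    batch: Dict[str, List[str]], supplier_results: Dict[str, Attributes]
) -> Dict[str, Dict[str, Attributes]]:
    """Give each demander the supplier results that match its own identifiers."""
    return {
        name: {i: supplier_results[i] for i in set(ids) if i in supplier_results}
        for name, ids in batch.items()
    }


def matching_rates(
    batch: Dict[str, List[str]], supplier_results: Dict[str, Attributes]
) -> Dict[str, float]:
    """Matching rate = matched results / distinct identifiers (assumed denominator)."""
    per_demander = split_results(batch, supplier_results)
    rates = {}
    for name, ids in batch.items():
        distinct = len(set(ids))
        rates[name] = len(per_demander[name]) / distinct if distinct else 0.0
    return rates


batch_match = {
    "demander_a": ["id1", "id2", "id2", "id3", "id5"],
    "demander_b": ["id3", "id4"],
}
# Hypothetical supplier response: only three identifiers were matched.
supplier_results = {"id1": {"age": 30}, "id3": {"age": 41}, "id4": {"age": 25}}
rates = matching_rates(batch_match, supplier_results)
print(rates)  # {'demander_a': 0.5, 'demander_b': 1.0}
```

A low matching rate for every demander points at the supplier's coverage, while a low rate for a single demander points at that demander's identifiers.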
In a specific implementation, the data transaction platform may define in advance a business rule for data query results.
For example, assume that the data query result requested by a demander for a single data identifier includes 4 data attributes, and the preset business rule requires the data query result returned by the supplier to include at least 2 data attributes; otherwise, the data query result returned by the supplier is determined not to satisfy the preset business rule.
Then, the data transaction platform can select, from the data query results returned by the supplier, the data query results that do not satisfy the preset business rule, and count the volume of the selected results. The statistics can then be recorded in the Cassandra database. Analyzing and counting the selected results makes it possible to evaluate quantitatively the quality of the query results returned by the supplier.
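The example rule (at least 2 of the 4 requested attributes must be returned) can be sketched as follows. The function name `select_rule_violations`, the attribute names and the sample results are hypothetical, and treating an attribute as "returned" only when its value is not `None` is an assumption made for this sketch.

```python
from typing import Dict, Optional

Result = Dict[str, Optional[object]]


def select_rule_violations(
    results: Dict[str, Result], min_attributes: int = 2
) -> Dict[str, Result]:
    """Select the query results that carry fewer than min_attributes values."""
    return {
        identifier: attrs
        for identifier, attrs in results.items()
        # Assumption: an attribute counts as returned when its value is not None.
        if sum(1 for value in attrs.values() if value is not None) < min_attributes
    }


# Hypothetical supplier results; each request asked for 4 attributes.
returned = {
    "id1": {"age": 30, "city": "Shanghai", "income": None, "gender": None},
    "id3": {"age": None, "city": "Beijing", "income": None, "gender": None},
    "id4": {},
}
violations = select_rule_violations(returned)
print(len(violations))  # 2
```

The count of selected results is the figure that would be recorded as the statistic for this rule.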
Fig. 2 is a schematic structural diagram of a simplified data circulation system architecture according to an embodiment of the present invention. Referring to fig. 2, the data circulation system architecture (hereinafter referred to as the system architecture for simplicity) 200 may include: a demander front-end processor 1, a supplier front-end processor 2, a pseudo supplier front-end processor 3, a pseudo-demander front-end processor 4, and an HDFS 5.
In the data circulation process, data (such as query requests or data query results) can circulate among the components in the form of files, so that the data distribution task is completed and data circulation is realized.
In a specific implementation, the demander front-end processor 1, the supplier front-end processor 2, the pseudo supplier front-end processor 3, the pseudo-demander front-end processor 4 and the HDFS 5 can control the data circulation state, record the data circulation information, and so on during the data relay distribution process.
In the following, the circulation of data files between data suppliers and data demanders in the data resource network is described, taking as an example a plurality of data demanders purchasing data from a single data supplier.
Specifically, referring to fig. 2, step S1 is first executed: a plurality of demanders (not shown) send respective data files (the data files include data identifiers and associated data attributes to be queried) to the pseudo supplier front-end processor 3 through respective demander front-end processors 1.
Step S2 is further executed: after the demanders send the data to the pseudo supplier front-end processor 3, all the data lands in a HIVE business incremental table (not shown). After data collection within a preset time range is finished, the business flow triggers deduplication of the data identifiers, and the deduplicated result file is stored in the HDFS 5.
At this point, the first-stage processing is complete, and Spark can be triggered to perform statistics, calculating the data repetition rate of each demander's data, the data repetition rate between every two demanders, the total number of tags actually sent by the pseudo-demander to the supplier, and the like.
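The first-stage statistics can be illustrated with plain Python sets. This is a sketch rather than the Spark job itself: the per-party repetition rate follows the definition in claim 2 (identifiers removed by deduplication divided by identifiers before deduplication), the pairwise figure is the intersection size defined in claim 3, and all function names and sample data are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def repetition_rate(ids: List[str]) -> float:
    """(count before dedup - count after dedup) / count before dedup."""
    if not ids:
        return 0.0  # assumption: an empty request has a rate of 0
    return (len(ids) - len(set(ids))) / len(ids)


def pairwise_duplicates(requests: Dict[str, List[str]]) -> Dict[Tuple[str, str], int]:
    """Size of the intersection of deduplicated identifiers for each pair."""
    return {
        (a, b): len(set(requests[a]) & set(requests[b]))
        for a, b in combinations(sorted(requests), 2)
    }


# Hypothetical requests from three parties.
requests = {
    "A": ["id1", "id2", "id2", "id3"],
    "B": ["id2", "id3", "id4"],
    "C": ["id5"],
}

rates = {party: repetition_rate(ids) for party, ids in requests.items()}
pairs = pairwise_duplicates(requests)
# Number of unique identifiers forwarded to the supplier after deduplication.
total_sent = len(set().union(*requests.values()))
```

Here party A repeats one of its four identifiers (rate 0.25), parties A and B share two identifiers, and five unique identifiers in total would be forwarded to the supplier.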
Step S3 is further executed: the HDFS 5 may send the deduplicated result file to the pseudo-demander front-end processor 4.
Step S4 is further executed: the pseudo-demander front-end processor 4 forwards the deduplicated result file to the supplier front-end processor 2.
Step S5 is further executed: after the deduplicated result file is received at the supplier front-end processor 2, the supplier can retrieve the data and return the data query result to the pseudo-demander front-end processor 4 through the supplier front-end processor 2.
Step S6 is further executed: the pseudo-demander front-end processor 4 sends the returned data query result to the HDFS 5. Specifically, the data query result returned by the supplier front-end processor 2 is stored in the HIVE business incremental table; after data collection is finished, the business flow triggers the task of splitting and reintegrating the data, and the processed query result file is stored in the HDFS 5.
At this point, a downstream Spark statistical task can be triggered to calculate the total amount of data query results returned by the supplier. The HDFS 5 distributes the split data query results to the respective demanders, and the actual matching rate of the data query results returned by the supplier is calculated. Furthermore, statistics can be performed on the data in the HIVE business incremental table: according to the business side's preset business rule, the amount of data that does not conform to the rule is counted and recorded.
Step S7 is further executed: the HDFS5 may transmit the obtained query results of the respective demanders to a specified path of the pseudo supplier front-end processor 3.
Step S8 is further executed: the pseudo-supplier front-end processor 3 sends the data of each demander to the demander front-end processor 1 corresponding to the demander.
It should be noted that all the statistical results in fig. 2 can be stored in the Cassandra database, which makes them easy to modify and allows queries to be answered quickly when the data statistics differ from expectations.
For more details about the operation principle and the operation mode of the system architecture 200 shown in fig. 2, reference may be made to the related description in fig. 1, and details are not repeated here.
Therefore, by the data monitoring method provided by the embodiment of the invention, data relay distribution can be optimized, so that the data transaction quality is improved, the data circulation cost is reduced, and the data processing speed can be improved.
Fig. 3 is a schematic structural diagram of a data monitoring apparatus in data circulation according to an embodiment of the present invention. The data monitoring apparatus 3 in the data circulation (hereinafter referred to as the data monitoring apparatus 3) may be executed by a server, and implement the method shown in fig. 1.
In a specific implementation, the data monitoring apparatus 3 may include: a receiving module 31, configured to receive data of each of the plurality of demanders, where the data includes a data identifier; a first statistical module 32, configured to perform statistics on all data identifiers of the multiple demanders to obtain a total query data volume; the duplicate removal module 33 is configured to filter duplicate data identifiers in all the data identifiers of the multiple demanders to obtain duplicate removed data identifiers; a second statistical module 34, configured to perform statistics on the duplicate-removed data identifier to obtain a query data amount after duplicate removal; and the judging module 35 is configured to judge whether the data is abnormal according to the total query data amount and the query data amount after deduplication.
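The module structure described above can be sketched as a single class whose methods mirror modules 31 to 35. This section does not state the concrete abnormality criterion, so the rule used below (flag the batch when the share of duplicate identifiers exceeds a configurable threshold) is purely a hypothetical stand-in, as are the class, method, and parameter names.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass
class MonitorReport:
    total_query_volume: int
    deduplicated_query_volume: int
    abnormal: bool


class DataMonitor:
    def __init__(self, max_duplicate_ratio: float = 0.5) -> None:
        # Hypothetical threshold; the criterion is not given in this section.
        self.max_duplicate_ratio = max_duplicate_ratio
        self._received: Dict[str, List[str]] = {}

    def receive(self, party: str, data_ids: Iterable[str]) -> None:
        """Receiving module 31: collect identifiers per requesting party."""
        self._received.setdefault(party, []).extend(data_ids)

    def total_query_volume(self) -> int:
        """First statistical module 32: count all received identifiers."""
        return sum(len(ids) for ids in self._received.values())

    def deduplicate(self) -> Set[str]:
        """Deduplication module 33: drop repeated identifiers."""
        unique: Set[str] = set()
        for ids in self._received.values():
            unique.update(ids)
        return unique

    def report(self) -> MonitorReport:
        total = self.total_query_volume()
        deduplicated = len(self.deduplicate())  # second statistical module 34
        ratio = 0.0 if total == 0 else 1 - deduplicated / total
        # Judging module 35 (assumed rule): too many duplicates is abnormal.
        return MonitorReport(total, deduplicated, ratio > self.max_duplicate_ratio)


monitor = DataMonitor(max_duplicate_ratio=0.3)
monitor.receive("A", ["id1", "id2", "id2"])
monitor.receive("B", ["id2", "id3"])
report = monitor.report()
```

In this run, 5 identifiers are received and 3 remain after deduplication, so the duplicate share of 0.4 exceeds the assumed threshold of 0.3 and the batch is flagged.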
For more details of the operation principle and the operation mode of the data monitoring apparatus 3 shown in fig. 3, reference may be made to the related description in fig. 1 and fig. 2, and details are not repeated here.
Further, the embodiment of the present invention also discloses a storage medium, on which computer instructions are stored, and when the computer instructions are executed, the technical solution of the method described in the embodiments shown in fig. 1 and fig. 2 is executed. Preferably, the storage medium may include a computer-readable storage medium such as a non-volatile memory or a non-transitory memory. The computer-readable storage medium may include ROM, RAM, magnetic or optical disks, and the like.
Further, an embodiment of the present invention further discloses a server, which includes a memory and a processor, where the memory stores computer instructions capable of being executed on the processor, and the processor executes the computer instructions to execute the technical solutions of the methods in the embodiments shown in fig. 1 and fig. 2.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (11)

1. A data monitoring method in data circulation is characterized by comprising the following steps:
receiving data of each of a plurality of demanders, the data comprising a data identifier;
counting all data identifications of the plurality of demanders to obtain total query data volume;
filtering the repeated data identifications in all the data identifications of the plurality of demanders to obtain the data identifications after duplication removal;
counting the data identification after the duplication removal to obtain the query data volume after the duplication removal;
and judging whether the data is abnormal or not according to the total query data volume and the query data volume after the duplication removal.
2. The data monitoring method of claim 1, wherein the counting the deduplicated data identifiers comprises:
for each demander in the plurality of demanders, counting the data volume of the data identifiers of the demander after deduplication;
and counting the data repetition rate of the demander, wherein the data repetition rate is equal to the ratio of the data quantity of the de-duplication data of the demander to the data quantity of the data identifier before de-duplication of the demander, and the data quantity of the de-duplication data of the demander is equal to the difference between the data quantity of the data identifier before de-duplication of the demander and the data quantity of the data identifier after de-duplication of the demander.
3. The data monitoring method according to claim 1, wherein before filtering out duplicate data identifiers of all data identifiers of the plurality of demanders, the data monitoring method further comprises:
for any two demanders in the plurality of demanders, counting the data duplication quantity of the any two demanders, wherein the data duplication quantity is equal to the data quantity of the intersection of the data identifications after the duplication is removed between the any two demanders, and the data identification after the duplication is removed of each demander is obtained by filtering the same data identification in all the data identifications of the demanders.
4. The data monitoring method of claim 1, further comprising:
sending the de-duplicated data identification to a supplier, and receiving a data query result returned by the supplier;
counting the data query results, and calculating the data matching rate of each demander, wherein the data matching rate is equal to the ratio of the data volume of the data query result of each demander to the data volume of the data identifier of the demander after the duplication is removed, and the data identifier of each demander is obtained by filtering the same data identifier in all the data identifiers of the demander;
and the data query result of the demander is obtained by splitting and integrating the data query result returned by the supplier.
5. The data monitoring method according to claim 4, wherein the data query result associated with the data identifier has a preset business rule, and the data monitoring method further comprises:
selecting data query results which do not meet the preset business rule from the data query results returned by the supplier to obtain selected data;
and counting the data quantity of the selected data.
6. The data monitoring method of claim 4, further comprising:
before counting all received data identifications of a plurality of demanders, storing all the data identifications of the plurality of demanders in a data warehouse; and/or,
and after receiving the data query result returned by the supplier, storing the data query result returned by the supplier in a data warehouse.
7. The data monitoring method of claim 1, wherein the counting the deduplicated data identifiers comprises:
and counting the data identification after the duplication removal by adopting a calculation engine Spark.
8. The data monitoring method according to any one of claims 1 to 7, further comprising:
recording each statistical result in the Cassandra database.
9. A data monitoring device in data circulation, comprising:
the receiving module is used for receiving data of each of a plurality of demanders, wherein the data comprises a data identifier;
the first statistical module is used for counting all the data identifications of the plurality of demanders to obtain the total query data volume;
the duplicate removal module is used for filtering the duplicate data identifiers in all the data identifiers of the plurality of demanders to obtain the duplicate removed data identifiers;
the second statistical module is used for carrying out statistics on the duplicate-removed data identification to obtain the query data volume after duplicate removal;
and the judging module is used for judging whether the data is abnormal or not according to the total query data volume and the query data volume after the duplication removal.
10. A storage medium having stored thereon computer instructions, characterized in that the computer instructions are operative to perform the steps of the method of any one of claims 1 to 8.
11. A server comprising a memory and a processor, the memory having stored thereon computer instructions executable on the processor, wherein the processor, when executing the computer instructions, performs the steps of the method of any one of claims 1 to 8.
CN202010153378.0A 2020-03-06 2020-03-06 Data monitoring method and device in data circulation, storage medium and server Withdrawn CN111400370A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010153378.0A CN111400370A (en) 2020-03-06 2020-03-06 Data monitoring method and device in data circulation, storage medium and server

Publications (1)

Publication Number Publication Date
CN111400370A true CN111400370A (en) 2020-07-10

Family

ID=71432248

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010153378.0A Withdrawn CN111400370A (en) 2020-03-06 2020-03-06 Data monitoring method and device in data circulation, storage medium and server

Country Status (1)

Country Link
CN (1) CN111400370A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111866164A (en) * 2020-07-29 2020-10-30 钱秀英 Information acquisition system and method for data transmission among communication devices

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559282A (en) * 2013-11-07 2014-02-05 北京国双科技有限公司 Real-time system data reduplication removing method and device
CN107832341A (en) * 2017-10-12 2018-03-23 千寻位置网络有限公司 AGNSS user's duplicate removal statistical method
CN108573026A (en) * 2018-03-14 2018-09-25 上海数据交易中心有限公司 A kind of data circulation method and device, storage medium, server
CN108920581A (en) * 2018-06-25 2018-11-30 上海数据交易中心有限公司 Data circulation method and device, storage medium, server
WO2019100614A1 (en) * 2017-11-22 2019-05-31 平安科技(深圳)有限公司 Buried point data processing method, device, computer device and storage medium
CN110533450A (en) * 2019-07-17 2019-12-03 上海数据交易中心有限公司 Data circulation method and device, storage medium, server

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
EILEEN KUEHN,MAX FISCHER, CHRISTOPHER JUNG, ANDREAS PETZOLD AND: "Monitoring Data Streams at Process Level in Scientific Big Data Batch Clusters", 《2014 IEEE/ACM INTERNATIONAL SYMPOSIUM ON BIG DATA COMPUTING》 *
宋怀明等: "大规模数据密集型***中的去重查询优化", 《计算机研究与发展》 *
胡洋: "ADS-B 数据处理中心的设计与实现", 《网络与信息工程》 *
褚福银等: "基于hadoop平台海量数据的快速查询与实现", 《电脑知识与技术》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111866164A (en) * 2020-07-29 2020-10-30 钱秀英 Information acquisition system and method for data transmission among communication devices
CN111866164B (en) * 2020-07-29 2021-05-07 广州伊智信息科技有限公司 Information acquisition system and method for data transmission among communication devices

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200710