CN109408711B - Data filtering method and device, electronic equipment and storage medium - Google Patents

Data filtering method and device, electronic equipment and storage medium

Info

Publication number
CN109408711B
Authority
CN
China
Prior art keywords
data
identification information
newly added
broadcast variable
broadcast
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811150166.6A
Other languages
Chinese (zh)
Other versions
CN109408711A (en
Inventor
刘万强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Liangxin Technology Co., Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd
Priority to CN201811150166.6A
Publication of CN109408711A
Priority to CA3057038A (CA3057038C)
Application granted
Publication of CN109408711B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

Embodiments of the invention provide a data filtering method, a data filtering device, electronic equipment, and a storage medium, and relate to the technical field of big data. The method comprises: generating a broadcast variable based on the identification information of a plurality of pieces of data in a first data table, and broadcasting the broadcast variable to each working node; extracting identification information of newly added data generated by the working node, and determining whether the identification information of the newly added data exists in the broadcast variable; and, in response to the identification information of the newly added data existing in the broadcast variable, filtering the corresponding newly added data into the resilient distributed dataset (RDD) to be processed. Embodiments of the invention can mitigate the problem that the temporary table occupies a large amount of memory during data filtering, which causes delays and reduces data processing efficiency.

Description

Data filtering method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of big data, in particular to a data filtering method, a data filtering device, electronic equipment and a computer readable storage medium.
Background
With the rapid development of internet technology, the era of big data has arrived. Big data brings massive amounts of real-time data that is updated and iterated constantly, and data filtering technology has emerged in response.
Currently, in the related data filtering technology, a Spark program reads newly added monitoring alarm data from Kafka, the newly added data is converted directly into a temporary table through Spark SQL, a join query is performed between the temporary table and an alarm table in a database, and the result of the join query is inserted into the database. However, converting the newly added data directly into a temporary table occupies a large amount of memory, and the join query between the two data tables generates a large number of read and write operations, so delays often occur and data processing efficiency suffers.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the invention and therefore may include information that does not constitute prior art that is already known to a person of ordinary skill in the art.
Disclosure of Invention
Embodiments of the present invention aim to provide a data filtering method, a data filtering device, an electronic device, and a computer-readable storage medium that overcome, at least to some extent, the problems of heavy memory use and frequent delays during data filtering caused by the limitations and disadvantages of the related art.
Additional features and advantages of the invention will be set forth in the detailed description which follows, or may be learned by practice of the invention.
According to a first aspect of the embodiments of the present invention, there is provided a data filtering method, including: generating a broadcast variable based on the identification information of a plurality of pieces of data in a first data table, and broadcasting the broadcast variable to each working node; extracting identification information of newly added data generated by the working node, and determining whether the identification information of the newly added data exists in the broadcast variable; and, in response to the identification information of the newly added data existing in the broadcast variable, filtering the corresponding newly added data into the resilient distributed dataset (RDD) to be processed.
In some embodiments of the present invention, based on the foregoing solution, generating a broadcast variable based on identification information of a plurality of pieces of data in a first data table includes: acquiring identification information of a plurality of pieces of data in the first data table; taking the identification information of each piece of data as a first keyword, and performing a hash operation on the first keyword to generate a BitSet corresponding to the identification information; and generating a broadcast variable by taking the BitSet as initial data.
In some embodiments of the present invention, based on the foregoing solution, determining whether the identification information of the newly added data exists in the broadcast variable includes: taking the identification information of the newly added data as a second keyword, and performing the hash operation on the second keyword; and judging whether the second keyword exists in the BitSet based on the result of the hash operation.
In some embodiments of the present invention, based on the foregoing solution, the data filtering method further includes: generating a temporary table based on the RDD to be processed, and performing a join query on the temporary table and a second data table.
In some embodiments of the present invention, based on the foregoing solution, generating a temporary table based on the RDD to be processed includes: creating a sub-thread, and converting the RDD to be processed into a DataFrame through the sub-thread; and generating a temporary table based on the DataFrame.
In some embodiments of the present invention, based on the foregoing solution, the first data table is an alarm rule table, and the identification information is a date, an IP address, and an alarm type.
In some embodiments of the present invention, based on the foregoing solution, the hash operation is a MurmurHash operation.
According to a second aspect of the embodiments of the present invention, there is provided a data filtering device, including: a broadcasting unit, configured to generate a broadcast variable based on the identification information of a plurality of pieces of data in a first data table and broadcast the broadcast variable to each working node; a judging unit, configured to extract the identification information of newly added data generated by the working node and determine whether the identification information of the newly added data exists in the broadcast variable; and a filtering unit, configured to filter the corresponding newly added data into the RDD to be processed in response to the identification information of the newly added data existing in the broadcast variable.
According to a third aspect of embodiments of the present invention, there is provided an electronic apparatus, including: a processor; and a memory having computer readable instructions stored thereon, the computer readable instructions when executed by the processor implementing the data filtering method of any of the above embodiments.
According to a fourth aspect of embodiments of the present invention, there is provided a computer readable medium, on which a computer program is stored, which when executed by a processor, implements the data filtering method as described in the above embodiments.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
In the technical solutions provided by some embodiments of the present invention, a broadcast variable is generated according to identification information of a plurality of pieces of data in a first data table and is broadcast to each working node; after newly added data is generated, its identification information is extracted and checked against the broadcast variable, and if the identification information exists in the broadcast variable, the corresponding newly added data is filtered into the resilient distributed dataset (RDD) to be processed. First, because the broadcast variable generated from the identification information of the data in the first data table is broadcast to each working node, all tasks of an executor process can share one copy of the data, which avoids copying a large amount of data. Second, because the newly added data of each working node is filtered based on the broadcast variable and the temporary table is generated from the filtered data, the memory occupied by the temporary table can be reduced. Third, because the join query is performed between the resulting temporary table and the second data table, a large number of read and write operations are avoided, delays are avoided, and data processing efficiency is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
FIG. 1 schematically illustrates a flow diagram of a data filtering method according to some embodiments of the invention;
FIG. 2 schematically illustrates a data filtering flow according to some embodiments of the invention;
FIG. 3 schematically illustrates a block diagram of a data filtering device according to some embodiments of the invention;
FIG. 4 schematically illustrates a structural diagram of a computer system of an electronic device according to some embodiments of the invention;
FIG. 5 schematically illustrates a computer-readable storage medium according to some embodiments of the invention.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations or operations have not been shown or described in detail to avoid obscuring aspects of the invention.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative; they need not include all of the contents and operations/steps, and the operations/steps need not be performed in the order described. For example, some operations/steps may be decomposed, and some may be combined or partially combined, so the actual execution order may change according to the actual situation.
In the present exemplary embodiment, a data filtering method is first provided. FIG. 1 schematically illustrates a flow diagram of a data filtering method according to some embodiments of the present invention. Referring to FIG. 1, the data filtering method includes the following steps:
Step S110, generating a broadcast variable based on the identification information of a plurality of pieces of data in a first data table, and broadcasting the broadcast variable to each working node;
Step S120, extracting the identification information of the newly added data generated by the working node, and determining whether the identification information of the newly added data exists in the broadcast variable;
Step S130, in response to the identification information of the newly added data existing in the broadcast variable, filtering the corresponding newly added data into the resilient distributed dataset (RDD) to be processed.
According to the data filtering method in the exemplary embodiment, first, because the broadcast variable generated from the identification information of the plurality of pieces of data in the first data table is broadcast to each working node, all tasks of an executor process can share one copy of the data, which avoids copying a large amount of data. Second, because the newly added data of each working node is filtered based on the broadcast variable and the temporary table is generated from the filtered data, the memory occupied by the temporary table can be reduced. Third, because the join query is performed between the resulting temporary table and the second data table, a large number of read and write operations are avoided, delays are avoided, and data processing efficiency is improved.
Next, the data filtering method of the present exemplary embodiment will be further explained.
Referring to FIG. 1, in step S110, a broadcast variable is generated based on identification information of a plurality of pieces of data in a first data table and is broadcast to each of the working nodes.
In an exemplary embodiment of the present invention, a plurality of pieces of data in the first data table are acquired, and identification information included in each piece of data is extracted. For example, when the first data table is an alarm rule table, the alarm rule table includes an alarm type field, and the extracted identification information of the data in the first data table may be alarm type information, such as CPU load, excessive memory usage, excessive disk usage, and the like. In addition, the alarm rule table may further include a date field and an IP address field, and the extracted identification information may be date, IP address or alarm type information included in each piece of data.
Further, a first keyword may be generated from the identification information of each piece of data, and a BitSet corresponding to the identification information is generated by performing a hash operation on the first keyword; that is, a Bloom filter is formed. The resulting BitSet is then used as initial data to generate a broadcast variable. A broadcast variable (Broadcast) is a shared variable in Spark; through it, all tasks of an executor process share one copy of the data, which reduces data copying. Using a Bloom filter when generating the broadcast variable also significantly reduces the memory that the broadcast variable occupies.
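As a minimal sketch of this step, the Python below builds a bit array from composite keywords. It is not Spark code: the `|`-joined keyword layout, the filter size `NUM_BITS`, the count `NUM_HASHES`, the sample rows, and the use of `hashlib.blake2b` in place of MurmurHash are all assumptions made for illustration, and the immutable `bytes` value merely plays the role of the data that would be handed to a Spark broadcast variable.

```python
import hashlib

NUM_BITS = 1 << 16   # assumed filter size in bits (a multiple of 8)
NUM_HASHES = 3       # assumed number of hash functions


def make_key(row):
    """Join date, IP address and alarm type into one keyword (assumed layout)."""
    return "|".join((row["date"], row["ip"], row["alarm_type"]))


def bit_positions(key, num_bits=NUM_BITS, num_hashes=NUM_HASHES):
    """Map a keyword to num_hashes bit positions; blake2b stands in for MurmurHash."""
    positions = []
    for i in range(num_hashes):
        digest = hashlib.blake2b(
            key.encode("utf-8"), digest_size=8, salt=bytes([i])
        ).digest()
        positions.append(int.from_bytes(digest, "big") % num_bits)
    return positions


def build_bitset(keys, num_bits=NUM_BITS):
    """Set the bits for every keyword and return an immutable snapshot."""
    bits = bytearray(num_bits // 8)
    for key in keys:
        for pos in bit_positions(key, num_bits):
            bits[pos // 8] |= 1 << (pos % 8)
    return bytes(bits)


# Hypothetical rows of the alarm rule table.
rule_rows = [
    {"date": "2018-09-29", "ip": "10.0.0.1", "alarm_type": "cpu_load"},
    {"date": "2018-09-29", "ip": "10.0.0.2", "alarm_type": "disk_usage"},
]
first_keys = [make_key(row) for row in rule_rows]
bitset = build_bitset(first_keys)
```

With these assumed sizes the snapshot is 8 KiB regardless of how many rows the table holds, which illustrates why a bit array is cheaper to ship to every node than the rows themselves.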
It should be noted that although the alarm rule table is used as an example of the first data table in the present exemplary embodiment, the first data table may also be another suitable data table, such as a monitoring index table or a filtering rule table, which is also within the protection scope of the present invention. Both the historical data and the newly added data in the first data table may be saved in a MySQL database.
In step S120, the identification information of the new data generated by the working node is extracted, and it is determined whether the identification information of the new data exists in the broadcast variable.
In an exemplary embodiment of the present invention, before the newly added data generated by the working node forms the to-be-processed RDD (Resilient Distributed Dataset), a Bloom filter may be used to determine whether the identification information of the newly added data exists in the broadcast variable. For example, the identification information of the newly added data is extracted and used as a second keyword, a corresponding Key is generated by performing the hash operation on the second keyword, and it is then determined whether the Key exists in the broadcast variable. Specifically, the second keyword is used as the input of K hash functions (for example, 3 hash functions) to obtain K array positions, that is, positions in the BitSet. If the bit at any one of these positions is 0, the second keyword is determined not to be in the broadcast variable; if all K bits are 1, the second keyword is treated as present in the broadcast variable.
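The membership rule just described can be sketched in a few lines of Python. The sizes `M` and `K`, the sample keywords, and the salted SHA-256 hashes (used here instead of MurmurHash so that the example needs only the standard library) are assumptions.

```python
import hashlib

M = 1024  # assumed number of bits in the BitSet
K = 3     # number of hash functions, matching the "3 hash functions" example


def positions(keyword, m=M, k=K):
    """Return the k array positions for a keyword (salted SHA-256 as a stand-in)."""
    result = []
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{keyword}".encode("utf-8")).digest()
        result.append(int.from_bytes(digest[:8], "big") % m)
    return result


def might_contain(bits, keyword):
    """False if any of the K positions holds 0; True only if all K hold 1."""
    return all(bits[pos] == 1 for pos in positions(keyword, m=len(bits)))


# Populate the filter with one first keyword, then probe it with second keywords.
bits = [0] * M
for pos in positions("2018-09-29|10.0.0.1|cpu_load"):
    bits[pos] = 1

hit = might_contain(bits, "2018-09-29|10.0.0.1|cpu_load")
empty_hit = might_contain([0] * M, "2018-09-29|10.0.0.9|cpu_load")
```

A Bloom filter never reports a stored keyword as absent, but it can occasionally report an absent keyword as present, so a positive answer means "possibly present" rather than "certainly present".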
In the present exemplary embodiment, the hash operation may be a MurmurHash operation. MurmurHash is a non-cryptographic hash function suited to general hash-based lookup; it distributes keys evenly, has a low collision rate even for complex data, and can be used to implement a Bloom filter.
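For reference, here is a compact pure-Python sketch of the 32-bit MurmurHash3 variant. The patent does not say which MurmurHash variant or seed is used, so both are assumptions here, and the names `murmur3_32` and `bloom_positions` are illustrative.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x, r):
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 (x86, 32-bit) of data, returned as an unsigned 32-bit int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    length = len(data)
    body_end = length - (length % 4)

    # Mix each complete 4-byte little-endian block into the state.
    for i in range(0, body_end, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Mix the remaining 1 to 3 bytes, if any.
    tail = data[body_end:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k

    # Finalization: make every input bit affect every output bit.
    h ^= length
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def bloom_positions(keyword, num_bits, num_hashes=3):
    """Derive several bit positions by varying the seed (one common approach)."""
    raw = keyword.encode("utf-8")
    return [murmur3_32(raw, seed) % num_bits for seed in range(num_hashes)]
```

Varying the seed is only one way to obtain K hash functions from a single algorithm; a production job would more likely rely on an existing, tested MurmurHash implementation than on a hand-written one.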
It should be noted that the hash operation in the exemplary embodiment of the present invention may also be another suitable hash operation, such as a CityHash operation, a SpookyHash operation, or an FNV hash operation, which is not particularly limited in the present invention.
In step S130, in response to the identification information of the newly added data existing in the broadcast variable, the corresponding newly added data is filtered into the resilient distributed dataset (RDD) to be processed.
In an example embodiment of the present invention, if it is determined that the Key generated by the hash operation on the second keyword exists in the broadcast variable, the newly added data corresponding to the second keyword is filtered into the RDD to be processed; if the second keyword is not in the broadcast variable, the newly added data corresponding to the second keyword is simply ignored.
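The keep-or-ignore decision can be sketched as an ordinary Python generator. In a Spark job the same predicate would typically be passed to an RDD `filter` transformation, with the BitSet read from the broadcast variable; here a plain `frozenset` stands in for the broadcast Bloom filter, and the record layout and helper names are hypothetical.

```python
def make_key(record):
    """Build the second keyword from a record (assumed field names)."""
    return "|".join((record["date"], record["ip"], record["alarm_type"]))


def filter_new_data(records, is_known):
    """Yield records whose keyword passes the membership test; ignore the rest."""
    for record in records:
        if is_known(make_key(record)):
            yield record


# Stand-in for the broadcast value: an exact set instead of a Bloom filter.
broadcast_keys = frozenset({"2018-09-29|10.0.0.1|cpu_load"})

new_records = [
    {"date": "2018-09-29", "ip": "10.0.0.1", "alarm_type": "cpu_load", "value": 0.97},
    {"date": "2018-09-29", "ip": "10.0.0.7", "alarm_type": "cpu_load", "value": 0.12},
]
to_process = list(filter_new_data(new_records, broadcast_keys.__contains__))
```

Because the membership test is passed in as a function, the exact set used here could be swapped for a Bloom filter check without changing `filter_new_data`.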
In an exemplary embodiment of the present invention, after the newly added data is filtered into the resilient distributed dataset (RDD) to be processed, a first sub-thread is created, and the RDD to be processed is converted into a DataFrame through the first sub-thread. A DataFrame is structured data organized into named columns, similar to a table in a conventional database, so the data can be processed with SQL. A temporary table is generated from the converted DataFrame, and a join query, that is, a duplicate check, is then performed on the temporary table and a second data table, so that duplicate data can be screened out, the required data can be obtained, and the filtering result is more accurate. A join query extracts data from two tables and combines it into new data when the relevant fields of the two tables satisfy a join condition; for example, when a first data table and a second data table are joined, the query result is the records whose alarm types are the same.
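To make the temporary-table step concrete, here is a sketch that uses Python's built-in `sqlite3` module in place of Spark SQL and MySQL. The table names, columns, and sample rows are hypothetical, and the `LEFT JOIN ... IS NULL` form is one assumed way to express the duplicate check; the patent does not give the actual query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Second data table: alarms that are already stored (hypothetical schema).
conn.execute("CREATE TABLE alarm (date TEXT, ip TEXT, alarm_type TEXT)")
conn.execute("INSERT INTO alarm VALUES ('2018-09-29', '10.0.0.1', 'cpu_load')")

# Temporary table built from the already-filtered new data.
conn.execute("CREATE TEMP TABLE new_alarms (date TEXT, ip TEXT, alarm_type TEXT)")
filtered = [
    ("2018-09-29", "10.0.0.1", "cpu_load"),    # duplicate of a stored alarm
    ("2018-09-29", "10.0.0.2", "disk_usage"),  # not stored yet
]
conn.executemany("INSERT INTO new_alarms VALUES (?, ?, ?)", filtered)

# Join the temporary table with the alarm table and keep only unmatched rows.
to_insert = conn.execute(
    """
    SELECT n.date, n.ip, n.alarm_type
    FROM new_alarms AS n
    LEFT JOIN alarm AS a
      ON a.date = n.date AND a.ip = n.ip AND a.alarm_type = n.alarm_type
    WHERE a.date IS NULL
    """
).fetchall()

# Store the rows that survived the duplicate check.
conn.executemany("INSERT INTO alarm VALUES (?, ?, ?)", to_insert)
conn.commit()
alarm_count = conn.execute("SELECT COUNT(*) FROM alarm").fetchone()[0]
```

Only the second sample row survives the join, so the alarm table ends up with two rows. The smaller the temporary table is after filtering, the less work this join has to do.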
In an exemplary embodiment of the present invention, a join query is performed between the temporary table generated from the converted DataFrame and the second data table to obtain the required data, and the required data is inserted into the target database after operator operations are applied.
In an exemplary embodiment of the present invention, a second sub-thread is created, identification information of a plurality of pieces of updated data in the first data table is obtained through the second sub-thread, the identification information of each piece of updated data is used as a first key, a BitSet corresponding to the identification information is generated through a hash operation, and then a new broadcast variable is generated and broadcast to each working node by using the BitSet as initial data, so as to complete updating of the broadcast variable.
Further, if the data in the first data table is updated periodically, the broadcast variable may also be updated periodically. That is, at a specified time interval, the updated data in the first data table is read, the identification information of the data is extracted, a BitSet corresponding to the extracted identification information is generated through the hash operation, and the BitSet is then used as initial data to generate a new broadcast variable.
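The periodic refresh can be sketched with a background thread in plain Python. The `FilterHolder` class, the `refresh_loop` function, the 10 ms interval, and the in-memory `rule_table` are all hypothetical, and a `frozenset` again stands in for the BitSet. A broadcast variable in Spark is read-only once created, so an update amounts to building a new one and switching over to it, which the `replace` method imitates.

```python
import threading


class FilterHolder:
    """Holds the current filter and lets a background thread swap in a new one."""

    def __init__(self, initial):
        self._lock = threading.Lock()
        self._value = initial
        self.updated = threading.Event()

    def get(self):
        with self._lock:
            return self._value

    def replace(self, new_value):
        with self._lock:
            self._value = new_value
        self.updated.set()


def refresh_loop(holder, load_keys, interval_s, stop_event):
    """Rebuild the filter from the table every interval_s seconds until stopped."""
    while not stop_event.wait(interval_s):
        holder.replace(frozenset(load_keys()))


# Hypothetical in-memory "alarm rule table"; a real job would query the database.
rule_table = ["2018-09-29|10.0.0.1|cpu_load"]
holder = FilterHolder(frozenset(rule_table))
rule_table.append("2018-09-30|10.0.0.2|disk_usage")  # the table changes later

stop = threading.Event()
worker = threading.Thread(
    target=refresh_loop,
    args=(holder, lambda: list(rule_table), 0.01, stop),
    daemon=True,
)
worker.start()
refreshed = holder.updated.wait(timeout=5.0)  # wait for the first refresh
stop.set()
worker.join(timeout=5.0)
```

Readers always call `holder.get()` and therefore see either the old filter or the new one, never a half-built value, because each replacement is constructed in full before it is swapped in.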
FIG. 2 schematically illustrates a data filtering flow according to some embodiments of the invention.
Referring to fig. 2, in step S201, the Streaming Context in Spark is initialized.
In step S202, data in the first data table, that is, the alarm rule table, is read, identification information of each piece of data, that is, "date + IP address + alarm type", is extracted, and a BitSet is generated using the identification information of the data as a first key.
In step S203, a broadcast variable (Broadcast) is generated with the BitSet as initial data.
In step S204, the newly added data of the working node is read, and a second keyword is generated according to the identification information of each piece of data, i.e., "date + IP address + alarm type".
In step S205, it is determined whether the second keyword generated in step S204 exists in the broadcast variable.
In step S206, if it is determined that the second keyword generated in step S204 exists in the broadcast variable, a first sub-thread, i.e., an operator operation thread, is created.
In step S207, in the operator operation thread, the filtered resilient distributed dataset (RDD) is converted into a DataFrame, a temporary table is generated, and a join query is performed between the temporary table and the same-day data already in the alarm table, so as to obtain the data to be stored and insert it into the target database.
In step S208, the storage result in step S207 is returned, and the filtering of the data is completed.
In step S209, a second sub-thread, i.e., a broadcast variable update thread, is created.
In step S210, in the broadcast variable update thread, the data in the alarm rule table is read again, the identification information of each piece of data is used as the first keyword to generate a new BitSet, and the new BitSet is used to assign a value to the broadcast variable again, thereby completing the update of the broadcast variable.
In step S211, the update result of step S210 is returned, and the update of the broadcast variable is completed.
It is noted that although the steps of the methods of the present invention are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be broken down into multiple steps.
In addition, an embodiment of the invention also provides a data filtering device. FIG. 3 schematically illustrates a block diagram of a data filtering device according to some embodiments of the present invention. Referring to FIG. 3, the data filtering device 300 includes a broadcasting unit 310, a judging unit 320, and a filtering unit 330. The broadcasting unit 310 is configured to generate a broadcast variable based on the identification information of a plurality of pieces of data in the first data table, and broadcast the broadcast variable to each working node; the judging unit 320 is configured to extract identification information of newly added data generated by the working node, and determine whether the identification information of the newly added data exists in the broadcast variable; the filtering unit 330 is configured to filter the corresponding newly added data into the to-be-processed resilient distributed dataset (RDD) in response to the identification information of the newly added data existing in the broadcast variable.
In an example embodiment of the present invention, based on the foregoing scheme, the broadcasting unit 310 is configured to: acquire identification information of a plurality of pieces of data in the first data table; use the identification information of each piece of data as a first keyword, and generate a BitSet corresponding to the identification information through a hash operation; and generate a broadcast variable by taking the BitSet as initial data.
In an example embodiment of the present invention, based on the foregoing scheme, the determining unit 320 is configured to: taking the identification information of the newly added data as a second keyword, and performing the hash operation on the second keyword; and judging whether the second keyword exists in the BitSet or not based on the result of the Hash operation.
In an example embodiment of the present invention, the data filtering device further includes a join query unit, configured to generate a temporary table based on the resilient distributed dataset (RDD) to be processed and perform a join query on the temporary table and a second data table.
In an example embodiment of the present invention, based on the foregoing solution, the join query unit is configured to: create a sub-thread, and convert the RDD to be processed into a DataFrame through the sub-thread; and generate a temporary table based on the DataFrame.
The specific details of each module of the data filtering device have been described in detail in the corresponding data filtering method, and therefore are not described herein again.
It should be noted that although several modules or units of the data filtering device 300 are mentioned in the above detailed description, such division is not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided among a plurality of modules or units.
In an exemplary embodiment of the present invention, an electronic device capable of implementing the data filtering method is also provided. FIG. 4 schematically illustrates a computer system 400 suitable for implementing an electronic device of an embodiment of the present invention. The electronic device 400 shown in FIG. 4 is only an example and should not impose any limitation on the functionality and scope of use of the embodiments of the present invention.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method, or program product. Thus, various aspects of the invention may be embodied in the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, all of which may generally be referred to herein as a "circuit," "module," or "system."
In some possible embodiments, an electronic device according to the invention may comprise at least one processing unit and at least one storage unit. The storage unit stores program code which, when executed by the processing unit, causes the processing unit to perform the steps of the data filtering method according to various exemplary embodiments of the present invention described in the above section "exemplary method" of the present specification. For example, the processing unit may execute step S110 shown in FIG. 1, generating a broadcast variable based on the identification information of the plurality of pieces of data in the first data table, and broadcasting the broadcast variable to each working node; step S120, extracting the identification information of the newly added data generated by the working node, and determining whether the identification information of the newly added data exists in the broadcast variable; and step S130, in response to the identification information of the newly added data existing in the broadcast variable, filtering the corresponding newly added data into the resilient distributed dataset (RDD) to be processed.
As shown in fig. 4, electronic device 400 is embodied in the form of a general purpose computing device. The components of electronic device 400 may include, but are not limited to: the at least one processing unit 401, the at least one memory unit 402, and a bus 403 that connects the various system components (including the memory unit 402 and the processing unit 401).
Bus 403 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
The storage unit 402 may include readable media in the form of volatile memory, such as a random access memory (RAM) 4021 and/or a cache memory 4022, and may further include a read-only memory (ROM) 4023.
The storage unit 402 may also include a program/utility 4025 having a set (at least one) of program modules 4024, such program modules 4024 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The electronic device 400 may also communicate with one or more external devices 404 (e.g., keyboard, pointing device, Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 400, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 400 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 405. Also, the electronic device 400 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet) via a network adapter 406. As shown, the network adapter 406 communicates with the other modules of the electronic device 400 over the bus 403. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 400, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
It should be noted that although several units/modules or sub-units/modules of the data filtering device are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more of the units/modules described above may be embodied in one unit/module. Conversely, the features and functions of one unit/module described above may be further divided so as to be embodied by a plurality of units/modules.
Moreover, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step for execution, and/or one step may be broken down into multiple steps for execution.
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the particular embodiments disclosed, nor does the division into aspects mean that features in those aspects cannot be combined to advantage; that division is for convenience of description only. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
As another aspect, the present application also provides a computer-readable medium 500, where the computer-readable medium 500 may be included in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium 500 carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the data filtering method as described in the above embodiments.
It should be noted that although several modules or units of a device or apparatus for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present invention can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and which includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, a network device, etc.) to execute the method according to the embodiments of the present invention.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (9)

1. A Spark-based data filtering method is characterized by comprising the following steps:
Generating a broadcast variable based on the identification information of a plurality of pieces of data in a first data table, and broadcasting the broadcast variable to each working node;
Extracting identification information of newly added data generated by the working node, and determining whether the identification information of the newly added data exists in the broadcast variable;
In response to the identification information of the newly added data existing in the broadcast variable, filtering the corresponding newly added data into an elastic distributed data set (resilient distributed dataset, RDD) to be processed;
The generating of the broadcast variable based on the identification information of the plurality of pieces of data in the first data table includes:
Acquiring identification information of a plurality of pieces of data in a first data table;
Taking the identification information of each piece of data as a first keyword, and performing a hash operation on the first keyword to generate a BitSet corresponding to the identification information;
And generating a broadcast variable by taking the BitSet as initial data.
2. The data filtering method of claim 1, wherein determining whether the identification information of the newly added data exists in the broadcast variable comprises:
Taking the identification information of the newly added data as a second keyword, and performing the hash operation on the second keyword;
And judging whether the second keyword exists in the BitSet or not based on the result of the Hash operation.
3. The data filtering method of claim 1, further comprising:
Generating a temporary table based on the elastic distributed data set to be processed, and performing a join query on the temporary table and a second data table.
4. The data filtering method according to claim 3, wherein generating a temporary table based on the elastic distributed data set to be processed comprises:
Creating a sub-thread, and converting the elastic distributed data set to be processed into a data frame DataFrame through the sub-thread;
And generating a temporary table based on the DataFrame.
5. The data filtering method according to any one of claims 1 to 4, wherein the first data table is an alarm rule table, and the identification information is a date, an IP address, and an alarm type.
6. The data filtering method according to claim 1, wherein the hash operation is a MurmurHash operation.
7. A Spark-based data filtering device, comprising:
The broadcasting unit is used for generating a broadcast variable based on the identification information of a plurality of pieces of data in the first data table and broadcasting the broadcast variable to each working node;
The judging unit is used for extracting the identification information of the newly added data generated by the working node and determining whether the identification information of the newly added data exists in the broadcast variable;
The filtering unit is used for, in response to the identification information of the newly added data existing in the broadcast variable, filtering the corresponding newly added data into an elastic distributed data set (resilient distributed dataset, RDD) to be processed;
The generating of the broadcast variable based on the identification information of the plurality of pieces of data in the first data table includes:
Acquiring identification information of a plurality of pieces of data in a first data table;
Taking the identification information of each piece of data as a first keyword, and performing a hash operation on the first keyword to generate a BitSet corresponding to the identification information;
And generating a broadcast variable by taking the BitSet as initial data.
8. An electronic device, comprising:
A processor; and
A memory having stored thereon computer readable instructions which, when executed by the processor, implement the data filtering method of any one of claims 1 to 6.
9. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the data filtering method according to any one of claims 1 to 6.
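The BitSet construction and lookup recited in claims 1, 2, 5, and 6 can be sketched as follows. This is an illustrative reading under stated assumptions, not the claimed implementation: the claims do not fix the MurmurHash variant, the BitSet size, or how the identification fields are joined, so the 32-bit MurmurHash3 variant, the value of NUM_BITS, the "|" separator, and every function name below are assumptions. The bytearray stands in for a BitSet that would be wrapped in a Spark broadcast variable.

```python
from typing import Iterable

NUM_BITS = 1 << 20  # assumed BitSet size; the claims do not specify one


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 (x86, 32-bit variant), written out in pure Python."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n = len(data)
    rounded = n - (n % 4)
    for i in range(0, rounded, 4):  # body: 4-byte little-endian blocks
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    tail = data[rounded:]
    if tail:  # remaining 1 to 3 bytes
        k = int.from_bytes(tail, "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    h ^= n  # finalization mix
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def make_key(date: str, ip: str, alarm_type: str) -> bytes:
    """Join the identification fields of claim 5 into one keyword (assumed format)."""
    return f"{date}|{ip}|{alarm_type}".encode("utf-8")


def build_bitset(keys: Iterable[bytes]) -> bytearray:
    """Claim 1: hash each first keyword and set the corresponding bit."""
    bits = bytearray(NUM_BITS // 8)
    for key in keys:
        idx = murmur3_32(key) % NUM_BITS
        bits[idx >> 3] |= 1 << (idx & 7)
    return bits


def might_contain(bits: bytearray, key: bytes) -> bool:
    """Claim 2: apply the same hash to the second keyword and test its bit."""
    idx = murmur3_32(key) % NUM_BITS
    return bool(bits[idx >> 3] & (1 << (idx & 7)))


# Hypothetical keywords taken from the first data table.
rule_keys = [
    make_key("2018-09-29", "10.0.0.1", "cpu"),
    make_key("2018-09-29", "10.0.0.3", "disk"),
]
bitset = build_bitset(rule_keys)
all_rules_found = all(might_contain(bitset, k) for k in rule_keys)
```

With a single hash per keyword, a keyword that was inserted is always reported as present, but two different keywords can map to the same bit, so a positive answer may occasionally be a false positive. A larger NUM_BITS lowers that likelihood at the cost of a larger broadcast payload.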
CN201811150166.6A 2018-09-29 2018-09-29 Data filtering method and device, electronic equipment and storage medium Active CN109408711B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201811150166.6A CN109408711B (en) 2018-09-29 2018-09-29 Data filtering method and device, electronic equipment and storage medium
CA3057038A CA3057038C (en) 2018-09-29 2019-09-27 Data filtering method, apparatus, electronic apparatus and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811150166.6A CN109408711B (en) 2018-09-29 2018-09-29 Data filtering method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109408711A CN109408711A (en) 2019-03-01
CN109408711B CN109408711B (en) 2019-12-06

Family

ID=65465729

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811150166.6A Active CN109408711B (en) 2018-09-29 2018-09-29 Data filtering method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN109408711B (en)
CA (1) CA3057038C (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241163A (en) * 2020-01-17 2020-06-05 平安科技(深圳)有限公司 Distributed computing task response method and device
CN112163176A (en) * 2020-11-02 2021-01-01 北京城市网邻信息技术有限公司 Data storage method and device, electronic equipment and computer readable medium
WO2022155920A1 (en) * 2021-01-22 2022-07-28 Oppo广东移动通信有限公司 Information transmission method and apparatus, and device and storage medium
CN115941327A (en) * 2022-12-08 2023-04-07 西安交通大学 Multilayer malicious URL identification method based on learning type bloom filter

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104408190A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Spark based data processing method and device
CN106372190A (en) * 2016-08-31 2017-02-01 华北电力大学(保定) Method and device for querying OLAP (on-line analytical processing) in real time
CN107015989A (en) * 2016-01-27 2017-08-04 博雅网络游戏开发(深圳)有限公司 Data processing method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740424A (en) * 2016-01-29 2016-07-06 湖南大学 Spark platform based high efficiency text classification method
CN107220261B (en) * 2016-03-22 2020-10-30 ***通信集团山西有限公司 Real-time mining method and device based on distributed data
CN106296305A (en) * 2016-08-23 2017-01-04 上海海事大学 Electric business website real-time recommendation System and method under big data environment
US10176092B2 (en) * 2016-09-21 2019-01-08 Ngd Systems, Inc. System and method for executing data processing tasks using resilient distributed datasets (RDDs) in a storage device
CN106611064B (en) * 2017-01-03 2020-03-06 北京华胜信泰数据技术有限公司 Data processing method and device for distributed relational database

Also Published As

Publication number Publication date
CA3057038C (en) 2023-06-27
CN109408711A (en) 2019-03-01
CA3057038A1 (en) 2020-03-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200624

Address after: Room 301, building 2, No. 18, Tianshan West Road, Changning District, Shanghai, 200335

Patentee after: Shanghai Liangxin Technology Co., Ltd

Address before: 100083 Beijing Haidian District North Fourth Ring Road West, No. 9 2106-030

Patentee before: BEIJING SANKUAI ONLINE TECHNOLOGY Co.,Ltd.
