CN106776901B - Data extraction method, device and system - Google Patents


Info

Publication number
CN106776901B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611080168.3A
Other languages
Chinese (zh)
Other versions
CN106776901A (en)
Inventor
蔡自彬
何金良
李娟�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Knownsec Information Technology Co Ltd
Original Assignee
Beijing Knownsec Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Knownsec Information Technology Co Ltd filed Critical Beijing Knownsec Information Technology Co Ltd
Priority to CN201611080168.3A priority Critical patent/CN106776901B/en
Publication of CN106776901A publication Critical patent/CN106776901A/en
Application granted granted Critical
Publication of CN106776901B publication Critical patent/CN106776901B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method of extracting data from one or more data sources, each of the one or more data sources comprising a plurality of pieces of data, each piece of data comprising one or more data items in the form of key-value pairs, the data extraction method comprising the steps of: for each data source in the one or more data sources, determining a data type corresponding to each key, and generating a data type table; parsing a piece of data and extracting one or more data items included in the piece of data, for each data item: extracting key-value pairs forming the data item, and determining a data type corresponding to the extracted key from a data type table according to a data source of the data; and verifying the value in the extracted key-value pair by using a data verification method corresponding to the data type, if the verification is passed, the extraction is successful, and the value in the extracted key-value pair is recorded. The invention also discloses a corresponding data extraction device and a corresponding data extraction system.

Description

Data extraction method, device and system
Technical Field
The invention belongs to the technical field of data extraction, and particularly relates to a data extraction method, a data extraction device and a data extraction system.
Background
In the current big data environment, how to accurately extract needed data information from massive data, such as HTTP access logs, internet of things data, and the like, has very important significance in analyzing user behaviors, preferences, habits, and the like, or predicting user behaviors, improving advertisement delivery effects, and the like.
Taking the extraction of data from a URL (Uniform Resource Locator) as an example, the usual approach is to run a predetermined regular expression as a full-text match over the data: whenever there is a hit, the matched data is extracted and its type is taken to be the type associated with that regular expression. In practice this scheme has a high error rate. For example, data of which only a part conforms to the regular expression may be identified as the corresponding data type and extracted; or data that is not of the corresponding type at all may, within a large volume of data, contain a fragment that happens to conform to the regular expression and be extracted by mistake.
Therefore, there is a need for a data extraction method that can accurately extract data from various data sources while ensuring the efficiency of data extraction.
Disclosure of Invention
To this end, the present invention provides a data extraction method, apparatus and system in an attempt to solve or at least alleviate at least one of the problems identified above.
According to one aspect of the present invention, there is provided a method of extracting data from one or more data sources, each of the one or more data sources comprising a plurality of pieces of data, each piece of data comprising one or more data items in the form of key-value pairs, the data extraction method comprising the steps of: for each data source in the one or more data sources, determining a data type corresponding to each key, and generating a data type table; parsing a piece of data and extracting one or more data items included in the piece of data; for each data item, extracting the key-value pair forming the data item, and determining the data type corresponding to the extracted key from the data type table according to the data source of the piece of data; and verifying the value in the extracted key-value pair by using a data verification method corresponding to the data type, wherein if the verification is passed, the extraction is successful and the value in the extracted key-value pair is recorded.
Optionally, in the data extraction method according to the present invention, the step of generating the data type table includes: for each of the one or more data sources, sampling data to obtain a first number of pieces of data; parsing each of the first number of pieces of data one by one and extracting all data items; analyzing, by a regular expression and/or a data verification method, the data type of the value in the key-value pair of each data item, and taking it as a data type corresponding to the key; counting, for each key in each data source, the data types corresponding to the key and the number of values corresponding to each data type; and selecting, from the data types corresponding to each key, the data type whose proportion of values exceeds a first threshold, determining it as the data type corresponding to the key in the data source, and storing the key in the data source in association with the determined data type as the data type table.
Optionally, in the data extraction method according to the present invention, the step of sampling data includes, for each of the one or more data sources: extracting a first number of pieces of data in each data source; and/or randomly sampling a first number of pieces of data in each data source; and/or extracting a first number of pieces of data in each data source by time period.
Optionally, in the data extraction method according to the present invention, the proportion of values corresponding to a data type is the ratio of the number of values of that data type for a given key to the total number of values of all data types corresponding to the key in the data source.
Optionally, in the data extraction method according to the present invention, the step of verifying the value in the extracted key-value pair by using the data verification method for the data type further includes: the values in the extracted key-value pairs are checked using the regular expression for that data type.
Optionally, in the data extraction method according to the present invention, the method further includes a step of correcting the data type: when a preset condition is met, counting the number of successful extraction and the number of failed extraction of each key in each data source every first preset time, and calculating the extraction success percentage of each key in each data source in the time period; and if the extraction success percentage is lower than a second threshold value, generating an alarm signal to trigger data type correction, and resampling and counting the data type corresponding to the key in the data source.
Optionally, in the data extraction method according to the present invention, the step of correcting the data type further includes: repeating the step of generating the data type table for the latest data every second predetermined time to generate a new data type table; and, according to the new data type table, reselecting from the data types corresponding to each key the data type whose proportion of values exceeds the first threshold as the data type corresponding to the key in the data source, so as to execute the subsequent data extraction steps.
Optionally, in the data extraction method according to the present invention, the data types include: identity, social account, geographic location information, mobile device identification.
Optionally, in the data extraction method according to the present invention, the first predetermined time is one day; the second predetermined time is seven days or one day.
According to a further aspect of the present invention, there is provided an extraction apparatus for extracting data from one or more data sources, each of the one or more data sources comprising a plurality of pieces of data, each piece of data comprising one or more data items in the form of key-value pairs, the data extraction apparatus comprising: the data type analysis module is used for determining the data type corresponding to each key for each data source in one or more data sources and generating a data type table; the data extraction module is suitable for analyzing a piece of data and extracting one or more data items included in the piece of data, and is also suitable for extracting a key-value pair forming the data item for each data item; the data type analysis module is also suitable for determining the data type corresponding to the extracted key from the data type table according to the data source of the piece of data; and the data verification module is suitable for verifying the value in the extracted key-value pair by using a data verification method corresponding to the data type, and if the verification is passed, the extraction is successful, and the value in the extracted key-value pair is recorded.
Optionally, in the data extraction apparatus according to the present invention, the data type analysis module includes: a data sampling unit adapted to sample data for each of the one or more data sources to obtain a first number of pieces of data; a data extraction unit adapted to parse each of the first number of pieces of data one by one and extract all data items; a data type analysis unit adapted to analyze, by a regular expression and/or a data verification method, the data type of the value in the key-value pair of each data item and take it as a data type corresponding to the key; and a counting unit adapted to count, for each key in each data source, the data types corresponding to the key and the number of values corresponding to each data type; wherein the data type analysis unit is further adapted to select, from the data types corresponding to each key, the data type whose proportion of values exceeds a first threshold, determine it as the data type corresponding to the key in the data source, and store the key in the data source in association with the determined data type as the data type table.
Optionally, in the data extraction apparatus according to the present invention, the data sampling unit is further adapted to extract a first number of pieces of data in each data source; and/or further adapted to randomly sample a first number of pieces of data in each data source; and/or further adapted to extract a first number of pieces of data in each data source by time period.
Optionally, in the data extraction apparatus according to the present invention, the proportion of values corresponding to a data type is the ratio of the number of values of that data type for a given key to the total number of values of all data types corresponding to the key in the data source.
Optionally, in the data extraction apparatus according to the present invention, the data verification module is further adapted to verify a value in the extracted key-value pair with a regular expression of the data type.
Optionally, the data extraction device according to the present invention further includes a data type correction module, where the data type correction module is adapted to count, every first predetermined time, the number of successful extractions and the number of failed extractions of each key in each data source, and calculate the percentage of successful extractions of each key in each data source in the time period, when a preset condition is satisfied; and the data type correction module is also suitable for generating an alarm signal when the extraction success percentage is lower than a second threshold value so as to trigger data type correction and carry out resampling statistics on the data type corresponding to the key in the data source.
Optionally, in the data extraction apparatus according to the present invention, the data type correction module is further adapted to trigger the data type analysis module every second predetermined time, so that the data type analysis module generates a new data type table from the latest data and, according to the new data type table, reselects from the data types corresponding to each key the data type whose proportion of values exceeds the first threshold as the data type corresponding to the key in the data source, so as to perform the subsequent data extraction steps.
Optionally, in the data extraction apparatus according to the present invention, the data types include: identity, social account, geographic location information, mobile device identification.
Optionally, in the data extraction apparatus according to the present invention, the first predetermined time is one day; the second predetermined time is seven days or one day.
According to yet another aspect of the present invention, there is also provided an extraction system for extracting data from one or more data sources, comprising: a data acquisition device adapted to acquire data from one or more data sources; the data extraction apparatus as described above; and a data analysis device adapted to analyze the extracted data.
According to the data extraction scheme of the present invention, the data type of the Value of each Key in each data source is obtained through sampling and statistical analysis, and a data type table is generated. When data is extracted, the data type of the key is already known, so only the data verification method of that data type needs to be applied, which improves the efficiency of data extraction. Moreover, because every value is verified before it is recorded, the accuracy of data extraction is ensured.
Drawings
To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which are indicative of various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.
FIG. 1 shows a schematic diagram of a data extraction system 100 according to one embodiment of the invention;
FIG. 2 shows a flow diagram of a data extraction method 200 according to one embodiment of the invention;
FIG. 3 shows a schematic diagram of a data extraction device 120 according to one embodiment of the present invention; and
FIG. 4 shows a schematic diagram of a data extraction device 120 according to a further embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
FIG. 1 shows a schematic diagram of a data extraction system 100 according to one embodiment of the invention. As shown in FIG. 1, the system 100 includes a data acquisition device 110, a data extraction device 120, and a data analysis device 130. The data acquisition device 110 is adapted to acquire data from one or more data sources, wherein each of the one or more data sources comprises a plurality of pieces of data, and each piece of data comprises one or more data items in the form of key-value pairs (Key-Value). The data extraction device 120 is adapted to accurately extract the Value in a data item from the data acquired by the data acquisition device 110, for example an Email address, a GPS location, a social account, and the like. The data analysis device 130 is adapted to analyze the data extracted by the data extraction device 120, for example to build, from a series of data characterizing user features, a user profile for predicting user behavior.
Based on the above description of the system 100, how to accurately and efficiently extract data in the present system is the key to implementing the present solution, that is, the operations performed by the data extraction device 120.
The flow of data extraction by the data extraction device 120 will be described in detail below.
Referring to FIG. 2, a flow diagram of a data extraction method 200 performed in the data extraction device 120 is shown, according to one embodiment of the present invention. The method 200 begins at step S210. In step S210, for each of the one or more data sources from which the data acquisition device 110 acquires data, the data type corresponding to each key is determined, and a data type table is generated.
According to some embodiments, the meaning of the Value corresponding to the same Key is likely to differ between data sources. For example, the HTTP access logs of different websites belong to different data sources, because the same Key may mean different things in the logs of different websites. Take the following two URLs as an example:
URL1:http://www.xxx.com/index.htm?id=[email protected]&name=test
URL2:http://www.yyy.com/index.html?id=123456&phone=13405671234
In both URLs, the key-value pairs take the form Key=Value. For URL1, the id field carries the user's identity information, whereas for URL2 the id field carries the user's numeric identifier; the keys are the same but the meanings are completely different.
Therefore, the extracted key-value pairs are first classified to determine the data type corresponding to each key in each data source.
First, for each of the one or more data sources, data is sampled to obtain a first number of pieces of data. Optionally, the step of sampling the data comprises: extracting a first number of pieces of data in each data source; and/or randomly sampling a first number of pieces of data in each data source; and/or extracting a first number of pieces of data in each data source by time period.
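The three sampling options can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the `(timestamp, line)` record layout, and the use of a seeded random generator are all assumptions made for the example.

```python
import random
from datetime import datetime


def sample_first(records, n):
    """Option 1: take the first n pieces of data from a data source."""
    return records[:n]


def sample_random(records, n, seed=None):
    """Option 2: randomly sample up to n pieces of data."""
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))


def sample_by_period(records, n, start, end):
    """Option 3: take up to n pieces whose timestamp falls in [start, end).

    Each record is assumed to be a (timestamp, line) tuple.
    """
    in_window = [r for r in records if start <= r[0] < end]
    return in_window[:n]


# Hypothetical log records for one data source, one per day.
records = [(datetime(2016, 11, day), f"line{day}") for day in range(1, 11)]
window = sample_by_period(
    records, 5, datetime(2016, 11, 3), datetime(2016, 11, 6)
)
```

The options can also be combined, for example by sampling randomly within a time window.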
Second, each of the sampled first number of pieces of data is parsed one by one and all data items are extracted. It should be noted that the present invention does not limit the method of extracting data items containing key-value pairs. For example, a data item may take the form "key separator value", in which the first part is the key and the second part is the value, and the separator may be ":", "=", and so forth.
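A minimal sketch of this parsing step, using Python's standard `urllib.parse` for URLs and a small hypothetical helper, `split_item`, for generic "key separator value" items. Treating the host name as the data source identifier is an assumption made for illustration, motivated by the statement that logs of different websites are different data sources.

```python
from urllib.parse import parse_qsl, urlsplit


def extract_items_from_url(url):
    """Return (data_source, [(key, value), ...]) for one URL.

    The host name stands in for the data source (an assumption).
    """
    parts = urlsplit(url)
    return parts.netloc, parse_qsl(parts.query, keep_blank_values=True)


def split_item(item, separators=(":", "=")):
    """Split a 'key separator value' item at the earliest separator found.

    Returns (key, value), or None when no separator is present.
    """
    positions = [(item.find(sep), sep) for sep in separators if sep in item]
    if not positions:
        return None
    pos, sep = min(positions)
    return item[:pos], item[pos + len(sep):]


source, items = extract_items_from_url(
    "http://www.yyy.com/index.html?id=123456&phone=13405671234"
)
```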
Then, the data type of the value in the key-value pair of each data item is analyzed through a regular expression and/or a data verification method, and is taken as a data type corresponding to the key. Optionally, the data types include: identity (such as an identity card number), social account (such as a WeChat ID, QQ number, or Email), geographic location information (such as GPS location, city, country), mobile device identification (such as IMEI), etc.
According to an embodiment of the invention, data of certain data types contains a check code; for example, the last digit of an identity card number is a check digit, and whether the check digit is correct can be verified by calculation. As another example, the regular expression for Email is: ^[a-zA-Z0-9-]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)+$, and whether the value in a key-value pair conforms to the Email data type can be analyzed through this regular expression.
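The two kinds of check can be sketched as below. The patent does not specify which identity number scheme it has in mind; this sketch assumes the 18-character Chinese resident identity number, whose last character is a check character computed with weights and a modulo-11 lookup. The Email pattern is a simple one in the spirit of the expression quoted above, and the function names are hypothetical.

```python
import re

# Simple Email pattern (an assumption modeled on the text's example).
EMAIL_RE = re.compile(r"[a-zA-Z0-9-]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)+")

# Weights and check characters for the assumed 18-character identity number.
ID_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
ID_CHECK_CHARS = "10X98765432"


def is_valid_email(value):
    """Regular-expression check: the whole value must match, not a fragment."""
    return EMAIL_RE.fullmatch(value) is not None


def is_valid_id_number(value):
    """Data-verification check: recompute the check character and compare."""
    if not re.fullmatch(r"[0-9]{17}[0-9Xx]", value):
        return False
    total = sum(int(d) * w for d, w in zip(value[:17], ID_WEIGHTS))
    return ID_CHECK_CHARS[total % 11] == value[17].upper()
```

Using `fullmatch` rather than a search means a value that merely contains an Email-like fragment is rejected, which addresses the partial-match errors described in the Background.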
And finally, counting the number of the data types corresponding to each key in each data source and the number of values corresponding to the data types, selecting the data type of which the corresponding value number accounts for more than a first threshold value from the data types corresponding to each key, determining the data type corresponding to the key in the data source, and storing the key and the determined data type in the data source in a correlated manner to serve as a data type table.
According to the embodiment of the invention, the statistics of the data types corresponding to each key in each data source can be represented in the following form:
The "number M" indicates that in the data source X, the Key1 corresponds to data types such as data type a, data type B, … …, unknown data type, and the number of values corresponding to the data type a is M in total.
Next, the proportion of values for each data type of each key is calculated, where the proportion for a data type is the ratio of the number of values of that data type for a given key to the total number of values of all data types for that key in the data source. When the proportion for a data type exceeds the first threshold (e.g., 0.8), that data type is determined to be the data type corresponding to the key in the data source; values of other data types under that key are likely to be erroneous and can be excluded. A data type table is then generated that maps each key in each data source to its determined data type.
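The counting and threshold selection can be sketched as follows. The `classify` function is a toy stand-in for the regular-expression and verification analysis, and the names, the `(source, key, value)` sample layout, and the dictionary representation of the data type table are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def classify(value):
    """Toy type analysis; a real system would use per-type validators."""
    if "@" in value:
        return "email"
    if value.isdigit():
        return "number"
    return "unknown"


def build_type_table(samples, classify, threshold=0.8):
    """Map (source, key) to the data type whose share of values exceeds threshold.

    samples is an iterable of (source, key, value) triples. With a threshold
    of at least 0.5, at most one data type per key can qualify.
    """
    counts = defaultdict(Counter)
    for source, key, value in samples:
        counts[(source, key)][classify(value)] += 1

    table = {}
    for source_key, type_counts in counts.items():
        data_type, hits = type_counts.most_common(1)[0]
        share = hits / sum(type_counts.values())
        if data_type != "unknown" and share > threshold:
            table[source_key] = data_type
    return table


samples = (
    [("X", "id", f"user{i}@example.com") for i in range(9)]
    + [("X", "id", "123456")]
    + [("Y", "id", "123456")] * 5
    + [("Y", "id", "abc")] * 5
)
type_table = build_type_table(samples, classify)
```

In this example the key `id` in source X is 90% Email and enters the table, while `id` in source Y has no type above the threshold and is left out.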
Subsequently, in step S220, a piece of data is parsed and one or more data items included in the piece of data are extracted.
Subsequently, in step S230, for each data item, the key-value pairs constituting the data item are extracted, and the data type corresponding to the extracted key is determined from the data type table according to the data source of the piece of data. Assuming that the data source of the piece of data is X, the data type corresponding to the Key1 is a from the above data type table.
Subsequently, in step S240, the extracted values in the key-value pairs are verified by using the data verification method corresponding to the data type, and if the verification is passed, the extraction is successful, and the extracted values in the key-value pairs are recorded. Generally, when a data type of a value is analyzed by using a data checking method, a data type list needs to be traversed, whether the value meets the format requirement and the checking requirement of the data type is checked in sequence, and which data type the value belongs to is checked. However, in the method, because the data type corresponding to the extracted key is determined according to the data type table, only the value corresponding to the key needs to be checked whether the value corresponds to the data type, and the efficiency is greatly improved.
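Steps S220 to S240 can be sketched as a single lookup-then-verify loop. The validators, the table contents, and the choice to key the recorded output by data type are assumptions for illustration; the point is that each value is checked against one validator only, instead of being tried against every known data type.

```python
import json
import re

# Hypothetical validators, one per data type.
VALIDATORS = {
    "email": lambda v: re.fullmatch(
        r"[a-zA-Z0-9-]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)+", v
    ) is not None,
    "phone": lambda v: re.fullmatch(r"1[0-9]{10}", v) is not None,
}

# Hypothetical data type table: (data source, key) -> data type.
TYPE_TABLE = {
    ("www.xxx.com", "id"): "email",
    ("www.yyy.com", "phone"): "phone",
}


def extract(source, items, type_table=TYPE_TABLE, validators=VALIDATORS):
    """Return (recorded, failed_keys) for the key-value pairs of one piece of data."""
    recorded, failed = {}, []
    for key, value in items:
        data_type = type_table.get((source, key))
        if data_type is None:
            continue  # key has no determined type in this data source
        if validators[data_type](value):
            recorded[data_type] = value  # verification passed
        else:
            failed.append(key)  # counted as a failed extraction
    return recorded, failed


recorded, failed = extract(
    "www.yyy.com", [("id", "123456"), ("phone", "13405671234")]
)
as_json = json.dumps(recorded)
```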
According to yet another embodiment of the present invention, the value in the extracted key-value pair may also be checked using the regular expression of that data type. When a regular expression is used to analyze the data type of a value, the value is matched against the regular expression of that data type; for example, to check the IP address data type, the value is matched against the regular expression for IP addresses.
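As an illustration of the IP address case, the sketch below checks a value against a dotted-decimal IPv4 pattern. The patent does not give its IP expression, so the pattern and the function name are assumptions.

```python
import re

# One octet: 0-255 without leading zeros.
OCTET = r"(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"
IPV4_RE = re.compile(rf"{OCTET}(\.{OCTET}){{3}}")


def is_ipv4(value):
    """True only when the entire value is a dotted-decimal IPv4 address."""
    return IPV4_RE.fullmatch(value) is not None
```

Because the key's data type is already known, the whole value is matched rather than searched, so a string that merely contains an address is not accepted.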
According to the embodiment of the present invention, a data type corresponding to the value may be verified by using a data verification method or a regular expression, or may be verified by using a combination of the two methods, which is not limited in the present invention.
If the data type corresponding to the value is consistent with the data type of the data item determined in step S230 after verification, the verification is passed, indicating that the extraction is successful, and recording the value in the extracted key-value pair. Typically, this value is stored in JSON format, such as:
{"ip":"1.1.1.1","email":"[email protected]"}
According to one implementation, the meaning of a key may change, for example because of an upgrade at the data source, and the data type then needs to be corrected. Generally, the step of correcting the data type includes:
When a preset condition is met (the data volume is large enough, for example a given key has occurred thousands of times or more), the number of successful extractions and the number of failed extractions of each key in each data source are counted every first predetermined time (e.g., 1 day), and the extraction success percentage of each key in each data source over that time period is calculated.
If the extraction success percentage is lower than a second threshold (for example, a value in the range 0.75 to 0.85), an alarm signal is generated to trigger data type correction (performed automatically or manually by an administrator), and the data type corresponding to the key in the data source is re-sampled and re-counted.
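The periodic check can be sketched as a function that runs once per period over the success and failure counts. The names, the `min_samples` value standing in for the "data volume is large enough" condition, and the threshold of 0.8 (chosen from within the stated range) are assumptions for illustration.

```python
def find_keys_needing_correction(stats, min_samples=1000, threshold=0.8):
    """Return the (source, key) pairs whose success rate fell below threshold.

    stats maps (source, key) to (successes, failures) for one period.
    Keys seen fewer than min_samples times are skipped, since their
    success rate is not yet statistically meaningful.
    """
    alarms = []
    for (source, key), (successes, failures) in stats.items():
        total = successes + failures
        if total < min_samples:
            continue
        if successes / total < threshold:
            alarms.append((source, key))
    return alarms


# Hypothetical counts for one day.
daily_stats = {
    ("X", "id"): (900, 100),     # 0.90 success rate: healthy
    ("X", "phone"): (500, 500),  # 0.50 success rate: triggers an alarm
    ("Y", "id"): (1, 9),         # too few samples to judge
}
alarms = find_keys_needing_correction(daily_stats)
```

Each returned pair would then be re-sampled and re-counted to refresh its entry in the data type table.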
According to one implementation, the above step of generating the data type table (i.e., step S210) may also be repeated for the latest data every second predetermined time (e.g., 1 day or 7 days), and a new data type table is generated.
Then, according to the new data type table, the data type whose proportion of values exceeds the first threshold is reselected from the data types corresponding to each key as the data type corresponding to the key in the data source, and the subsequent data extraction steps are performed with it.
Referring to the above description, the method 200 obtains the data type of the Value (Value) of each Key (Key) in each data source through sampling statistical analysis, and generates a data type table; when data is extracted, the data type of the key is known, and only the data verification method of the data type is used for verification, so that the data extraction efficiency is improved; and moreover, the accuracy of data extraction is ensured through verification and determination.
Moreover, because the meaning of a key may change owing to upgrades and similar events at the data source, a data type correction step is added, which further improves the accuracy of the data.
Accordingly, FIG. 3 shows a schematic diagram of a data extraction device 120 according to an embodiment of the present invention. As shown in FIG. 3, the device 120 comprises: a data type analysis module 122, a data extraction module 124, and a data verification module 126.
For each of the one or more data sources, the data type analysis module 122 determines a data type corresponding to each key, and generates a data type table.
Further, the data type analysis module 122 includes: a data sampling unit 1222, a data extraction unit 1224, a data type analysis unit 1226, and a statistics unit 1228, as shown in fig. 3.
The data sampling unit 1222 is adapted to sample data for each of the one or more data sources to obtain a first number of pieces of data. Optionally, the data sampling unit 1222 is adapted to extract a first number of pieces of data in each data source; and/or randomly sampling a first number of pieces of data in each data source; and/or extracting a first number of pieces of data in each data source by time period.
The data extraction unit 1224 is adapted to parse each of the first number of pieces of data one by one and extract all data items. The invention does not limit the manner in which data items in the form of key-value pairs are extracted.
The data type analysis unit 1226 is adapted to analyze, through a regular expression and/or a data verification method, the data type of the value in the key-value pair of each data item, and to take it as a data type corresponding to the key. Optionally, the data types include: identity (e.g., an identity card number), social account (e.g., a WeChat ID, QQ number, or Email), geographic location information (e.g., GPS location, city, country), mobile device identification (e.g., IMEI), etc.
According to an embodiment of the invention, data of certain data types contains a check code; for example, the last digit of an identity card number is a check digit, and whether the check digit is correct can be verified by calculation. As another example, the regular expression for Email is: ^[a-zA-Z0-9-]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)+$, and whether the value in a key-value pair conforms to the Email data type can be analyzed through this regular expression. Of course, the data type of the value can also be analyzed by combining the two approaches, which is not limited by the present invention.
The statistical unit 1228 is adapted to count the number of data types corresponding to each key in each data source and the number of values corresponding to the data types, where the statistical result is shown in the following table:
The data type analysis unit 1226 is further adapted to select, from the data types corresponding to each key, the data type whose proportion of values exceeds a first threshold, and to determine it as the data type corresponding to the key in the data source, where the proportion of values corresponding to a data type is the ratio of the number of values of that data type for a given key to the total number of values of all data types corresponding to the key in the data source.
The data type analysis unit 1226 is further adapted to store the key and the determined data type in the data source in association as a data type table, as shown in the following table:
The data extraction module 124 is adapted to parse a piece of data and extract one or more data items included in the piece of data, and is further adapted to extract, for each data item, key-value pairs constituting the data item.
The data type analysis module 122 is further adapted to determine the data type corresponding to the extracted key from the data type table according to the data source of the piece of data. For example, the data type corresponding to the Key2 in the data source X is E, and the data type corresponding to the Key5 in the data source Y is G.
The data verification module 126 is adapted to verify the value in the extracted key-value pair by using a data verification method corresponding to the data type, and if the verification is passed, the extraction is successful, and the value in the extracted key-value pair is recorded.
According to the embodiment of the invention, since the data type corresponding to the extracted key has already been determined from the data type table, only the verification method of that data type needs to be applied to check whether the value corresponding to the key conforms to the data type.
According to yet another embodiment of the present invention, the values in the extracted key-value pairs may also be checked using the regular expression of that data type. When a regular expression is used to analyze the data type of a value, the value is matched against the regular expression of the data type in question; for example, to determine whether a value is of the IP address data type, the value is matched against the regular expression for IP addresses.
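As an illustration of the IP address case, a regular-expression check might look like the sketch below. The pattern is an assumption (dotted-decimal IPv4 only, no leading zeros), since the text does not give the expression it uses.

```python
import re

# One octet in the range 0-255, without leading zeros (assumed convention).
_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"
IPV4_RE = re.compile(rf"{_OCTET}(?:\.{_OCTET}){{3}}")


def matches_ipv4(value: str) -> bool:
    """Return True if the whole value is a dotted-decimal IPv4 address."""
    return IPV4_RE.fullmatch(value) is not None
```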
According to the embodiment of the present invention, the data type corresponding to the value may also be verified by combining the data verification method and the regular expression, which is not limited in the present invention.
If, after verification, the data type of the value is consistent with the data type determined by the data type analysis module 122, the verification passes, the extraction is successful, and the value in the extracted key-value pair is recorded. Typically, the value is stored in JSON format.
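The lookup-then-verify flow described above can be sketched as follows. The names `extract_piece` and `validators`, the dictionary form of a parsed piece of data, and the decision to skip keys that are absent from the data type table are all assumptions made for illustration.

```python
import json


def extract_piece(source, piece, type_table, validators):
    """Extract verified values from one parsed piece of data.

    piece:      dict of key -> value (the piece is assumed already parsed)
    type_table: {(source, key): data_type}
    validators: {data_type: callable(value) -> bool}
    Returns (JSON string of recorded values, list of keys that failed).
    """
    recorded, failed = {}, []
    for key, value in piece.items():
        data_type = type_table.get((source, key))
        if data_type is None:
            continue  # no known data type for this key: skipped (assumption)
        # Only the verification method of the known data type is applied.
        if validators[data_type](value):
            recorded[key] = value
        else:
            failed.append(key)
    return json.dumps(recorded, sort_keys=True), failed
```

Returning the failed keys alongside the recorded values makes it straightforward to feed per-key success and failure counts into a later correction step.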
Considering that the meaning of the key may change due to upgrading of the data source end, the apparatus 120 further includes a data type correction module 128 in addition to the data type analysis module 122, the data extraction module 124 and the data verification module 126, as shown in fig. 4.
The data type correction module 128 is adapted to count, when a preset condition is satisfied, the number of successful extractions and the number of failed extractions of each key in each data source every first predetermined time (e.g., 1 day), and to calculate the extraction success percentage of each key in each data source in that time period. Optionally, the preset condition is that the data amount is large enough, for example that a given key appears thousands or tens of thousands of times in total.
The data type correction module 128 is further adapted to generate an alarm signal when the extraction success percentage is lower than a second threshold (e.g., a value in the range of 0.75 to 0.85), so as to trigger data type correction (automatically, or manually by an administrator), and to perform resampling statistics on the data type corresponding to the key in the data source.
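A rough sketch of this monitoring step is given below. The function name, the shape of `stats`, the default second threshold of 0.8 (chosen from within the 0.75 to 0.85 range mentioned above), and the `min_total` cutoff standing in for the "data amount is large enough" condition are all assumptions.

```python
def keys_needing_correction(stats, second_threshold=0.8, min_total=1000):
    """Find keys whose extraction success rate fell below the threshold.

    stats: {(source, key): (successes, failures)} collected over one period
    Returns a list of (source, key, success_rate) tuples to raise alarms for.
    """
    alarms = []
    for (source, key), (successes, failures) in stats.items():
        total = successes + failures
        if total < min_total:
            continue  # preset condition not met: too little data to judge
        rate = successes / total
        if rate < second_threshold:
            alarms.append((source, key, rate))
    return alarms
```

Each returned entry would then trigger resampling statistics for that key in that data source, either automatically or after review by an administrator.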
According to another embodiment of the present invention, the data type correction module 128 is further adapted to trigger the data type analysis module 122 every second predetermined time (e.g., 1 day or 7 days), so that the data type analysis module 122 generates a new data type table from the latest data and, according to the new data type table, reselects from the data types corresponding to each key the data type whose proportion of values exceeds the first threshold as the data type corresponding to the key in the data source, for use in the subsequent data extraction steps.
Based on the above description, the apparatus 120 obtains the data type of the value of each key in each data source through sampling and statistical analysis, and generates a data type table. When data is extracted, the data type of the key is already known, so only the data verification method of that data type needs to be applied; this improves extraction efficiency, while the verification ensures the accuracy of the extracted data.
Moreover, because the meaning of a key may change when the data source end is upgraded or in similar situations, a data type correction function is added, which further improves the accuracy of the data.
It should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
A5, the method of any one of A1-4, wherein the step of verifying the value in the extracted key-value pair using the data verification method for the data type further comprises: the values in the extracted key-value pairs are checked using the regular expression for that data type.
A6, the method of any one of A1-5, further comprising the step of correcting the data type: when a preset condition is met, counting the number of successful extraction and the number of failed extraction of each key in each data source every first preset time, and calculating the extraction success percentage of each key in each data source in the time period; and if the extraction success percentage is lower than a second threshold value, generating an alarm signal to trigger data type correction, and performing resampling statistics on the data type corresponding to the key in the data source.
A7, the method of A6, wherein the step of correcting the data type further comprises: repeating the step of generating the data type table for the latest data every second preset time to generate a new data type table; and according to the new data type table, reselecting the data type of which the corresponding value number exceeds a first threshold value from the data types corresponding to each key as the data type corresponding to the key in the data source so as to execute the subsequent data extraction step.
A8, the method as in any one of A1-7, wherein the data types include: identity, social account, geographic location information, mobile device identification.
A9, the method of any one of A1-8, wherein the first predetermined time is one day; the second predetermined time is seven days or one day.
B15, the device according to any one of B10-14, further comprising a data type correction module, wherein the data type correction module is suitable for counting the number of successful extraction and the number of failed extraction of each key in each data source every first preset time when a preset condition is met, and calculating the extraction success percentage of each key in each data source in the time period; and the data type correction module is also suitable for generating an alarm signal when the extraction success percentage is lower than a second threshold value so as to trigger data type correction and carry out resampling statistics on the data type corresponding to the key in the data source.
B16, the device according to B15, wherein the data type correction module is further adapted to trigger the data type analysis module every second predetermined time, so that the data type analysis module is adapted to generate a new data type table according to the latest data, and according to the new data type table, reselect a data type with the number of corresponding values exceeding the first threshold value from the data types corresponding to each key as the data type corresponding to the key in the data source, so as to perform the subsequent data extraction step.
B17, the apparatus as in any one of B10-16, wherein the data types include: identity, social account, geographic location information, mobile device identification.
B18, the device of any one of B10-17, wherein the first predetermined time is one day; the second predetermined time is seven days or one day.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purpose of carrying out the invention.
As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense, and the scope of the present invention is defined by the appended claims.

Claims (19)

1. A method of extracting data from one or more data sources, each of the one or more data sources comprising a plurality of pieces of data, each piece of data comprising one or more data items in the form of key-value pairs, the data extraction method comprising the steps of:
For each data source in the one or more data sources, determining a data type corresponding to each key, and generating a data type table, wherein the data type table stores the association relationship among the data source, the key and the data type, and the data type is the meaning of a value corresponding to the key;
parsing a piece of data and extracting one or more data items included in the piece of data, for each data item:
Extracting key-value pairs forming the data item, and determining a data type corresponding to the extracted key from the data type table according to a data source of the data; and
verifying the value in the extracted key-value pair by using a data verification method corresponding to the data type, and if the verification is passed, the extraction is successful, and the value in the extracted key-value pair is recorded.
2. The method of claim 1, wherein the step of generating a data type table comprises:
for each of the one or more data sources, sampling data to obtain a first number of pieces of data;
For each piece of data in the first number of pieces of data, analyzing the data piece by piece and extracting all data items;
Analyzing the data type of the value corresponding to the key in the key-value pair in each data item by a regular expression and/or a data verification method to be used as the data type corresponding to the key;
Counting the number of data types corresponding to each key in each data source and the number of values corresponding to the data types; and
Selecting a data type with the corresponding value number ratio exceeding a first threshold value from the data types corresponding to each key, determining the data type as the data type corresponding to the key in the data source, and storing the key and the determined data type in the data source in a correlation manner to serve as a data type table.
3. The method of claim 2, wherein the step of sampling data for each of the one or more data sources comprises:
extracting a first number of pieces of data in each data source; and/or
Randomly sampling a first number of pieces of data in each data source; and/or
A first number of pieces of data are extracted in each data source by time period.
4. The method of claim 2, wherein the value number ratio corresponding to a data type is the ratio of the number of values corresponding to that data type for a given key to the total number of values corresponding to all data types for that key in the data source.
5. The method of claim 4, wherein the verifying the value in the extracted key-value pair using the data-verification method for the data type further comprises:
The values in the extracted key-value pairs are checked using the regular expression for that data type.
6. The method of claim 1, further comprising the step of correcting the data type:
When a preset condition is met, counting the number of successful extraction and the number of failed extraction of each key in each data source every first preset time, and calculating the extraction success percentage of each key in each data source in the first preset time; and
if the extraction success percentage is lower than a second threshold value, generating an alarm signal to trigger data type correction, and performing resampling statistics on the data type corresponding to the key in the data source.
7. The method of claim 6, wherein the step of correcting the data type further comprises:
Repeating the step of generating the data type table for the latest data every second preset time to generate a new data type table;
And according to the new data type table, reselecting the data type of which the corresponding value number exceeds a first threshold value from the data types corresponding to each key as the data type corresponding to the key in the data source so as to execute the subsequent data extraction step.
8. The method of claim 1, wherein the data types comprise: identity, social account, geographic location information, mobile device identification.
9. The method of claim 7, wherein the first predetermined time is one day; the second predetermined time is seven days or one day.
10. An extraction apparatus for extracting data from one or more data sources, each of the one or more data sources including a plurality of pieces of data, each piece of data including one or more data items in the form of key-value pairs, the data extraction apparatus comprising:
The data type analysis module is used for determining a data type corresponding to each key for each data source in the one or more data sources and generating a data type table, wherein the data type table stores the association relation among the data source, the key and the data type, and the data type is the meaning of a value corresponding to the key;
The data extraction module is suitable for analyzing a piece of data and extracting one or more data items included in the piece of data, and is also suitable for extracting a key-value pair forming the data item for each data item;
the data type analysis module is also suitable for determining the data type corresponding to the extracted key from the data type table according to the data source of the piece of data; and
the data verification module is suitable for verifying the value in the extracted key-value pair by using a data verification method corresponding to the data type, and if the verification is passed, the extraction is successful, and the value in the extracted key-value pair is recorded.
11. The apparatus of claim 10, wherein the data type analysis module comprises:
a data sampling unit adapted to sample data for each of the one or more data sources to obtain a first number of pieces of data;
the data extraction unit is suitable for analyzing the data one by one and extracting all data items for each piece of data in the first number of pieces of data;
The data type analysis unit is suitable for analyzing the data type of the value corresponding to the key in the key-value pair in each data item through a regular expression and/or a data verification method to serve as the data type corresponding to the key;
the counting unit is suitable for counting the number of data types corresponding to each key and the number of values corresponding to the data types in each data source;
The data type analysis unit is further adapted to select a data type of which the corresponding number of values exceeds a first threshold from the data types corresponding to each key, determine the data type as the data type corresponding to the key in the data source, and store the key in the data source and the determined data type in an associated manner as a data type table.
12. The apparatus of claim 11, wherein the data sampling unit is further adapted to extract a first number of pieces of data in each data source; and/or further adapted to randomly sample a first number of pieces of data in each data source; and/or further adapted to extract a first number of pieces of data in each data source by time period.
13. The apparatus of claim 11, wherein the number of values corresponding to a data type is expressed as the ratio of the number of values corresponding to that data type for a given key to the total number of values corresponding to all data types for that key in the data source.
14. The apparatus of claim 10, wherein,
The data verification module is further adapted to verify values in the extracted key-value pairs using a regular expression of the data type.
15. The apparatus of claim 10, further comprising a data type correction module,
The data type correction module is suitable for counting the number of successful extraction and the number of failed extraction of each key in each data source every first preset time when a preset condition is met, and calculating the extraction success percentage of each key in each data source in the first preset time; and
the data type correction module is also suitable for generating an alarm signal when the extraction success percentage is lower than a second threshold value so as to trigger data type correction, and resampling and counting the data type corresponding to the key in the data source.
16. The apparatus of claim 15, wherein,
the data type correction module is further suitable for triggering the data type analysis module every second preset time, so that the data type analysis module is suitable for generating a new data type table according to the latest data, and according to the new data type table, reselecting a data type of which the corresponding value number in the data types corresponding to each key exceeds a first threshold value as the data type corresponding to the key in the data source, so as to execute a subsequent data extraction step.
17. The apparatus of claim 10, wherein the data types comprise: identity, social account, geographic location information, mobile device identification.
18. The apparatus of claim 16, wherein the first predetermined time is one day; the second predetermined time is seven days or one day.
19. An extraction system that extracts data from one or more data sources, comprising:
a data acquisition device adapted to acquire data from one or more data sources;
The data extraction apparatus of any one of claims 10-18; and
Data analysis means adapted to analyse the extracted data.
CN201611080168.3A 2016-11-30 2016-11-30 Data extraction method, device and system Active CN106776901B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611080168.3A CN106776901B (en) 2016-11-30 2016-11-30 Data extraction method, device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611080168.3A CN106776901B (en) 2016-11-30 2016-11-30 Data extraction method, device and system

Publications (2)

Publication Number Publication Date
CN106776901A CN106776901A (en) 2017-05-31
CN106776901B true CN106776901B (en) 2019-12-06

Family

ID=58901448

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611080168.3A Active CN106776901B (en) 2016-11-30 2016-11-30 Data extraction method, device and system

Country Status (1)

Country Link
CN (1) CN106776901B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107894973A (en) * 2017-10-30 2018-04-10 武汉华工赛百数据***有限公司 A kind of method for interchanging data and system based on XML
CN109684374B (en) * 2018-11-28 2021-05-25 海南电网有限责任公司信息通信分公司 Method and device for extracting key value pairs of time series data
CN109710651B (en) * 2018-12-25 2020-11-10 成都四方伟业软件股份有限公司 Data type identification method and device
CN111488260B (en) * 2019-01-29 2023-12-08 华为云计算技术有限公司 Data template acquisition method, device, computer equipment and readable storage medium
CN110390208B (en) * 2019-06-26 2023-02-21 联动优势科技有限公司 Optimized data source access method and device for composite data item label
CN110866557B (en) * 2019-11-12 2022-12-13 贵州医渡云技术有限公司 Data evaluation method and device, storage medium and electronic device
CN111753332A (en) * 2020-06-29 2020-10-09 上海通联金融服务有限公司 Method for completing log desensitization in log writing stage based on sensitive information rule

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103870381A (en) * 2012-12-10 2014-06-18 百度在线网络技术(北京)有限公司 Test data generating method and device
CN104809178A (en) * 2015-04-15 2015-07-29 北京科电高技术公司 Write-in method of key/value database memory log
CN104933096A (en) * 2015-05-22 2015-09-23 北京奇虎科技有限公司 Abnormal key recognition method of database, abnormal key recognition device of database and data system

Also Published As

Publication number Publication date
CN106776901A (en) 2017-05-31

Similar Documents

Publication Publication Date Title
CN106776901B (en) Data extraction method, device and system
US10324989B2 (en) Microblog-based event context acquiring method and system
CN102279786B (en) A kind of method of monitoring and measuring application program effective access amount and device
CN112417439A (en) Account detection method, device, server and storage medium
CN106469261B (en) Identity verification method and device
CN111565171B (en) Abnormal data detection method and device, electronic equipment and storage medium
CN109669795B (en) Crash information processing method and device
US20160277259A1 (en) Traffic quality analysis method and apparatus
CN103336766A (en) Short text garbage identification and modeling method and device
CN109063482B (en) Macro virus identification method, macro virus identification device, storage medium and processor
CN110245273B (en) Method for acquiring APP service feature library and corresponding device
CN108023868B (en) Malicious resource address detection method and device
EP2857987A1 (en) Acquiring method, device and system of user behavior
CN110768875A (en) Application identification method and system based on DNS learning
RU2016105654A (en) METHOD AND DEVICE FOR PROCESSING SHORT MESSAGES
CN109064067B (en) Financial risk operation subject determination method and device based on Internet
CN110019762B (en) Problem positioning method, storage medium and server
US9749352B2 (en) Apparatus and method for collecting harmful website information
CN117171650A (en) Document data processing method, system and medium based on web crawler technology
CN108171053B (en) Rule discovery method and system
CN105224415B (en) For the generation method and device of the code for realizing business task
CN105099996B (en) Website verification method and device
CN104794397B (en) Virus detection method and device
KR101557960B1 (en) Device for selecting core kyword, method for selecting core kyword, and method for providing search service using the same
CN112488562B (en) Service realization method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 311501, Unit 1, Building 5, Courtyard 1, Futong East Street, Chaoyang District, Beijing 100102

Applicant after: Beijing Zhichuangyu Information Technology Co., Ltd.

Address before: 100097 Jinwei Building 803, 55 Lanindichang South Road, Haidian District, Beijing

Applicant before: Beijing Knows Chuangyu Information Technology Co.,Ltd.

GR01 Patent grant