CN111460253A - Internet data capture method suitable for big data analysis - Google Patents

Info

Publication number
CN111460253A
Authority
CN
China
Prior art keywords
data
information
internet
screening
method suitable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010212831.0A
Other languages
Chinese (zh)
Inventor
相辉
张永力
苏睿清
张弘媛
蔡鹏飞
张静
卢焱
杨青卓
李昊兰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Hebei Public Bidding Co ltd
State Grid Corp of China SGCC
Materials Branch of State Grid Hebei Electric Power Co Ltd
Original Assignee
State Grid Hebei Public Bidding Co ltd
State Grid Corp of China SGCC
Materials Branch of State Grid Hebei Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Hebei Public Bidding Co ltd, State Grid Corp of China SGCC, Materials Branch of State Grid Hebei Electric Power Co Ltd
Priority to CN202010212831.0A
Publication of CN111460253A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/953 Querying, e.g. by the use of web search engines

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an internet data capture method suitable for big data analysis, comprising the following steps: S1, a data acquisition terminal generates application data and transmits the data produced when application software platforms are used and websites are browsed to the server and cloud database of the corresponding manufacturer, or stores the data in a third-party cloud database; S2, the application data stored in the server and the cloud database is screened and analyzed by comparing it with the information stored in a general information base, which constitutes the first step, program screening; and S3, the screening produces three judgment results, the first of which is that information found to be free of doubt after comparison with the general information base is fed back directly to the client through the platform and the website. Because the stored data undergoes both program screening and manual screening, the value of the data is improved and the circulation of fake information is reduced, which benefits the healthy development of the industry.

Description

Internet data capture method suitable for big data analysis
Technical Field
The invention relates to the technical field of internet big data, in particular to an internet data capture method suitable for big data analysis.
Background
Big data is a data set that cannot be captured, managed and processed by conventional software tools within a certain time range; it is a massive, fast-growing and diversified information asset that yields stronger decision-making power, insight and process-optimization capability only under new processing modes. Big data technology has three levels, of which the first is the underlying technology for data management and the second is artificial intelligence technology. The current development of internet big data shows three major trends: the first is "personalization", the second is "intelligence", and the third is "industrialization".
When existing internet big data is captured, the data generally originates from a combination of the internet and the internet of things, so the resulting data is extremely complex and much of it is useless. Existing big data capture mainly aims to serve the industry better and to recommend more accurate products to customers; however, under the influence of massive amounts of useless and interfering data, the correct pushing and safe capture of big data are seriously affected, which is not conducive to the healthy development of the industry. A safe and efficient big data capture method is therefore needed to solve these problems.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing an internet data capture method suitable for big data analysis.
In order to achieve the purpose, the invention adopts the following technical scheme:
The internet data capture method suitable for big data analysis comprises the following steps:
S1, a data acquisition terminal generates application data and transmits the data produced when application software platforms are used and websites are browsed to the server and cloud database of the corresponding manufacturer, or stores the data in a third-party cloud database;
S2, the application data stored in the server and the cloud database is screened and analyzed by comparing it with the information stored in a general information base; this comparison is the first step, program screening;
S3, the screening produces three judgment results: first, information found to be free of doubt after comparison with the general information base is fed back directly to the client through the platform and the website; second, information about which doubt exists proceeds to the next step, manual screening; third, for information that clearly does not comply with the relevant regulations and standards, a corresponding warning is issued or an alarm is raised directly.
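The comparison in step S2 and the three judgment results in step S3 can be illustrated with a minimal Python sketch. The term lists below are hypothetical stand-ins for the general information base, and simple substring matching is an assumption made for illustration; the patent does not specify how the comparison is performed.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "feed back to client"
    REVIEW = "send to manual screening"
    ALERT = "warn or raise alarm"


# Hypothetical stand-ins for the general information base: terms that
# clearly violate regulations, and terms that merely raise doubt.
PROHIBITED_TERMS = {"counterfeit invoice", "stolen card"}
DOUBTFUL_TERMS = {"guaranteed return", "limited offer"}


def program_screen(text: str) -> Verdict:
    """Compare one stored record against the information base (step S2)
    and return one of the three judgment results (step S3)."""
    lowered = text.lower()
    if any(term in lowered for term in PROHIBITED_TERMS):
        return Verdict.ALERT
    if any(term in lowered for term in DOUBTFUL_TERMS):
        return Verdict.REVIEW
    return Verdict.PASS


if __name__ == "__main__":
    samples = [
        "Transformer tender notice",
        "Guaranteed return on deposit",
        "Counterfeit invoice service",
    ]
    for record in samples:
        print(record, "->", program_screen(record).name)
```

Checking the prohibited terms first means a record that matches both lists is treated as a violation rather than being queued for review; that ordering is a design choice of this sketch, not something the text prescribes.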
Preferably, the data acquisition terminal comprises an internet of things terminal, a computer terminal and a handheld terminal. In use, the internet of things terminal mainly generates location information, status information and device information; the computer terminal mainly generates software platform information and IP address information; and the handheld terminal mainly generates software platform information, location information, status information and IP address information.
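One way to picture the three terminal types and the information each mainly generates is a small record type. This is a sketch only: the class and field names (`TerminalKind`, `TerminalRecord`, `missing_fields`) are hypothetical and do not come from the patent.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TerminalKind(Enum):
    IOT = "internet of things terminal"
    COMPUTER = "computer terminal"
    HANDHELD = "handheld terminal"


# Information each terminal kind mainly generates, as listed in the text.
EXPECTED_FIELDS = {
    TerminalKind.IOT: {"location", "status", "device"},
    TerminalKind.COMPUTER: {"software_platform", "ip_address"},
    TerminalKind.HANDHELD: {"software_platform", "location", "status", "ip_address"},
}


@dataclass
class TerminalRecord:
    kind: TerminalKind
    location: Optional[str] = None
    status: Optional[str] = None
    device: Optional[str] = None
    software_platform: Optional[str] = None
    ip_address: Optional[str] = None

    def missing_fields(self) -> set[str]:
        """Return the expected fields that this record left empty."""
        return {
            name
            for name in EXPECTED_FIELDS[self.kind]
            if getattr(self, name) is None
        }
```

For example, `TerminalRecord(TerminalKind.IOT, location="39.9,116.4").missing_fields()` reports that the status and device information are still absent.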
Preferably, the general information base comprises a fraud information base, a hazard information base and a prohibited image information base, and is kept connected to the network by a computer so that information newly appearing around the world is added and updated in a timely manner.
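A general information base with three sub-bases and timely updates could be sketched as follows. The class name, the category keys and the normalization (trimming and lower-casing entries) are assumptions for illustration; how the real base is stored and synchronized over the network is not described in the source.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class GeneralInformationBase:
    fraud: set[str] = field(default_factory=set)
    hazard: set[str] = field(default_factory=set)
    prohibited_images: set[str] = field(default_factory=set)  # e.g. image digests
    last_updated: Optional[datetime] = None

    def update(self, category: str, entries: list[str]) -> int:
        """Merge newly reported entries into one sub-base.

        Returns how many entries were actually new. Raises KeyError for
        an unknown category name.
        """
        target = {
            "fraud": self.fraud,
            "hazard": self.hazard,
            "prohibited_images": self.prohibited_images,
        }[category]
        before = len(target)
        target.update(entry.strip().lower() for entry in entries)
        self.last_updated = datetime.now(timezone.utc)
        return len(target) - before
```

Recording `last_updated` makes it possible to check how stale the base is before screening, which matters if timely updates are a requirement.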
Preferably, the program screening includes two categories: keyword retrieval and sensitive image retrieval.
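The two retrieval categories can be sketched in a few lines of Python. The keyword list and the blocked image bytes are hypothetical, and the image check is deliberately simplified to an exact SHA-256 digest lookup; the source does not say how sensitive images are recognized, and an exact digest only catches byte-identical copies.

```python
from __future__ import annotations

import hashlib
import re

# Hypothetical keyword list standing in for entries of the information base.
KEYWORDS = ["wire transfer", "prize"]
_KEYWORD_RE = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def find_keywords(text: str) -> list[str]:
    """Return the distinct listed keywords found in the text, sorted."""
    return sorted({match.group(0).lower() for match in _KEYWORD_RE.finditer(text)})


def image_digest(data: bytes) -> str:
    """Hex SHA-256 digest of the raw image bytes."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical prohibited image base, stored as digests rather than images.
BLOCKED_DIGESTS = {image_digest(b"example-banned-image-bytes")}


def is_sensitive_image(data: bytes) -> bool:
    """Exact-match lookup of an image against the prohibited digests."""
    return image_digest(data) in BLOCKED_DIGESTS
```

Escaping each keyword with `re.escape` keeps punctuation in a keyword from being read as regular-expression syntax.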
Preferably, the data acquisition terminal performs data analysis and retrieval using a page parser, a crawling strategy search technology, a main body crawler technology, a link correlation estimation technology, a content correlation calculation technology, a dynamic Web page acquisition technology, a dynamic page classification technology, a microblog information content acquisition technology and a deep Web data acquisition technology.
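Two of the listed techniques, link correlation estimation and content correlation calculation, are commonly combined in a crawler that visits the most promising links first. The sketch below is one possible reading, not the patent's implementation: it scores anchor text and page text against a topic with cosine similarity over term counts, and it crawls a small hypothetical in-memory link graph (`PAGES`) instead of the real web.

```python
from __future__ import annotations

import heapq
import math
import re
from collections import Counter


def term_vector(text: str) -> Counter:
    """Bag-of-words term counts for a piece of text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical in-memory "web": url -> (page text, [(anchor text, target url)]).
PAGES = {
    "seed": (
        "power grid procurement portal",
        [("transformer bidding notice", "a"), ("staff canteen menu", "b")],
    ),
    "a": ("transformer bidding notice for grid equipment procurement", []),
    "b": ("canteen menu for this week", []),
}


def crawl(seed: str, topic: str, limit: int = 10) -> list[tuple[str, float]]:
    """Visit pages in order of estimated link relevance.

    Returns (url, content relevance) pairs in visiting order.
    """
    topic_vec = term_vector(topic)
    frontier = [(-1.0, seed)]  # max-heap emulated with negated scores
    seen = {seed}
    visited = []
    while frontier and len(visited) < limit:
        _, url = heapq.heappop(frontier)
        text, links = PAGES[url]
        visited.append((url, round(cosine(term_vector(text), topic_vec), 3)))
        for anchor, target in links:
            if target not in seen:
                seen.add(target)
                # Link relevance is estimated from the anchor text alone.
                score = cosine(term_vector(anchor), topic_vec)
                heapq.heappush(frontier, (-score, target))
    return visited
```

Calling `crawl("seed", "grid equipment bidding procurement")` visits the bidding notice before the canteen menu, because its anchor text shares a term with the topic.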
Preferably, the data acquisition terminal clearly distinguishes and classifies data sources and determines target data and root data, where the target data comes from individual clients and the root data comes from enterprise clients; during data feedback, the data of individual clients is fed back to the enterprise clients.
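The source classification and feedback direction can be sketched as below. The `Record` type and its fields are hypothetical, and the routing rule (every enterprise client receives all individual-client data) is an assumption, since the text does not say how individual data is matched to a particular enterprise.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    client_id: str
    client_type: str  # "individual" or "enterprise"
    payload: str


def classify(record: Record) -> str:
    """Label a record as target data (individual) or root data (enterprise)."""
    if record.client_type == "individual":
        return "target"
    if record.client_type == "enterprise":
        return "root"
    raise ValueError(f"unknown client type: {record.client_type!r}")


def feedback_plan(records: list[Record]) -> dict[str, list[str]]:
    """Map each enterprise client id to the individual payloads fed back to it."""
    enterprises = [r.client_id for r in records if classify(r) == "root"]
    target_payloads = [r.payload for r in records if classify(r) == "target"]
    return {enterprise: list(target_payloads) for enterprise in enterprises}
```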
Preferably, during manual screening, professionally trained official platform staff review the doubtful information; information judged to be free of doubt is fed back directly to the client, and if the information is judged not to comply with the relevant regulations and standards, a corresponding warning is issued or an alarm is raised directly.
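The manual screening step amounts to draining a queue of doubtful items and splitting them by a human decision. In this sketch the reviewer is modeled as a callable that returns `True` when an item is free of doubt; the function name and that interface are assumptions for illustration.

```python
from __future__ import annotations

from collections import deque
from typing import Callable, Iterable


def manual_screen(
    doubtful_items: Iterable[str],
    reviewer: Callable[[str], bool],
) -> tuple[list[str], list[str]]:
    """Process doubtful items in arrival order.

    Returns (items to feed back to the client, items that trigger a
    warning or alarm).
    """
    queue = deque(doubtful_items)
    feedback: list[str] = []
    warnings: list[str] = []
    while queue:
        item = queue.popleft()
        if reviewer(item):
            feedback.append(item)
        else:
            warnings.append(item)
    return feedback, warnings
```

In practice the reviewer would be a person working through a review interface; passing a callable keeps the queue logic testable without one.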
According to the invention, the data of the data acquisition terminal is acquired and stored using the page parser, the crawling strategy search technology, the main body crawler technology, the link correlation estimation technology, the content correlation calculation technology, the dynamic Web page acquisition technology, the dynamic page classification technology, the microblog information content acquisition technology and the deep Web data acquisition technology. The stored data then undergoes program screening and manual screening, which improves the value of the data, reduces the circulation of fake and fraudulent information, and benefits the healthy development of the industry.
Drawings
FIG. 1 is a data capture and feedback flow chart of the internet data capture method suitable for big data analysis provided by the invention;
FIG. 2 is a data analysis flow chart of the internet data capture method suitable for big data analysis provided by the invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below. The described embodiments are only some, not all, of the embodiments of the present invention.
In the present embodiment, the internet data capture method suitable for big data analysis comprises the following steps:
S1, a data acquisition terminal generates application data and transmits the data produced when application software platforms are used and websites are browsed to the server and cloud database of the corresponding manufacturer, or stores the data in a third-party cloud database;
S2, the application data stored in the server and the cloud database is screened and analyzed by comparing it with the information stored in a general information base; this comparison is the first step, program screening;
S3, the screening produces three judgment results: first, information found to be free of doubt after comparison with the general information base is fed back directly to the client through the platform and the website; second, information about which doubt exists proceeds to the next step, manual screening; third, for information that clearly does not comply with the relevant regulations and standards, a corresponding warning is issued or an alarm is raised directly.
In this embodiment, the data acquisition terminal includes an internet of things terminal, a computer terminal and a handheld terminal. In use, the internet of things terminal mainly generates location information, status information and device information; the computer terminal mainly generates software platform information and IP address information; and the handheld terminal mainly generates software platform information, location information, status information and IP address information.
In this embodiment, the general information base includes a fraud information base, a hazard information base and a prohibited image information base, and is kept connected to the network by a computer so that information newly appearing around the world is added and updated in a timely manner.
In this embodiment, the program screening includes two categories: keyword retrieval and sensitive image retrieval.
In this embodiment, the data acquisition terminal performs data analysis and retrieval using a page parser, a crawling strategy search technology, a main body crawler technology, a link correlation estimation technology, a content correlation calculation technology, a dynamic Web page acquisition technology, a dynamic page classification technology, a microblog information content acquisition technology and a deep Web data acquisition technology.
In this embodiment, the data acquisition terminal clearly distinguishes and classifies data sources and determines target data and root data, where the target data comes from individual clients and the root data comes from enterprise clients; during data feedback, the data of individual clients is fed back to the enterprise clients.
In this embodiment, during manual screening, professionally trained official platform staff review the doubtful information; information judged to be free of doubt is fed back directly to the client, and if the information is judged not to comply with the relevant regulations and standards, a corresponding warning is issued or an alarm is raised directly.
The above description covers only the preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art within the technical scope disclosed by the present invention, according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.

Claims (7)

1. The internet data capture method suitable for big data analysis is characterized by comprising the following steps:
S1, a data acquisition terminal generates application data and transmits the data produced when application software platforms are used and websites are browsed to the server and cloud database of the corresponding manufacturer, or stores the data in a third-party cloud database;
S2, the application data stored in the server and the cloud database is screened and analyzed by comparing it with the information stored in a general information base; this comparison is the first step, program screening;
S3, the screening produces three judgment results: first, information found to be free of doubt after comparison with the general information base is fed back directly to the client through the platform and the website; second, information about which doubt exists enters manual screening; third, for information that clearly does not comply with the relevant laws, regulations or standards, a corresponding warning is issued or an alarm is raised directly.
2. The internet data capture method suitable for big data analysis according to claim 1, wherein the data acquisition terminal comprises an internet of things terminal, a computer terminal and a handheld terminal; in use, the internet of things terminal generates location information, status information and device information, the computer terminal generates software platform information and IP address information, and the handheld terminal generates software platform information, location information, status information and IP address information.
3. The internet data capture method suitable for big data analysis according to claim 1, wherein the general information base comprises a fraud information base, a hazard statement information base and a prohibited image information base, and the general information base is kept connected to the network by a computer so that information newly appearing around the world is added and updated in a timely manner.
4. The internet data capture method suitable for big data analysis according to claim 1, wherein the program screening includes two categories: keyword retrieval and sensitive image retrieval.
5. The internet data capture method suitable for big data analysis according to claim 1, wherein the data acquisition terminal performs data analysis and retrieval using a page parser, a crawling strategy search technology, a main body crawler technology, a link correlation estimation technology, a content correlation calculation technology, a dynamic Web page acquisition technology, a dynamic page classification technology, a microblog information content acquisition technology and a deep Web data acquisition technology.
6. The internet data capture method suitable for big data analysis according to claim 1, wherein the data acquisition terminal clearly distinguishes and classifies data sources and determines target data and root data, the target data coming from individual clients and the root data coming from enterprise clients, and the data of individual clients is fed back to the enterprise clients during data feedback.
7. The internet data capture method suitable for big data analysis according to claim 1, wherein during manual screening, professionally trained official platform staff review the doubtful information; information judged to be free of doubt is fed back directly to the client, and if the information is judged not to comply with the relevant regulations and standards, a corresponding warning is issued or an alarm is raised directly.
Application CN202010212831.0A, priority date 2020-03-24, filing date 2020-03-24: Internet data capture method suitable for big data analysis. Status: Pending. Publication: CN111460253A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010212831.0A CN111460253A (en) 2020-03-24 2020-03-24 Internet data capture method suitable for big data analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010212831.0A CN111460253A (en) 2020-03-24 2020-03-24 Internet data capture method suitable for big data analysis

Publications (1)

Publication Number Publication Date
CN111460253A true CN111460253A (en) 2020-07-28

Family

ID=71685700

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010212831.0A Pending CN111460253A (en) 2020-03-24 2020-03-24 Internet data capture method suitable for big data analysis

Country Status (1)

Country Link
CN (1) CN111460253A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113064947A (en) * 2021-04-08 2021-07-02 深圳石方数链科技有限公司 Customer data protection system based on customer management system

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102208992A (en) * 2010-06-13 2011-10-05 天津海量信息技术有限公司 Internet-facing filtration system of unhealthy information and method thereof
CN104063448A (en) * 2014-06-18 2014-09-24 华东师范大学 Distributed type microblog data capturing system related to field of videos
GB201507530D0 (en) * 2015-05-01 2015-06-17 Salesoptimize Ltd Computer-implemented methods of website analysis
CN105117484A (en) * 2015-09-17 2015-12-02 广州银讯信息科技有限公司 Internet public opinion monitoring method and system
CN105893368A (en) * 2014-11-19 2016-08-24 北京航天长峰科技工业集团有限公司 Multilingual online public opinion analysis method
CN106960063A (en) * 2017-04-20 2017-07-18 广州优亚信息技术有限公司 A kind of internet information crawl and commending system for field of inviting outside investment
CN109063054A (en) * 2018-07-19 2018-12-21 天津迈基生物科技有限公司 A kind of machine learning and big data processing system
CN109255063A (en) * 2018-08-01 2019-01-22 宜人恒业科技发展(北京)有限公司 A kind of method and apparatus crawling web page contents
CN110321471A (en) * 2019-04-19 2019-10-11 四川政资汇智能科技有限公司 A kind of internet techno-financial intelligent Matching method based on the convergence of policy resource

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Xiang Hui, Zhang Yongli, Su Ruiqing, Zhang Hongyuan, Cai Pengfei, Zhang Jing, Lu Yan, Yang Qingzhuo, Li Haolan
Inventor before: Xiang Hui, Zhang Yongli, Su Ruiqing, Zhang Hongyuan, Cai Pengfei, Zhang Jing, Lu Yan, Yang Qingzhuo, Li Haolan
RJ01 Rejection of invention patent application after publication
Application publication date: 20200728