CN113961789A - Industry information early warning method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN113961789A
Authority
CN
China
Prior art keywords
information
industry
industry information
data
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010704748.5A
Other languages
Chinese (zh)
Inventor
冯少玉
张伟
刘波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing MetarNet Technologies Co Ltd
Original Assignee
Beijing MetarNet Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing MetarNet Technologies Co Ltd
Priority to CN202010704748.5A
Publication of CN113961789A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The embodiments of the invention provide an industry information early warning method, an industry information early warning device, electronic equipment and a storage medium. The method comprises: acquiring industry information data in a target webpage based on a crawler technology according to a preset strategy; if the industry information data is judged to include target information matched with a preset label, extracting the target information; and sending an early warning message to a user based on the target information, wherein the time difference between the release time of the industry information data and the current time is smaller than a threshold value. By acquiring newly released industry information through the crawler technology and sending an early warning message to the user, the method, device, electronic equipment and storage medium help the user obtain the latest information in a more timely manner.

Description

Industry information early warning method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of big data, in particular to an industry information early warning method, an industry information early warning device, electronic equipment and a storage medium.
Background
Industry information is usually published on internet web pages. The volume of data on industry websites is growing rapidly, and the information is both vast in quantity and scattered in form.
To collect industry-related information, prior-art schemes usually rely on manual entry by staff. On the one hand, manual entry requires staff to actively browse policy websites and to collect and enter the information by hand; the workload is heavy, which drives up the labor cost of enterprises. On the other hand, manual browsing of websites is prone to missing information and to updating it late, so enterprise service support lags behind.
Currently, there is no means of actively and promptly acquiring industry information and issuing early warnings, so the existing way of acquiring industry information needs to be improved.
Disclosure of Invention
The embodiment of the invention provides an industry information early warning method, an industry information early warning device, electronic equipment and a storage medium, which are used for actively and timely acquiring industry information and early warning.
The embodiment of the invention provides an industry information early warning method, which comprises the following steps: according to a preset strategy and based on a crawler technology, industry information data in a target webpage are obtained, if the industry information data are judged and obtained to include target information matched with a preset label, the target information is extracted, and an early warning message is sent to a user based on the target information, wherein the time difference between the release moment of the industry information data and the current moment is smaller than a threshold value.
According to the industry information early warning method of one embodiment of the invention, sending the early warning message to the user based on the target information comprises: adding to the target information the label with which it has a matching relation; and sending the early warning message to the user corresponding to the label.
According to the industry information early warning method of one embodiment of the invention, sending the early warning message to the user corresponding to the label comprises: sending reminding information and/or the target information to the user corresponding to the label.
According to the industry information early warning method of one embodiment of the invention, before judging whether the industry information data includes the target information matched with the preset label, the method further comprises: performing data cleaning on the industry information data to remove abnormal data from the industry information data.
According to the industry information early warning method of one embodiment of the invention, the preset strategy comprises a crawling rule and an extraction rule; correspondingly, before extracting the target information if the industry information data is judged to include target information matched with the preset label and sending the early warning message to the user based on the target information, the method further comprises: modifying the crawling rule and the extraction rule based on the abnormal data to update the preset strategy; and re-acquiring the industry information data of the target webpage based on the crawler technology according to the updated preset strategy.
According to the industry information early warning method of one embodiment of the invention, after removing the abnormal data from the industry information data and before extracting the target information if the industry information data is judged to include target information matched with the preset label and sending the early warning message to the user based on the target information, the method further comprises: performing data structuring processing on the cleaned industry information data.
According to the industry information early warning method of one embodiment of the invention, the preset strategy comprises the URL of the target webpage; correspondingly, according to a preset strategy and based on a crawler technology, before the original data of the target webpage is acquired, the method comprises the following steps: deduplication of the URL of the target web page and/or setting a crawling priority for the URL of the target web page.
The embodiment of the present invention further provides an industry information early warning device, including: the crawling module is used for acquiring industry information data in the target webpage based on a crawler technology according to a preset strategy; the data processing module is used for extracting target information if the industry information data is judged and obtained to include the target information matched with the preset label, and sending an early warning message to a user based on the target information; wherein, the time difference between the release time of the industry information data and the current time is less than a threshold value.
The embodiment of the invention also provides electronic equipment which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor executes the program to realize the steps of any one of the industry information early warning methods.
Embodiments of the present invention further provide a non-transitory computer readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of any one of the above-mentioned industry information early warning methods.
According to the industry information early warning method, the industry information early warning device, the electronic equipment and the storage medium, a large amount of industry information newly released by a related website can be actively obtained through a crawler technology, the traditional manual obtaining means is replaced, the defects caused by manual obtaining are avoided, and the early warning message is sent to a user after the newly released industry information is obtained, so that the user can be helped to obtain the latest industry information more timely, and further the enterprise development is guided.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart illustrating an industry information early warning method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an industry information early warning device according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The industry information warning method according to the embodiment of the invention is described with reference to fig. 1.
As shown in fig. 1, the industry information early warning method provided by the embodiment of the invention includes:
Step S101, acquiring industry information data in a target webpage based on a crawler technology according to a preset strategy.
Specifically, before the industry information data of the target webpage is crawled, a "rule base" can be created in advance. The seed addresses of major industry websites and the keywords of industry information are managed and maintained in the rule base, and the contents of the rule base are updated in real time.
The preset strategy may include the URL (Uniform Resource Locator) of the target webpage to be crawled, a crawling rule, an extraction rule, and the like. The target webpage may be a website that publishes news and information for the relevant industry, or a website that publishes industry-related policies.
In the embodiment of the invention, the URL of each step of the crawling path is configured through XPath or jsoup, and the extraction rules that pull the required industry information data out of the original file are likewise configured through XPath or jsoup.
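As a rough illustration of an XPath-configured extraction rule, the sketch below applies a small rule dictionary to a well-formed sample page using Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath. The rule layout, the field names and the sample markup are assumptions made for this example and do not come from the patent; real web pages are rarely well-formed XML, so a production crawler would likely need a tolerant HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical extraction rule: one XPath selects each news item,
# the others are evaluated relative to that item.
RULE = {
    "item": ".//li[@class='news']",
    "title": "./a",
    "published": "./span[@class='date']",
}

# Hypothetical, well-formed sample page (an assumption for this sketch).
SAMPLE_PAGE = """
<html><body><ul>
  <li class="news"><a href="/policy/1.html">New tariff policy</a><span class="date">2020-07-20</span></li>
  <li class="news"><a href="/policy/2.html">5G rollout report</a><span class="date">2020-07-18</span></li>
  <li class="ad"><a href="/ad">Sponsored</a></li>
</ul></body></html>
"""


def extract_items(page, rule):
    """Apply the extraction rule and return one dict per matched item."""
    root = ET.fromstring(page)
    records = []
    for node in root.findall(rule["item"]):
        link = node.find(rule["title"])
        date = node.find(rule["published"])
        if link is None or date is None:
            continue  # the item does not fit the rule, so skip it
        records.append({
            "title": (link.text or "").strip(),
            "url": link.get("href", ""),
            "published": (date.text or "").strip(),
        })
    return records


items = extract_items(SAMPLE_PAGE, RULE)
```

Keeping the XPath expressions in data rather than in code is one way to let the rules be edited without touching the crawler itself.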
The industry information may include the latest development dynamics of the industry, industry development data, or latest policy information released by the industry.
It should be noted that, in the embodiment of the present invention, the time difference between the release time of the crawled industry information data and the current time is smaller than a threshold. The threshold may be set to an empirical value or to the time difference between the last crawl and the current time. Specifically, when the target webpage is crawled, its data is filtered by release time, and only data whose release time differs from the current time by less than the threshold is crawled. In this way the content of the target webpage is filtered so that only recently released industry information is crawled.
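The release-time condition can be sketched as a simple filter. The function names, the record layout and the sample dates below are assumptions for illustration; rejecting items dated in the future is also an assumption, since the text only requires the difference to be smaller than the threshold.

```python
from datetime import datetime, timedelta


def threshold_from_last_crawl(last_crawl, now):
    """One option from the text: use the time elapsed since the last crawl."""
    return now - last_crawl


def is_fresh(published, now, threshold):
    """True when the item was released less than `threshold` before `now`.

    Items dated in the future are rejected (an assumption of this sketch).
    """
    return timedelta(0) <= now - published < threshold


now = datetime(2020, 7, 21, 12, 0)
last_crawl = datetime(2020, 7, 20, 12, 0)
threshold = threshold_from_last_crawl(last_crawl, now)  # 24 hours

records = [
    {"title": "released this morning", "published": datetime(2020, 7, 21, 9, 0)},
    {"title": "released exactly at last crawl", "published": datetime(2020, 7, 20, 12, 0)},
    {"title": "released last week", "published": datetime(2020, 7, 15, 8, 0)},
]

fresh = [r for r in records if is_fresh(r["published"], now, threshold)]
```

Deriving the threshold from the last crawl time means each run picks up what was published since the previous run, while an empirical fixed value would be independent of the crawl schedule.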
Because the crawled raw industry information is large in volume and includes unstructured data, it can be stored in HDFS (Hadoop Distributed File System).
Step S102, if the industry information data is judged to include target information matched with a preset label, extracting the target information, and sending an early warning message to the user based on the target information.
Specifically, labels corresponding to the industry information are preset, where a label may be a classification category of the industry information or a keyword in which the user is interested.
After the industry information data of the target webpage is crawled, it is matched against the preset labels. If the industry information data includes target information matched with a preset label, the industry information data that includes the preset label is extracted as the target information.
It should be noted that, before matching, semantic analysis may be performed on the crawled content by using a big data technology, and based on the industry information content after the semantic analysis, matching is further performed with the preset tag. Therefore, the target information matched with the preset label can be accurately extracted.
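The matching step can be sketched with plain keyword lookup. This is a deliberate simplification: it stands in for the semantic analysis mentioned above and is not the patent's method. The label names, keyword lists and record fields are hypothetical.

```python
# Hypothetical preset labels, each with the keywords that trigger it.
PRESET_TAGS = {
    "5G": ["5g", "base station"],
    "tariff policy": ["tariff", "pricing"],
}


def match_tags(text, preset_tags):
    """Return the sorted labels whose keywords occur in the text."""
    lowered = text.lower()
    return sorted(
        tag
        for tag, keywords in preset_tags.items()
        if any(keyword in lowered for keyword in keywords)
    )


def extract_target_info(records, preset_tags):
    """Keep only records that match at least one label and attach the labels."""
    targets = []
    for record in records:
        text = record["title"] + " " + record.get("body", "")
        tags = match_tags(text, preset_tags)
        if tags:
            targets.append({**record, "tags": tags})
    return targets


crawled = [
    {"title": "New tariff policy released"},
    {"title": "Operators add 5G base stations", "body": "Pricing unchanged."},
    {"title": "Company picnic announced"},
]
targets = extract_target_info(crawled, PRESET_TAGS)
```

Attaching the matched labels to each record at this point means the later notification step only needs to compare labels, not re-read the text.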
After the target information is extracted, it can be stored in a database because of its large data volume, and an early warning message is sent to the user based on the extracted target information.
Further, the target information can be processed with Spark into early warning data that is easy to query and analyze, and stored in MySQL.
It can be understood that, in the embodiment of the present invention, if it is judged that the industry information data does not include target information matched with the preset label, the preset strategy may be modified by adjusting the crawling rule, the extraction rule, the target webpage URL, and the like, and the industry information data in the target webpage may be re-acquired according to the modified preset strategy.
According to the industry information early warning method provided by the embodiment of the invention, a large amount of industry information newly released by a related website can be actively acquired through a crawler technology, the traditional manual acquisition means is replaced, the defects caused by manual acquisition are avoided, and the early warning message is sent to a user after the newly released industry information is acquired, so that the user can be helped to acquire the latest industry information in time, and further the enterprise development is guided.
Further, based on the content of the foregoing embodiments, sending an early warning message to a user based on target information includes: adding to the target information the label with which it has a matching relation.
After the crawled industry information data is matched against the prestored industry labels, the corresponding label is added to each piece of successfully matched target information.
An early warning message is then sent to the user corresponding to the label.
Specifically, the interest labels of different users may be stored in advance; if the target information includes content corresponding to a user's interest labels, the early warning message is sent to the users corresponding to the labels of the target information.
Because different users are interested in different aspects of industry information, the prestored correspondence between industry labels and users allows the industry information that interests each user to be pushed to that user accurately, which improves the user experience.
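A minimal sketch of routing labelled target information to users follows. The user names, interest sets and the shape of the alert (a list of titles per user) are assumptions for illustration; the patent does not specify a delivery channel or message format.

```python
from collections import defaultdict

# Hypothetical prestored interest labels per user.
USER_INTERESTS = {
    "alice": {"5G"},
    "bob": {"tariff policy", "5G"},
    "carol": {"cloud"},
}


def build_alerts(targets, user_interests):
    """Map each user to the titles whose labels overlap the user's interests."""
    alerts = defaultdict(list)
    for target in targets:
        labels = set(target["tags"])
        for user, interests in user_interests.items():
            if interests & labels:
                alerts[user].append(target["title"])
    return dict(alerts)


labelled = [
    {"title": "Operators add 5G base stations", "tags": ["5G"]},
    {"title": "New tariff policy released", "tags": ["tariff policy"]},
]
alerts = build_alerts(labelled, USER_INTERESTS)
```

Users with no overlapping label, such as `carol` here, receive nothing, which matches the idea of pushing only information of interest.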
Further, based on the content of the foregoing embodiments, sending an early warning message to a user corresponding to a tag includes: and sending reminding information and/or target information to the user corresponding to the label.
Specifically, sending the early warning message may consist of sending a reminding message to the user, who can then view the new industry information to which the reminder relates. Alternatively, the target information of interest may be pushed to the user directly.
Further, based on the content of the above embodiments, before determining whether the industry information data includes the target information matched with the preset tag, the method further includes: and performing data cleaning on the industry information data, and removing abnormal data in the industry information data.
Specifically, the industry information data stored in the HDFS is cleaned through Spark, and the cleaned industry information data is stored as structured data.
It should be noted that data cleaning removes abnormal data from the industry information data. This reduces the amount of computation in the subsequent extraction of target information, which improves the response speed of the industry information early warning method provided by the invention. It also reduces the interference of abnormal data with the subsequent extraction, which increases the accuracy of target information extraction.
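The cleaning step can be sketched in plain Python as a stand-in for the Spark job mentioned above. The patent does not define what counts as abnormal data, so the criteria used here (empty title, unparseable release date, duplicate title) and the `%Y-%m-%d` date format are assumptions. The rejected records are returned as well, since a later embodiment uses the abnormal data to revise the crawling and extraction rules.

```python
from datetime import datetime


def clean_records(records):
    """Split records into (clean, abnormal) under the assumed criteria."""
    clean, abnormal, seen_titles = [], [], set()
    for record in records:
        title = (record.get("title") or "").strip()
        try:
            published = datetime.strptime(record.get("published", ""), "%Y-%m-%d")
        except (TypeError, ValueError):
            published = None  # missing or malformed release date
        if not title or published is None or title in seen_titles:
            abnormal.append(record)
            continue
        seen_titles.add(title)
        # Store the cleaned record in a structured form.
        clean.append({"title": title, "published": published})
    return clean, abnormal


raw = [
    {"title": " New tariff policy ", "published": "2020-07-20"},
    {"title": "", "published": "2020-07-20"},
    {"title": "Broken date", "published": "20/07/2020"},
    {"title": "New tariff policy", "published": "2020-07-20"},
]
clean, abnormal = clean_records(raw)
```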
Further, based on the content of each embodiment, the preset strategy includes a crawling rule and an extraction rule; correspondingly, before extracting the target information if the industry information data is judged to include target information matched with the preset label and sending an early warning message to a user based on the target information, the method further includes: modifying the crawling rule and the extraction rule based on the abnormal data to update the preset strategy; and re-acquiring the industry information data of the target webpage based on the crawler technology according to the updated preset strategy.
Therefore, the crawling rule and the extraction rule are modified based on the abnormal data to update and iterate the preset strategy, and the accuracy of crawling the industry information data can be improved.
Further, based on the content of each embodiment, after removing the abnormal data from the industry information data and before extracting the target information if the industry information data is judged to include target information matched with the preset label and sending an early warning message to a user based on the target information, the method further includes: performing data structuring processing on the cleaned industry information data.
Because the structured data is large in volume, it is stored in Hive.
In this way, data structuring processing of the unstructured or semi-structured industry information data reduces the space occupied when the target information is stored in the database and facilitates subsequent retrieval and analysis of the target information.
Further, based on the content of the above embodiments, the preset policy includes a URL of the target web page; correspondingly, according to a preset strategy and based on a crawler technology, before the original data of the target webpage is acquired, the method comprises the following steps: deduplication of the URL of the target web page and/or setting a crawling priority for the URL of the target web page.
Therefore, the crawling strategy can be optimized, and unnecessary repeated workload is reduced or important target webpages are preferentially crawled.
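URL deduplication and crawl priority can be combined in a small frontier structure, sketched below. The class name, the normalization (dropping the fragment and a trailing slash) and the convention that a lower number means higher priority are all assumptions for this example.

```python
import heapq
from urllib.parse import urldefrag


class UrlFrontier:
    """Queue of URLs to crawl, deduplicated and ordered by priority."""

    def __init__(self):
        self._seen = set()
        self._heap = []
        self._counter = 0  # tie-breaker that keeps insertion order

    def add(self, url, priority=10):
        """Queue a URL; return False if an equivalent URL was already added."""
        normalized = urldefrag(url).url.rstrip("/")
        if normalized in self._seen:
            return False
        self._seen.add(normalized)
        heapq.heappush(self._heap, (priority, self._counter, normalized))
        self._counter += 1
        return True

    def pop(self):
        """Return the queued URL with the lowest priority number."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


frontier = UrlFrontier()
frontier.add("https://example.com/news/", priority=5)
frontier.add("https://example.com/news#top", priority=5)  # duplicate, ignored
frontier.add("https://example.com/policy", priority=1)    # crawled first
first = frontier.pop()
```

Because the seen set persists after a URL is popped, a page that was already crawled is not queued again within the same run.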
The industry information early warning device provided by the embodiment of the present invention is described below; the device described below and the industry information early warning method described above may be referred to in correspondence with each other.
As shown in fig. 2, the industry information early warning apparatus provided in the embodiment of the present invention includes a crawling module 201, configured to obtain industry information data in a target webpage based on a crawler technology according to a preset policy; the data processing module 202 is configured to extract target information if it is determined that the industry information data includes the target information matched with the preset tag, and send an early warning message to a user based on the target information; wherein, the time difference between the release time of the industry information data and the current time is less than a threshold value.
It should be noted that the crawling module 201 may also be configured to set a URL, a crawling rule, an extraction rule, and the like of the target web page to be crawled. In addition, if the target webpage further includes a webpage link URL, the crawling module 201 may further perform content crawling according to the crawled webpage link URL.
It should be noted that the data processing module 202 may also be configured to preset a tag and add a tag having a matching relationship to the target information.
It should be noted that the data processing module 202 may also be configured to send a reminding message and/or the target message to the user corresponding to the tag.
It should be noted that the data processing module 202 may also be configured to perform data cleaning on the crawled industry information data, and remove abnormal data in the industry information data.
It should be noted that the crawling module 201 may also be configured to modify the crawling rule and the extracting rule based on the abnormal data to update the preset policy, and to reacquire the industry information data of the target webpage based on the crawler technology according to the updated preset policy.
It should be noted that the crawling module 201 may also be configured to deduplicate and/or set a crawling priority for the URL of the target web page.
It should be noted that the data processing module 202 may also be configured to perform data structuring processing on the industry information data after data cleaning.
The industry information early warning device provided by the embodiment of the invention can also comprise a storage module 203, which is used for storing industry information data crawled from a target webpage and industry information data after data structuring processing. Specifically, the storage module 203 can be used to store the industry information data in the database in the HDFS.
The industry information early warning device provided by the embodiment of the invention can actively acquire a large amount of industry information newly released by a related website through the crawler technology, replaces the traditional manual acquisition means, avoids the defects caused by manual acquisition, and sends the early warning message to the user after acquiring the newly released industry information, thereby helping the user to acquire the latest industry information in time and further guiding the development of enterprises.
FIG. 3 illustrates the physical structure of an electronic device. As shown in FIG. 3, the electronic device may include: a processor 310, a communication interface 320, a memory 330 and a communication bus 340, wherein the processor 310, the communication interface 320 and the memory 330 communicate with each other via the communication bus 340. The processor 310 may call the logic instructions in the memory 330 to execute the industry information early warning method of the above embodiments, the method including: acquiring industry information data in a target webpage based on a crawler technology according to a preset strategy; if the industry information data is judged to include target information matched with a preset label, extracting the target information; and sending an early warning message to a user based on the target information, wherein the time difference between the release time of the industry information data and the current time is smaller than a threshold value.
In addition, the logic instructions in the memory 330 may be implemented in the form of software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, an embodiment of the present invention further provides a computer program product, where the computer program product includes a computer program stored on a non-transitory computer-readable storage medium, the computer program includes program instructions, and when the program instructions are executed by a computer, the computer can execute the industry information early warning method provided by the above-mentioned method embodiments, where the method includes: according to a preset strategy and based on a crawler technology, industry information data in a target webpage are obtained, if the industry information data are judged and obtained to include target information matched with a preset label, the target information is extracted, and an early warning message is sent to a user based on the target information, wherein the time difference between the release moment of the industry information data and the current moment is smaller than a threshold value.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the industry information early warning method provided in the foregoing embodiments, the method including: acquiring industry information data in a target webpage based on a crawler technology according to a preset strategy; if the industry information data is judged to include target information matched with a preset label, extracting the target information; and sending an early warning message to a user based on the target information, wherein the time difference between the release time of the industry information data and the current time is smaller than a threshold value.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. An industry information early warning method, characterized by comprising:
acquiring industry information data from a target webpage based on a crawler technology according to a preset strategy;
if it is determined that the industry information data includes target information matching a preset label, extracting the target information and sending an early warning message to a user based on the target information;
wherein the time difference between the release time of the industry information data and the current time is smaller than a threshold value.
2. The industry information early warning method of claim 1, wherein sending the early warning message to the user based on the target information comprises:
adding, to the target information, the label with which it has a matching relation;
and sending the early warning message to a user corresponding to the label.
3. The industry information early warning method of claim 2, wherein sending the early warning message to the user corresponding to the label comprises:
sending reminder information and/or the target information to the user corresponding to the label.
4. The industry information early warning method of claim 1, wherein, before determining whether the industry information data includes target information matching the preset label, the method further comprises:
performing data cleaning on the industry information data to eliminate abnormal data from the industry information data.
5. The industry information early warning method of claim 4, wherein the preset strategy includes a crawling rule and an extraction rule;
correspondingly, in connection with the step of extracting the target information and sending an early warning message to the user if it is determined that the industry information data includes target information matching the preset label, the method further comprises:
modifying the crawling rule and the extraction rule based on the abnormal data to update the preset strategy;
and re-acquiring the industry information data of the target webpage based on the crawler technology according to the updated preset strategy.
6. The industry information early warning method of claim 4, wherein, after the abnormal data in the industry information data is eliminated, and in connection with the step of extracting the target information and sending an early warning message to the user if it is determined that the industry information data includes target information matching the preset label, the method further comprises:
performing data structuring on the industry information data after data cleaning.
7. The industry information early warning method of claim 1, wherein the preset strategy includes a URL of the target webpage;
correspondingly, before the original data of the target webpage is acquired based on the crawler technology according to the preset strategy, the method further comprises:
de-duplicating the URL of the target webpage and/or setting a crawling priority for the URL of the target webpage.
8. An industry information early warning device, characterized by comprising:
a crawling module configured to acquire industry information data from a target webpage based on a crawler technology according to a preset strategy;
a data processing module configured to, if it is determined that the industry information data includes target information matching a preset label, extract the target information and send an early warning message to a user based on the target information;
wherein the time difference between the release time of the industry information data and the current time is smaller than a threshold value.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the industry information early warning method of any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the industry information early warning method of any one of claims 1 to 7.
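
By way of illustration only, the steps recited in claims 1 to 4 and 7 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiments: the `Record` fields, the keyword-based label matching, the rule that an empty title or body counts as abnormal data, and all function names are hypothetical and are not specified by the claims. Fetching of webpages by the crawler is left out, so the sketch starts from records that have already been acquired.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Record:
    """One crawled item; the field names are assumptions, not from the claims."""
    url: str
    title: str
    body: str
    published: datetime


def order_urls(urls_with_priority):
    """Claim 7: de-duplicate URLs, then order them by priority (lower = sooner)."""
    best = {}
    for url, priority in urls_with_priority:
        # Keep the most urgent priority seen for each distinct URL.
        if url not in best or priority < best[url]:
            best[url] = priority
    return sorted(best, key=lambda u: (best[u], u))


def clean(records):
    """Claim 4: drop abnormal records (assumed here: empty title or body)."""
    return [r for r in records if r.title.strip() and r.body.strip()]


def is_fresh(record, now, threshold):
    """Claim 1: the release time must be less than `threshold` before `now`."""
    return now - record.published < threshold


def matching_labels(record, label_keywords):
    """Return each preset label that has a keyword in the record text.

    Case-insensitive substring matching is an assumption of this sketch.
    """
    text = f"{record.title} {record.body}".lower()
    return [label for label, keywords in label_keywords.items()
            if any(k.lower() in text for k in keywords)]


def build_warnings(records, label_keywords, subscribers, now, threshold):
    """Claims 1 to 3: return (user, label, record) for fresh, matching records."""
    warnings = []
    for record in clean(records):
        if not is_fresh(record, now, threshold):
            continue
        for label in matching_labels(record, label_keywords):
            # Each label is routed to the users subscribed to that label.
            for user in subscribers.get(label, []):
                warnings.append((user, label, record))
    return warnings


if __name__ == "__main__":
    now = datetime(2020, 7, 21, 12, 0)
    records = [
        Record("https://example.com/a", "New 5G tender announced",
               "Bids open today.", now - timedelta(hours=2)),
        Record("https://example.com/b", "Old 5G notice",
               "Archived.", now - timedelta(days=30)),
    ]
    labels = {"5g": ["5G"], "tender": ["tender", "bid"]}
    subscribers = {"5g": ["alice"], "tender": ["bob"]}
    for user, label, record in build_warnings(
            records, labels, subscribers, now, timedelta(days=1)):
        print(user, label, record.url)
```

In this sketch the freshness check and the label match are kept as separate functions, so that the threshold and the matching rule can each be changed without touching the routing of messages to users.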
CN202010704748.5A 2020-07-21 2020-07-21 Industry information early warning method and device, electronic equipment and storage medium Pending CN113961789A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010704748.5A CN113961789A (en) 2020-07-21 2020-07-21 Industry information early warning method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113961789A (en) 2022-01-21

Family

ID=79459788

Country Status (1)

Country Link
CN (1) CN113961789A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106960063A (en) * 2017-04-20 2017-07-18 广州优亚信息技术有限公司 A kind of internet information crawl and commending system for field of inviting outside investment
CN108063869A (en) * 2017-12-14 2018-05-22 维沃移动通信有限公司 A kind of safe early warning method, mobile terminal
CN110837590A (en) * 2019-10-17 2020-02-25 浙江大搜车软件技术有限公司 Information pushing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination