CN116166867A - Content filtering method, device, equipment and storage medium for network acquisition

Info

Publication number: CN116166867A
Application number: CN202310023074.6A
Authority: CN
Inventors: 胡俊华; 李廷威; 肖运龙
Original assignee: Guangdong Infinite Information Technology Co ltd
Current assignee: Guangdong Infinite Information Technology Co ltd
Priority date: 2023-01-05
Filing date: 2023-01-05
Publication date: 2023-05-26

Abstract

The invention provides a content filtering method, device, equipment and storage medium for network acquisition. The method comprises the following steps: extracting a target URL link from a URL queue to be crawled, and downloading the target webpage corresponding to the target URL link; extracting target content from the target webpage based on a preset extraction rule; filtering the target content based on a preset filtering rule, and storing the screened content obtained after filtering into a database; and repeating all of the above steps until all URL links in the URL queue to be crawled have been extracted. Because filtering rules can be configured, the invention can intelligently analyze and filter the extracted webpage content, and effectively improves the convenience of collecting and organizing webpage content.

Description

Content filtering method, device, equipment and storage medium for network acquisition

Technical Field

The present invention relates to the field of content filtering technologies, and in particular, to a content filtering method, device, equipment and storage medium for network acquisition.

Background

In projects such as intelligent semantic knowledge graphs, content acquired from the network often needs to be filtered and screened, and sensitive content needs to be intercepted. At present, in the early stage of a project the collected content is screened manually to remove advertisements, duplicates and sensitive information, which consumes a large amount of manpower and material resources and gives poor results.

Traditional network acquisition can only mechanically collect the specified webpage content according to the acquisition rules. It cannot judge whether the content is actually wanted, or whether it contains advertisement information or sensitive information. As a result, the quality of the collected content is very low, and later classification and organization of the content are inconvenient. A method for intelligently analyzing and filtering content collected from the network is therefore needed.

Disclosure of Invention

The invention aims to provide a content filtering method, device, equipment and storage medium for network acquisition that solve the above technical problems, so that content acquired from the network can be intelligently analyzed and filtered, and content acquisition and organization become more convenient.

In order to solve the above technical problems, the present invention provides a content filtering method for network acquisition, including:

extracting a target URL link from a URL queue to be crawled, and downloading a target webpage corresponding to the target URL link;

extracting target content from the target webpage based on a preset extraction rule;

filtering the target content based on a preset filtering rule, and storing the screened content obtained after filtering into a database;

and repeating all of the above steps until all URL links in the URL queue to be crawled have been extracted.

Further, before the steps are executed repeatedly until all URL links in the URL queue to be crawled have been extracted, the method further includes:

adding the currently crawled URL link to a preset crawled URL queue;

determining the update period of each URL link based on the page type corresponding to each URL link in the crawled URL queue;

and extracting the corresponding URL links from the crawled URL queue according to the update period of each URL link, and adding them to the URL queue to be crawled.

Further, before the steps are executed repeatedly until all URL links in the URL queue to be crawled have been extracted, the method further includes:

parsing the target webpage and extracting all sub URL links in it;

and filtering all sub URL links based on a preset address filtering rule, and adding the target sub URL links obtained after filtering to the URL queue to be crawled.

Further, the filtering of all sub URL links based on the preset address filtering rule includes:

filtering out, based on the preset address filtering rule, the blacklist links, the advertisement links and the links that duplicate entries in the preset crawled URL queue from all sub URL links, to obtain a plurality of target sub URL links.

Further, the filtering of the target content based on the preset filtering rule includes:

detecting the repetition rate between the target content and the content in the database, and filtering out the content in the target content whose repetition rate exceeds a preset threshold.

Further, the filtering of the target content based on the preset filtering rule includes:

performing advertisement word recognition and sensitive word recognition on the target content according to a preset recognition rule, and filtering out the advertisement words and sensitive words in the target content.

Further, storing the screened content obtained after filtering in a database includes:

for the screened content obtained after filtering, binding the corresponding URL link, the text content, the pre-extracted feature tags and the picture resource MD5 digests together and storing them in a relational database.

The invention also provides a content filtering device for network acquisition, which comprises:

the webpage downloading module is used for extracting a target URL link from the URL queue to be crawled and downloading a target webpage corresponding to the target URL link;

the content extraction module is used for extracting target content from the target webpage based on a preset extraction rule;

the content filtering module is used for filtering the target content based on a preset filtering rule and storing the screened content obtained after filtering into a database;

and the loop execution module is used for executing all of the above steps repeatedly until all URL links in the URL queue to be crawled have been extracted.

The invention also provides a terminal device comprising a processor and a memory storing a computer program, wherein the processor implements any content filtering method for network acquisition when executing the computer program.

The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the content filtering method for network acquisition of any one of the above.

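Compared with the prior art, the invention has the following beneficial effect: filtering rules can be set to intelligently analyze and filter the extracted webpage content, which effectively improves the convenience of collecting and organizing webpage content.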

Drawings

FIG. 1 is a schematic flow chart of a content filtering method for network acquisition according to the present invention;

FIG. 2 is a second flow chart of a content filtering method for network acquisition according to the present invention;

FIG. 3 is a schematic structural diagram of a content filtering device for network acquisition according to the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to FIG. 1, an embodiment of the present invention provides a content filtering method for network acquisition, which may include the steps of:

S1, extracting a target URL link from a URL queue to be crawled, and downloading a target webpage corresponding to the target URL link;

S2, extracting target content from the target webpage based on a preset extraction rule;

S3, filtering the target content based on a preset filtering rule, and storing the screened content obtained after filtering into a database;

S4, executing all of the above steps repeatedly until all URL links in the URL queue to be crawled have been extracted.

In the embodiment of the invention, a URL queue to be crawled is set up in advance to store URL links, and a user can add the seed webpage URL links to be crawled to this queue. Step S1 then takes one URL link out of the URL queue to be crawled at a time (for example, in first-in first-out order), step S2 extracts the content, and step S3 filters it. Steps S1 to S3 are repeated until all URL links in the URL queue to be crawled have been extracted. Because filtering rules can be configured, the invention can intelligently analyze and filter the extracted webpage content, and effectively improves the convenience of collecting and organizing webpage content.
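The loop formed by steps S1 to S4 can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: the function name `crawl` and its parameters `download`, `extract`, `keep` and `store` are hypothetical stand-ins for the download step, the preset extraction rule, the preset filtering rule and the database write, and are passed in so that the loop itself runs without network access.

```python
from collections import deque
from typing import Callable, Iterable, Optional


def crawl(seed_urls: Iterable[str],
          download: Callable[[str], str],
          extract: Callable[[str], str],
          keep: Callable[[str], Optional[str]],
          store: Callable[[str, str], None]) -> int:
    """Run steps S1 to S4 until the URL queue to be crawled is empty.

    Returns the number of pages whose screened content was stored.
    All four callables are assumptions made for this sketch.
    """
    to_crawl = deque(seed_urls)       # URL queue to be crawled (first-in first-out)
    stored = 0
    while to_crawl:                   # S4: repeat until every URL has been taken out
        url = to_crawl.popleft()      # S1: take one target URL link ...
        page = download(url)          # ... and download the target webpage
        content = extract(page)       # S2: apply the extraction rule
        screened = keep(content)      # S3: apply the filtering rule
        if screened:                  # None or "" means nothing survived filtering
            store(url, screened)
            stored += 1
    return stored
```

In this sketch a page whose content is entirely filtered out is simply skipped; whether such pages should still be recorded is a design choice the text leaves open.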

In an embodiment of the present invention, further, before step S4, the method further includes:

adding the currently crawled URL links to a preset crawled URL queue;

It should be noted that a crawled URL queue may be set up. After the content of a web page has been extracted, the URL link of that web page is added to the crawled URL queue, and an update period is determined for the URL link according to the type of the web page; for example, a shorter update period is configured for the URL links of websites such as current-affairs news. Each URL link is then re-added from the crawled URL queue to the URL queue to be crawled according to its configured update period, so that web pages with higher timeliness requirements are crawled repeatedly. The update period of a URL link may also be set to no period, meaning that the web page is a one-off page that is not crawled again.

analyzing and extracting all sub URL links in the target webpage;

It should be noted that, for the target web page being crawled, sub URL links may be extracted from the target web page, a corresponding filtering rule may be set to filter the extracted sub URL links, and the filtered sub URL links may be added to the URL queue to be crawled.
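One possible way to parse the sub URL links out of a downloaded page is sketched below using only the Python standard library. The names `LinkCollector` and `extract_sub_urls` are hypothetical; the embodiment does not prescribe a parser. Relative links are resolved against the page URL, fragments are dropped, and non-HTTP links are ignored, all of which are assumptions of this sketch.

```python
from html.parser import HTMLParser
from typing import List, Set
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag seen while parsing."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_sub_urls(page_url: str, html: str) -> List[str]:
    """Return absolute http(s) links found in html, in document order, without repeats."""
    collector = LinkCollector()
    collector.feed(html)
    seen: Set[str] = set()
    result: List[str] = []
    for href in collector.hrefs:
        # Resolve relative links, then drop any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute.startswith(("http://", "https://")) and absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result
```

The returned list would then be passed through the address filtering rule before anything is added to the URL queue to be crawled.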

In the embodiment of the present invention, further, the filtering processing on all sub URL links based on the preset address filtering rule includes:

It should be noted that, the filtering process of the sub URL link may include blacklist filtering, advertisement link filtering, repeated extraction filtering, and the like.
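The three address filters named above can be sketched as a single pass over the extracted sub URL links. The rule data (`BLACKLIST_HOSTS`, `AD_MARKERS`) and the function name `filter_sub_urls` are hypothetical examples; the text does not say how blacklist or advertisement links are recognized, so simple host and substring matching is assumed here.

```python
from typing import Iterable, List, Set
from urllib.parse import urlparse

# Hypothetical rule data; a real deployment would load these from configuration.
BLACKLIST_HOSTS = {"blocked.example"}
AD_MARKERS = ("/ads/", "adclick", "utm_campaign=")


def filter_sub_urls(sub_urls: Iterable[str], crawled: Set[str]) -> List[str]:
    """Return the target sub URL links that pass all three address filters."""
    seen = set(crawled)               # copy, so the caller's crawled set is not modified
    kept: List[str] = []
    for url in sub_urls:
        host = urlparse(url).hostname or ""
        if host in BLACKLIST_HOSTS:                      # blacklist filtering
            continue
        if any(marker in url for marker in AD_MARKERS):  # advertisement link filtering
            continue
        if url in seen:                                  # repeated extraction filtering
            continue
        seen.add(url)
        kept.append(url)
    return kept
```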

In the embodiment of the present invention, further, the filtering processing of the target content based on the preset filtering rule includes:

It should be noted that feature tags are extracted for the content of each web page (the content of one web page usually corresponds to a plurality of tags). The feature tags of the currently crawled web page content are compared with the feature tags of the web page content already stored in the database, and the repetition rate is calculated; when the repetition rate exceeds the set value, the content corresponding to the tags is filtered out.
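The text does not state how the repetition rate is calculated from the feature tags. As one plausible reading, the sketch below assumes the Jaccard similarity of two tag sets (shared tags divided by all distinct tags); the function names and the default threshold of 0.8 are likewise assumptions made only for illustration.

```python
from typing import Iterable, Set


def repetition_rate(tags: Set[str], stored_tags: Set[str]) -> float:
    """Jaccard similarity of two tag sets (an assumed metric), in [0.0, 1.0]."""
    if not tags and not stored_tags:
        return 0.0
    return len(tags & stored_tags) / len(tags | stored_tags)


def is_duplicate(tags: Set[str],
                 database: Iterable[Set[str]],
                 threshold: float = 0.8) -> bool:
    """True if any stored tag set repeats the new one beyond the threshold."""
    return any(repetition_rate(tags, stored) > threshold for stored in database)
```

For example, the tag sets `{"a", "b", "c"}` and `{"b", "c", "d"}` share two of four distinct tags, giving a repetition rate of 0.5 under this metric. Comparing against every stored tag set is linear in the size of the database; a production system would likely need an index.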

It should be noted that advertisement word lists and sensitive word lists can be set according to requirements and used as the basis for filtering after the web page content has been extracted.
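A word-list filter of this kind can be sketched with a single compiled regular expression. The word lists below are hypothetical placeholders, and plain case-insensitive substring matching is assumed; the word segmentation mentioned elsewhere in the description is not reproduced here.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical word lists; real lists would be configured per project.
AD_WORDS = ["limited-time offer", "click to buy"]
SENSITIVE_WORDS = ["forbidden-term"]


def filter_words(text: str, words: Iterable[str]) -> Tuple[str, List[str]]:
    """Remove every listed word from text; return (cleaned_text, matched_words)."""
    # Longest words first, so a longer entry wins over a shorter one it contains.
    ordered = sorted({w for w in words if w}, key=len, reverse=True)
    if not ordered:
        return text, []
    pattern = re.compile("|".join(re.escape(w) for w in ordered), re.IGNORECASE)
    hits = [match.group(0) for match in pattern.finditer(text)]
    cleaned = pattern.sub("", text)
    # Collapse the double spaces left behind by removed words.
    cleaned = re.sub(r"[ \t]{2,}", " ", cleaned).strip()
    return cleaned, hits
```

Returning the matched words alongside the cleaned text lets the caller decide whether a page with too many hits should be dropped entirely rather than stored in cleaned form.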

In the embodiment of the present invention, further, the storing the filtered content obtained after the filtering processing in the database includes:

After the web page content has been extracted and filtered, the URL link, the text content, the feature tags, the picture resource MD5 digests and other items can be bound together and stored in a relational database, which facilitates later classification, organization, search and query.
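A minimal sketch of such a binding, assuming SQLite as the relational database, is given below. The table and column names (`page`, `page_tag`, `page_image`) and the helper names `open_store` and `save_page` are hypothetical; the embodiment does not specify a schema.

```python
import sqlite3
from typing import Iterable

# Assumed schema: one row per page, keyed by URL, with tags and picture
# MD5 digests in child tables so that each can be queried on its own.
SCHEMA = """
CREATE TABLE IF NOT EXISTS page (
    url  TEXT PRIMARY KEY,
    body TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS page_tag (
    url TEXT NOT NULL REFERENCES page(url),
    tag TEXT NOT NULL,
    PRIMARY KEY (url, tag)
);
CREATE TABLE IF NOT EXISTS page_image (
    url TEXT NOT NULL REFERENCES page(url),
    md5 TEXT NOT NULL,
    PRIMARY KEY (url, md5)
);
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def save_page(conn: sqlite3.Connection, url: str, body: str,
              tags: Iterable[str], image_md5s: Iterable[str]) -> None:
    """Store or refresh one page and its bound tags and picture digests."""
    with conn:  # a single transaction: commit on success, roll back on error
        conn.execute("INSERT OR REPLACE INTO page (url, body) VALUES (?, ?)", (url, body))
        conn.execute("DELETE FROM page_tag WHERE url = ?", (url,))
        conn.execute("DELETE FROM page_image WHERE url = ?", (url,))
        conn.executemany("INSERT OR IGNORE INTO page_tag (url, tag) VALUES (?, ?)",
                         [(url, tag) for tag in tags])
        conn.executemany("INSERT OR IGNORE INTO page_image (url, md5) VALUES (?, ?)",
                         [(url, digest) for digest in image_md5s])
```

Replacing the child rows on every save keeps a re-crawled page consistent with its latest content, which matters when the same URL is crawled again on an update period.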

Based on the above scheme, and to aid understanding of the content filtering method for network acquisition provided by the embodiment of the present invention, a detailed description follows:

Referring to FIG. 2, the workflow of the embodiment of the present invention mainly includes: first, an initial set of URLs to be crawled is formed from the URLs of the seed webpages; the URLs in the set are read in sequence and the corresponding pages are downloaded from the Internet; the content to be acquired is extracted according to rules; a series of intelligent screening operations such as de-duplication, advertisement filtering and content filtering is performed on the content; the filtered result is stored in a database; the currently read URL is put into the crawled URL queue; and the webpage is parsed to obtain new URL links, which are put into the set to be crawled.

It should be noted that, in many application scenarios, crawling a web page only once is not enough, and its content needs to be refreshed periodically. This requires periodically updating the set of acquired URLs, or setting finer-grained update plans for different types of web pages. For example, news websites have high requirements on the timeliness of their content, so the crawling frequency for this type of website can be increased and the update period shortened. As the whole process cycles continuously, data on the Internet is continuously collected, filtered and stored.
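The per-type update plan can be sketched as a lookup from page type to update period, followed by a check of which crawled URLs are due again. The page types, the periods and the name `due_for_recrawl` are hypothetical values chosen for illustration; a period of `None` models the "no period" case of a page that is crawled only once.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple

# Hypothetical page types and update periods; None means "never crawl again".
UPDATE_PERIODS: Dict[str, Optional[timedelta]] = {
    "news": timedelta(hours=1),
    "blog": timedelta(days=1),
    "static": None,
}


def due_for_recrawl(crawled: Dict[str, Tuple[str, datetime]],
                    now: datetime) -> List[str]:
    """Return the crawled URLs whose update period has elapsed.

    crawled maps each URL to (page_type, time_of_last_crawl).
    """
    due: List[str] = []
    for url, (page_type, last_crawled) in crawled.items():
        period = UPDATE_PERIODS.get(page_type)   # unknown types are treated as one-off
        if period is not None and now - last_crawled >= period:
            due.append(url)
    return due
```

The URLs returned here are the ones that would be moved from the crawled URL queue back into the URL queue to be crawled.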

By way of example, embodiments of the present invention may be implemented according to the steps of example 1 below:

1. Inputting a seed webpage URL and putting it into the queue to be crawled;

2. The crawler reads a webpage URL and the related request information it needs from the URL queue to be crawled;

3. Downloading the target webpage corresponding to the URL, and acquiring the target content of the target webpage;

4. Extracting the relevant information from the target content, identifying the content to be filtered out (duplicates, advertisements, sensitive content and the like) through intelligent screening, and removing it to obtain the screened content;

5. Storing the screened content in the database, and recording the URL in the crawled URL queue;

6. Parsing all URL links in the original target webpage, filtering out URL addresses that have already been crawled, performing blacklist filtering, advertisement link filtering and other processing at the same time, and adding the target URL addresses that remain to the URL queue to be crawled;

7. Repeating the above steps to collect the content of the whole website, until there are no new URLs in the queue to be crawled;

8. On this basis, a URL in the crawled URL queue with a relatively high timeliness requirement can be added to the URL queue to be crawled again after a certain time, so that content extraction and filtering are performed on such URLs at a certain period.

By way of example, embodiments of the present invention may be implemented according to the steps of example 2 below:

1. Constructing a queue that can store target webpage URLs (Uniform Resource Locators) as the source queue to be crawled by the crawler;

2. Taking a target webpage URL as a seed resource locator, and putting it into the source queue to be crawled from step 1;

3. Crawling the webpage URLs in the source queue to be crawled, and extracting and filtering the text content, picture resources and URL links in the pages through an intelligent screening program;

for text content, word segmentation and sensitive word filtering can be performed, the content can be classified, and feature tags can be extracted; the feature tags are then compared with the features of the crawled content in the database to detect the repetition rate, and content whose repetition rate exceeds a specific value is screened out;

for picture resources, an MD5 digest is extracted and compared with the digests of the pictures already crawled into the database to detect repetition, and repeated pictures are screened out;

for URL links, the links are matched against the crawled webpage URLs in the database to filter out links that have already been extracted, and blacklist links, advertisement links, sub-links not under the target webpage URL and the like are screened out (it should be noted that a "sub-link not under the target webpage URL" here mainly means a link whose sub-URL does not belong to the same site as the root URL; for example, a link www.***.com/p/t extracted from the currently crawled www.aa.com page. Such URLs need to be filtered out to avoid unbounded crawl expansion);

after intelligent screening, the webpage URL is used as the unique key, and the screened text content, the text content feature tags and the picture resource MD5 digests are associated with it and stored in a relational database.

4. Adding the URL links that passed intelligent screening in step 3 to the source queue to be crawled from step 1, and repeating steps 2, 3 and 4; if no crawlable URL links are extracted in step 3, content crawling for the webpage URL used as the seed resource locator in step 2 is complete.

6. The crawled URL links can be added to the crawled URL queue, a corresponding update period is configured according to the timeliness requirement of each URL, and the URLs are re-added to the URL queue to be crawled according to their update periods for cyclic crawling.
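Two of the checks in step 3 of example 2 can be sketched briefly: the MD5 digest comparison for picture resources, and the test that keeps the crawl from expanding beyond the seed site. The function names are hypothetical. The site check assumes that "same site" means an identical host name, which is only one possible reading of the www.aa.com example; note also that an MD5 digest only detects byte-identical pictures, not resized or re-encoded copies.

```python
import hashlib
from typing import Set
from urllib.parse import urlparse


def md5_digest(image_bytes: bytes) -> str:
    """Hex MD5 digest of the raw picture bytes."""
    return hashlib.md5(image_bytes).hexdigest()


def is_new_picture(image_bytes: bytes, known_digests: Set[str]) -> bool:
    """True (and remember the digest) if this exact picture has not been seen before."""
    digest = md5_digest(image_bytes)
    if digest in known_digests:
        return False
    known_digests.add(digest)
    return True


def same_site(root_url: str, sub_url: str) -> bool:
    """True if sub_url has the same host name as root_url (assumed definition)."""
    root_host = urlparse(root_url).hostname
    return root_host is not None and root_host == urlparse(sub_url).hostname
```

Both URLs passed to `same_site` need a scheme such as `https://`; without one, `urlparse` cannot identify the host and the function returns `False`.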

It should be noted that the embodiment of the invention can perform consistent hash deduplication on the extracted sub URLs, which solves the problem of repeated acquisition of web pages; advertisement URLs are used for filtering, which shields advertising content information; and sensitive content is filtered by means of word segmentation, so that URLs containing sensitive content are intercepted. The intelligent filtering method provided by the embodiment of the invention can quickly filter out unwanted content during acquisition, and effectively improves the convenience of collecting and organizing webpage content.
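The text does not define the hashing scheme used for deduplicating sub URLs. The sketch below assumes one simple interpretation: each URL is normalized so that equivalent spellings hash to the same value, and the SHA-256 digest of the normalized form is kept in a set. This is not ring-based consistent hashing as used for sharding, and the names `normalize_url`, `url_fingerprint` and `UrlDeduplicator` are hypothetical.

```python
import hashlib
from typing import Set
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lower-case scheme and host, drop default ports and fragments, default path to "/"."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    default_port = (scheme == "http" and port == 80) or (scheme == "https" and port == 443)
    if port and not default_port:
        host = f"{host}:{port}"
    return urlunsplit((scheme, host, parts.path or "/", parts.query, ""))


def url_fingerprint(url: str) -> str:
    """Stable hash of the normalized URL."""
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()


class UrlDeduplicator:
    """Remembers the fingerprints of URLs that have already been accepted."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def add_if_new(self, url: str) -> bool:
        """Return True the first time a URL (in any equivalent form) is offered."""
        fingerprint = url_fingerprint(url)
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        return True
```

Storing fixed-length fingerprints instead of full URLs keeps the memory cost per entry constant, at the price of not being able to list the original URLs back.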

It should be noted that, for simplicity of description, the above method or flow embodiments are all described as a series of combinations of acts, but it should be understood by those skilled in the art that the embodiments of the present invention are not limited by the order of acts described, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are all alternative embodiments and that the actions involved are not necessarily required for the embodiments of the present invention.

Referring to fig. 3, an embodiment of the present invention further provides a content filtering device for network acquisition, including:

the webpage downloading module 1 is used for extracting a target URL link from a URL queue to be crawled and downloading a target webpage corresponding to the target URL link;

the content extraction module 2 is used for extracting target content from the target webpage based on a preset extraction rule;

the content filtering module 3 is used for filtering the target content based on a preset filtering rule, and storing the screened content obtained after filtering into a database;

and the loop execution module 4 is used for executing all of the above steps repeatedly until all URL links in the URL queue to be crawled have been extracted.

It can be understood that the embodiment of the device item corresponds to the embodiment of the method item of the present invention, and the content filtering device for network acquisition provided by the embodiment of the present invention may implement the content filtering method for network acquisition provided by any one of the embodiments of the method item of the present invention.

It should be noted that the above-described apparatus embodiments are merely illustrative, and the units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the embodiment of the device provided by the invention, the connection relation between the modules represents that the modules have communication connection, and can be specifically implemented as one or more communication buses or signal lines. Those of ordinary skill in the art will understand and implement the present invention without undue burden.

It will be clear to those skilled in the art that, for convenience and brevity, the specific working process of the apparatus described above may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.

The terminal device may be a computing device such as a desktop computer, a notebook computer, a palm computer, a cloud server, etc. The terminal device may include, but is not limited to, a processor, a memory.

The processor may be a central processing unit (CPU), another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the terminal device, and connects the various parts of the entire terminal device using various interfaces and lines.

The memory may be used to store the computer program, and the processor may implement various functions of the terminal device by running or executing the computer program stored in the memory and invoking data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function, and the like; the data storage area may store data created according to the use of the terminal device, and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.

The storage medium is a computer readable storage medium in which the computer program is stored; when executed by a processor, the computer program can implement the steps of the above method embodiments. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer readable medium may be added to or reduced as appropriate according to the requirements of legislation and patent practice in the jurisdiction; for example, in certain jurisdictions, according to legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunication signals.

While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of the invention; such changes and modifications are also intended to be within the scope of the invention.

Claims

1. A content filtering method for network acquisition, comprising:

2. The content filtering method for network acquisition according to claim 1, wherein before the steps are circularly performed until all URL links in the URL queue to be crawled are extracted, further comprising:

adding the currently crawled URL links to a preset crawled URL queue;

3. The content filtering method for network acquisition according to claim 1, wherein before the steps are circularly performed until all URL links in the URL queue to be crawled are extracted, further comprising:

analyzing and extracting all sub URL links in the target webpage;

4. The content filtering method for network acquisition according to claim 3, wherein the filtering all sub URL links based on a preset address filtering rule includes:

5. The content filtering method for network acquisition according to claim 1, wherein the filtering the target content based on a preset filtering rule includes:

6. The content filtering method for network acquisition according to claim 1, wherein the filtering the target content based on a preset filtering rule includes:

7. The content filtering method for network acquisition according to claim 1, wherein storing the filtered content obtained after the filtering process in a database includes:

8. A content filtering apparatus for network acquisition, comprising:

9. A terminal device comprising a processor and a memory storing a computer program, characterized in that the processor implements the content filtering method for network acquisition according to any one of claims 1 to 7 when executing the computer program.

10. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the content filtering method for network acquisition according to any of claims 1 to 7.