CN109829092B

CN109829092B - Method for directionally monitoring webpage

Info

Publication number: CN109829092B
Application number: CN201811604429.6A
Authority: CN
Inventors: 孙再连; 吴谋荣; 苏淮
Original assignee: Xiamen Etom Software Technology Co ltd
Current assignee: Xiamen Yitong Intelligent Technology Group Co ltd
Priority date: 2018-12-26
Filing date: 2018-12-26
Publication date: 2021-05-28
Anticipated expiration: 2038-12-26
Also published as: CN109829092A

Abstract

The invention discloses a method for directionally monitoring a webpage, in which content on the webpage is frame-selected, each item of content within the selection is captured, and related information for each item is provided, including a title, an abstract, a web address, the webpage text, and the like. Because the user selects content directly on the page and its related information is obtained immediately, the operation is simple and fast. Content on the page that is of the same kind as the selected content but was not selected can be obtained automatically, so the user does not need to make repeated selections, which improves working efficiency. The method can also record historical frame-selection operations and judge whether different selections correspond to the same content; when they do, it provides the content crawled by the historical selection together with its related information, avoiding repeated crawling of the webpage and wasted resources. A manual supervision mechanism and a manual judgment mechanism are also added, improving the accuracy and reliability of the method.

Description

Method for directionally monitoring webpage

Technical Field

The invention relates to the technical field of webpage monitoring, in particular to a method for directionally monitoring a webpage.

Background

In an era of information explosion, quickly and accurately acquiring data from the mass of information on the internet has become a pressing need for individuals and enterprises.

Many crawler tools and data acquisition products are currently on the market. Some are simple and intuitive to use, but they still have problems with generality, maintainability, and accuracy, specifically:

1. the data source depends on a strategy scheme customized by the product, so deep data customization cannot be achieved;

2. the configuration process is very complex, places high demands on the operator's skills, and can only be completed by professionals;

3. a specific parser can only extract specific pages, so extracting different columns from several different dynamic websites requires writing several parsers, which increases the complexity of the system;

4. when some feature of a target page changes, for example when the page link or the page layout is modified, the corresponding parser must be modified accordingly, and the modification becomes harder when there are many target pages or many changes.

Therefore, the market urgently needs a webpage monitoring method that combines machine learning with user behavior trajectory monitoring, offers high generality and good maintainability, and uses natural language processing to make the extraction of text data simpler and more accurate.

Disclosure of Invention

To solve the above technical problems, the invention provides a method for directionally monitoring a webpage that is simple and fast to operate. The method comprises frame-selecting content on the webpage, capturing each item of content within the selection, and providing related information for each item, the related information including a title, an abstract, a web address, the webpage text, and the like.

Optionally, the frame selection is performed by screenshot positioning.

Optionally, positioning information is obtained from the area frame-selected by the user, and the positions of all elements in the webpage are then compared with the position of the selection to preliminarily screen out the matching content, that is, the content the user wants.

Optionally, the frame selection area may be positioned as follows. During frame selection, the coordinates of the starting point are recorded as (X1, Y1) and the coordinates of the end point as (X2, Y2); the starting point and the end point enclose a rectangular frame selection area. The horizontal and vertical displacements caused by the user dragging the webpage scroll bars are added to these coordinates to obtain the absolute coordinates of the starting point and the end point, (X1 + ScrollLeft, Y1 + ScrollTop) and (X2 + ScrollLeft, Y2 + ScrollTop) respectively, where ScrollLeft is the horizontal scroll offset of the webpage and ScrollTop is the vertical scroll offset.

The coordinates of each item of content in the webpage are then acquired. The coordinates of any content A are recorded as (Xa, Ya), its horizontal extent (width) as W, and its vertical extent (height) as H.

An exclusion method is used to judge whether content A lies in the area frame-selected by the user. Content A is judged not to be in the frame selection area when Xa + W < X1 + ScrollLeft, or X2 + ScrollLeft < Xa, or Ya + H < Y1 + ScrollTop, or Y2 + ScrollTop < Ya (A lies entirely outside the area), or when Xa < X1 + ScrollLeft, Ya < Y1 + ScrollTop, Xa + W > X2 + ScrollLeft, and Ya + H > Y2 + ScrollTop all hold (A completely encloses the area); otherwise content A is judged to be in the frame selection area. These steps are repeated to obtain all frame-selected content in the webpage.

Optionally, the webpage source code is classified and labeled through machine learning; through machine learning, user operation simulation, intelligent benchmarking, and drill-down crawling, the trajectory of the user's frame selection is simulated for the webpage content the user is interested in and the webpage is deeply mined, so that content in the webpage that the user needs but did not frame-select is crawled.

Optionally, all content in the frame selection area is collected to form a set B, and the first element b1 and the last element bn are taken from B. The two elements are analyzed for a common parent node: if their parent node levels differ, the two elements are considered not to be of the same type, bn is discarded, element bn-1 is taken instead, and b1 and bn-1 are analyzed in the same way, and so on until an element bm that shares a common parent node with b1 is found. The styles of b1 and bm are then compared: if they differ, bm is discarded, element bm-1 is taken, and the styles of b1 and bm-1 are compared, and so on until an element bz with the same style as b1 is found. All ancestor nodes of b1 and of bz are obtained as list1 and listz respectively, and the deepest node that appears in both list1 and listz is taken as node1, which is the nearest common ancestor of b1 and bz. Finally, node1 is searched for elements having the same style as b1, yielding the set Y = {b1, …, bz}, which is the content the user needs to obtain.

Optionally, the user's current frame selection area is compared with the user's own historical frame selection areas or with those of other users to judge whether they are the same frame selection area. When they are the same, real-time related information for the selected content is obtained on the basis of the historical selection; when they are not, the content that the user needs but did not frame-select is crawled together with its related information. Because identical selections do not need to be crawled again, the number of crawls is reduced and the resource waste caused by repeated crawling is avoided.

Optionally, whether two selections are the same frame selection area is determined as follows: the starting-point and end-point coordinates (X1, Y1, X2, Y2) of the user's frame selection are taken as input parameters to construct an SVM classifier, the frame selection area is classified according to where all the content between the starting point and the end point is located within the whole webpage, and the classification result is used to decide whether the areas are the same.

Optionally, a user supervision mechanism is added to the classifier: the user judges whether the classification result is the content the user is interested in, and the judgment is added to the training set for the next round of training. The training set is cleaned and retrained periodically, and noise classes produced by user misoperation are merged so that only correct judgments remain in the training set. The stored training result is loaded whenever the classifier is used, avoiding the resource waste of repeated training.

Optionally, machine learning, supervised learning, and reinforcement learning are performed continuously on the user's directional frame-selection behavior according to the user's identity characteristics, so that frame selections for the content the user is interested in can be recommended automatically. For automatic recommendation, a Bayesian classifier is trained on user behavior data samples. When a recommended selection meets the user's needs, the classifier automatically stores the user behavior and the recommendation result in the data sample library; when it does not, the program automatically jumps to the user's manual frame-selection interface and learns from the user's behavior and the resulting selection, improving the accuracy of the classifier's automatic recommendations.

As can be seen from the above description of the present invention, compared with the prior art, the present invention has the following advantages:

1. the user frame-selects content on the webpage and the related information of the selected content is obtained directly, so the operation is simple and fast;

2. content on the webpage that is of the same kind as the selected content but was not selected is obtained automatically, which avoids repeated frame selection by the user and improves the user's working efficiency;

3. historical frame-selection operations are recorded, the method judges whether different selections correspond to the same content, and when they do it provides the content acquired by the historical selection and its related information;

4. a manual supervision mechanism and a manual judgment mechanism are added, improving the accuracy and reliability of the method.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention.

Wherein:

FIG. 1 is a schematic flow chart illustrating a first embodiment of a method for directionally monitoring a web page according to the present invention;

FIG. 2 is a schematic flowchart of a third embodiment of a method for directionally monitoring a web page according to the present invention;

FIG. 3 is a flowchart illustrating a fourth embodiment of a method for directionally monitoring a web page according to the present invention;

FIG. 4 is a schematic diagram of the steps of the fourth embodiment of a method for directionally monitoring a web page according to the present invention;

FIG. 5 is a schematic diagram of the steps of the user judgment mechanism of the fifth embodiment of a method for directionally monitoring a web page according to the present invention.

Detailed Description

To make the technical problems to be solved, the technical solutions, and the advantageous effects of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the invention and are not intended to limit it.

The first embodiment is as follows: referring to FIG. 1, a method for directionally monitoring a webpage comprises frame-selecting content on the webpage by screenshot positioning, capturing each item of content within the selection, and providing related information for each item, the related information including a title, an abstract, a web address, the webpage text, and the like. The operation is simple and fast.

Before frame selection, the user navigates to the target website to be captured, and a full-page image of the webpage is generated. The user frame-selects on this image, in the manner of taking a screenshot, to choose the data and content to capture: the image recognizes three user events, namely pressing the left mouse button, moving the mouse, and releasing the mouse button, from which the frame selection area is obtained. The data and content the user wants are then captured, and the captured data are parsed and displayed to the user.

In the second embodiment, on the basis of the first embodiment, to ensure that the content obtained is the content the user needs, positioning information is obtained from the area frame-selected by the user, and the positions of all elements in the webpage are then compared with the position of the selection to preliminarily screen out the matching content, that is, the content the user wants.

In this embodiment, the frame selection area may be positioned as follows. First, during frame selection, the coordinates of the point where the mouse starts the selection are recorded as (X1, Y1) and the coordinates of the end point as (X2, Y2); the starting point and the end point enclose a rectangular frame selection area. Because the user may have dragged the browser's scroll bars before taking the screenshot, the horizontal and vertical displacements caused by scrolling are added to these coordinates to obtain the absolute coordinates of the starting point and the end point, (X1 + ScrollLeft, Y1 + ScrollTop) and (X2 + ScrollLeft, Y2 + ScrollTop) respectively, where ScrollLeft is the horizontal scroll offset of the webpage and ScrollTop is the vertical scroll offset.

Then, the coordinates of each item of content in the webpage are acquired. The coordinates of any content A are recorded as (Xa, Ya), its horizontal extent (width) as W, and its vertical extent (height) as H.

Finally, an exclusion method is used to judge whether content A lies in the area frame-selected by the user. Content A is judged not to be in the frame selection area when Xa + W < X1 + ScrollLeft, or X2 + ScrollLeft < Xa, or Ya + H < Y1 + ScrollTop, or Y2 + ScrollTop < Ya (A lies entirely outside the area), or when Xa < X1 + ScrollLeft, Ya < Y1 + ScrollTop, Xa + W > X2 + ScrollLeft, and Ya + H > Y2 + ScrollTop all hold (A completely encloses the area); otherwise content A is judged to be in the frame selection area.

These three steps are repeated to obtain all frame-selected content in the webpage.
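The positioning and exclusion test of this embodiment can be sketched as follows. This is a minimal illustration rather than the patented implementation: the names `Rect`, `selection_rect`, and `is_selected` are hypothetical, the element boxes are made-up sample data, and sorting the drag coordinates so that the drag direction does not matter is an added assumption not stated in the text.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    """Axis-aligned rectangle in absolute page coordinates."""
    left: float
    top: float
    right: float
    bottom: float


def selection_rect(x1, y1, x2, y2, scroll_left, scroll_top):
    """Convert viewport drag coordinates to absolute page coordinates."""
    # Assumption: normalise so that (left, top) is always the upper-left corner.
    left, right = sorted((x1, x2))
    top, bottom = sorted((y1, y2))
    return Rect(left + scroll_left, top + scroll_top,
                right + scroll_left, bottom + scroll_top)


def is_selected(xa, ya, w, h, sel):
    """Exclusion test: False if content A is ruled out, True otherwise."""
    # Case 1: A lies entirely to the left or right of the selection.
    if xa + w < sel.left or sel.right < xa:
        return False
    # Case 1 (continued): A lies entirely above or below the selection.
    if ya + h < sel.top or sel.bottom < ya:
        return False
    # Case 2: A strictly encloses the whole selection (e.g. a page container).
    if xa < sel.left and ya < sel.top and xa + w > sel.right and ya + h > sel.bottom:
        return False
    return True


# Sample data: element name -> (Xa, Ya, W, H) in absolute page coordinates.
sel = selection_rect(100, 50, 400, 300, scroll_left=0, scroll_top=200)
elements = {
    "headline": (120, 260, 200, 30),
    "footer": (0, 900, 800, 60),
    "body": (0, 0, 1000, 2000),
}
selected = [name for name, box in elements.items() if is_selected(*box, sel)]
```

Here the footer is excluded because it lies below the selection, and the page-sized body element is excluded because it encloses the selection, leaving only the headline.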

In the third embodiment, because the second embodiment obtains only the content frame-selected by the user and cannot obtain content that the user also needs but did not select, the features common to the content in the frame selection area need to be found, and the webpage is then crawled to obtain all the content in the webpage that the user needs.

In this embodiment, the webpage source code is classified and labeled through machine learning.

Referring to FIG. 2, artificial intelligence is used to analyze the user's operation behavior. Specifically, through machine learning, user operation simulation, intelligent benchmarking, and drill-down crawling, the trajectory of the user's frame selection is simulated for the webpage content the user is interested in, and the webpage is deeply mined to obtain the content in the webpage that the user needs but did not frame-select.

The specific steps are as follows:

first, all content in the frame selection area is collected to form a set B, and the first element b1 and the last element bn are taken from B; the two elements are analyzed for a common parent node, and if their parent node levels differ, the two elements are considered not to be of the same type, bn is discarded, element bn-1 is taken instead, and b1 and bn-1 are analyzed in the same way, and so on until an element bm that shares a common parent node with b1 is found;

then, the styles of b1 and bm are compared, and if they differ, bm is discarded, element bm-1 is taken, and the styles of b1 and bm-1 are compared, and so on until an element bz with the same style as b1 is found;

next, all ancestor nodes of b1 and of bz are obtained as list1 and listz respectively, and the deepest node that appears in both list1 and listz is taken as node1, which is the nearest common ancestor of b1 and bz;

finally, node1 is searched for elements having the same style as b1, yielding the set Y = {b1, …, bz}, which is the content the user needs to obtain.
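The four steps above can be sketched on a toy element tree. This is an illustrative reading under stated assumptions, not the patented implementation: `Node`, `ancestors`, `style`, and `generalise` are hypothetical names, "same parent node level" is interpreted as equal depth in the tree, and "style" is approximated by the tag name plus the class attribute.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False, repr=False)
class Node:
    """Minimal stand-in for a DOM element (hypothetical, for illustration)."""
    tag: str
    css_class: str = ""
    text: str = ""
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def ancestors(node):
    """Return the ancestors of node, ordered from the root down to its parent."""
    chain = []
    while node.parent is not None:
        node = node.parent
        chain.append(node)
    return chain[::-1]


def style(node):
    # Assumption: "style" is approximated by tag name plus class attribute.
    return (node.tag, node.css_class)


def generalise(selected):
    """Expand the frame-selected elements B into every same-style element (set Y)."""
    b1 = selected[0]
    # Step 1: scan back from bn until an element at the same depth as b1 (bm).
    m = len(selected) - 1
    while len(ancestors(selected[m])) != len(ancestors(b1)):
        m -= 1
    # Step 2: keep scanning back until the style also matches b1 (bz).
    z = m
    while style(selected[z]) != style(b1):
        z -= 1
    bz = selected[z]
    # Step 3: node1 is the deepest node shared by list1 and listz.
    node1 = None
    for a, b in zip(ancestors(b1), ancestors(bz)):
        if a is not b:
            break
        node1 = a
    if node1 is None:  # b1 is the root, so there is nothing to search under
        return [b1]
    # Step 4: collect descendants of node1 with b1's style, in document order.
    result, stack = [], [node1]
    while stack:
        node = stack.pop()
        if node is not node1 and style(node) == style(b1):
            result.append(node)
        stack.extend(reversed(node.children))
    return result


# Toy page: a news list with four items, an advert, and a note.
body = Node("body")
ul = body.add(Node("ul", "news"))
items = [ul.add(Node("li", "item", text=t)) for t in "ABCD"]
banner = body.add(Node("div", "ad")).add(Node("span", "banner"))
note = body.add(Node("p", "note"))

# The user's box covered only items A and B, plus the banner and the note.
found = generalise([items[0], items[1], banner, note])
texts = [n.text for n in found]
```

In this example the note is discarded for being at a different depth and the banner for having a different style, so bz is item B, node1 is the list element, and the search returns all four list items, including C and D, which the user did not select.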

In the fourth embodiment, when a webpage has a lot of content, crawling it takes a certain amount of time, and working efficiency would be low if the crawl of the third embodiment were performed for every frame selection.

In different frame-selection operations, the selected content and the content to be crawled may be the same. If previously crawled content can be provided to the user again, together with real-time related information for that content, the number of webpage crawls can be reduced and working efficiency improved.

In this embodiment, training is performed on the parameters of the user's frame selections and their corresponding content. The parameters are the starting-point and end-point coordinates of the frame selection (X1, Y1, X2, Y2); they serve as the input, the corresponding frame-selected content serves as the output class, and an SVM classifier is constructed so that identical areas can be identified from the classification result.

The user's current frame selection area is compared with the user's own historical frame selection areas or with those of other users to judge whether they are the same frame selection area. When they are the same, real-time related information for the selected content is obtained on the basis of the historical selection; when they are not, the content that the user needs but did not frame-select is crawled together with its related information.

For example, yesterday user A frame-selected a microblog webpage, selecting three kinds of content: sports, food, and military. The system crawled the whole webpage, retrieved the content that the user needed but had not selected, and provided its related information, such as the titles and web addresses of yesterday's sports, food, and military news. Today user B frame-selects the same microblog webpage. Although the starting point and end point of the selection are different, the system judges that the selected content is also sports, food, and military content, that is, user B's frame selection area is the same as user A's. The content that the user needs but did not select therefore does not have to be crawled again: the crawl result for user A is recommended directly to user B, followed by the real-time related information for that result, such as the titles and web addresses of today's sports, food, and military news. In other words, the related information for the same content in the same frame selection area is updated according to the date.

In other embodiments, referring to FIG. 3, a user supervision mechanism is added to the classifier on the basis of the fourth embodiment: the user judges whether the classification result is the content the user is interested in, and the judgment is added to the training set for the next round of training. The training set is cleaned and retrained periodically, and noise classes produced by user misoperation are merged so that only correct judgments remain in the training set. The stored training result is loaded whenever the classifier is used, avoiding the resource waste of repeated training.

For the specific steps, refer to FIG. 4:

1. The user makes a frame selection, and the input parameters (x1, y1, x2, y2) are obtained.

2. The parameters are input and classified by the SVM classifier constructed through machine learning; the classification result (a standard frame selection area) is obtained and displayed on the frame-selection page.

3. The user judges whether the content of the classification result is correct, that is, whether it is the content the user needs; this interaction improves the training set.

4. If the user judges that the content of the standard frame selection area is not the desired content, the acquired parameters are set up as a new class and stored in the database, the webpage content is crawled, and the result is output.

5. If the user judges that the content of the standard frame selection area is the desired content, the content corresponding to the class is retrieved and the previously crawled result is output, avoiding a repeated crawl, and the parameters are stored in the database to improve the training set data.

6. The training set data are cleaned periodically, and noise classes produced by user misoperation (classes that actually point to the same content) are merged to improve classification accuracy.

7. The training set is retrained periodically and the training result is stored; the stored result is loaded when the classifier is used, avoiding the resource waste of repeated training.
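The control flow of steps 1 to 6 can be sketched as below. To stay dependency-free, the sketch deliberately swaps the SVM classifier for a much simpler nearest-centroid rule over the (x1, y1, x2, y2) parameters; it is not an SVM and is not the patented classifier. `SelectionHistory`, the `max_distance` threshold, the `user_accepts` callback, and the `crawl` function are all hypothetical, and step 7 (persisting the trained model) is omitted.

```python
import math


class SelectionHistory:
    """Toy region classifier with result reuse (nearest centroid, NOT an SVM)."""

    def __init__(self, max_distance=50.0):
        self.samples = {}                  # class label -> list of (x1, y1, x2, y2)
        self.results = {}                  # class label -> previously crawled content
        self.max_distance = max_distance   # hypothetical matching threshold
        self._next_id = 0

    def classify(self, box):
        """Step 2: return the nearest stored class, or None if none is close enough."""
        best, best_dist = None, self.max_distance
        for label, boxes in self.samples.items():
            centroid = [sum(axis) / len(boxes) for axis in zip(*boxes)]
            dist = math.dist(box, centroid)
            if dist <= best_dist:
                best, best_dist = label, dist
        return best

    def handle(self, box, user_accepts, crawl):
        """Steps 1-5: classify, ask the user, then reuse or crawl.

        Returns (content, crawled), where crawled tells whether a crawl was needed.
        """
        label = self.classify(box)
        if label is not None and user_accepts(label):
            self.samples[label].append(box)       # step 5: enrich the training set
            return self.results[label], False
        label = f"class{self._next_id}"           # step 4: register a new class
        self._next_id += 1
        self.samples[label] = [box]
        self.results[label] = crawl(box)
        return self.results[label], True

    def merge(self, keep, drop):
        """Step 6: fold a duplicate (noise) class into the class it repeats."""
        self.samples[keep].extend(self.samples.pop(drop))
        self.results.pop(drop)


crawl_calls = []


def crawl(box):
    """Hypothetical crawler; records each call so reuse can be observed."""
    crawl_calls.append(box)
    return ["sports", "food", "military"]


history = SelectionHistory()
# User A's selection triggers a crawl; user B's nearby selection reuses the result.
first, crawled_first = history.handle((100, 200, 400, 500), lambda label: True, crawl)
second, crawled_second = history.handle((110, 195, 405, 510), lambda label: True, crawl)
```

After these two calls the crawler has run only once, which is the saving the embodiment aims for. If a user wrongly rejects a correct match, a duplicate class is created, and `merge` corresponds to the periodic cleaning that folds it back.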

In the fifth embodiment, on the basis of the first, second, third, or fourth embodiment, users are classified by their tags to realize intelligent recommendation of frame selections. The tags include industry, position, and region. Machine learning, supervised learning, and reinforcement learning are performed continuously on the user's directional frame-selection behavior according to the user's identity characteristics, so that frame selections for the content the user is interested in are recommended automatically. A user judgment mechanism is also added to the intelligent recommendation: a Bayesian classifier is trained on user behavior data samples. When a recommended selection meets the user's needs, the classifier automatically stores the user behavior and the recommendation result in the data sample library; when it does not, the program automatically jumps to the user's manual frame-selection interface and learns from the user's behavior and the resulting selection, improving the accuracy of the classifier's automatic recommendations.

In this embodiment, referring to FIG. 5, the specific operation steps are as follows:

1. When the user enters the frame-selection page, the system asks whether to enable automatic recommendation of a frame selection; if not, the user is taken to manual frame selection.

2. If so, the user's tags are obtained as input parameters.

3. A Bayesian classifier is constructed and classifies according to the input parameters; the result is the probability that the user is interested in each class, and the class with the highest probability is output.

4. The user judges whether the automatic recommendation matches the user's actual needs; if so, the result is output, and the user and the result are stored in the database to improve the training samples.

5. If not, the user is taken to the manual frame-selection interface to select manually, and the user and the result are stored in the database to improve the training samples.
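Steps 2 to 5 can be illustrated with a small naive Bayes classifier over the user's tags. The text specifies only "a Bayesian classifier", so the naive independence assumption, the add-one smoothing, the `TagRecommender` class, and the sample tags and content classes below are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


class TagRecommender:
    """Naive Bayes over categorical user tags, with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # content class -> number of samples
        self.tag_counts = defaultdict(Counter)   # (class, tag name) -> value counts
        self.values = defaultdict(set)           # tag name -> distinct values seen

    def learn(self, tags, chosen):
        """Steps 4-5: store one (user tags, chosen content class) sample."""
        self.class_counts[chosen] += 1
        for name, value in tags.items():
            self.tag_counts[(chosen, name)][value] += 1
            self.values[name].add(value)

    def recommend(self, tags):
        """Step 3: return (most probable class, its posterior probability)."""
        if not self.class_counts:
            return None, 0.0
        total = sum(self.class_counts.values())
        scores = {}
        for cls, n in self.class_counts.items():
            p = n / total  # prior P(class)
            for name, value in tags.items():
                seen = self.tag_counts[(cls, name)][value]
                # Smoothed P(value | class); the extra +1 reserves mass for unseen values.
                p *= (seen + 1) / (n + len(self.values[name]) + 1)
            scores[cls] = p
        best = max(scores, key=scores.get)
        return best, scores[best] / sum(scores.values())


rec = TagRecommender()
rec.learn({"industry": "finance", "position": "analyst", "region": "Xiamen"}, "stock news")
rec.learn({"industry": "finance", "position": "manager", "region": "Beijing"}, "stock news")
rec.learn({"industry": "sports", "position": "coach", "region": "Xiamen"}, "match results")

new_user = {"industry": "finance", "position": "analyst", "region": "Beijing"}
best, prob = rec.recommend(new_user)
# If the user confirms the recommendation, it is stored as a new training sample;
# otherwise the manually chosen class would be passed to learn() instead.
rec.learn(new_user, best)
```

Either branch of the user's judgment ends in a call to `learn`, which mirrors how both the accepted recommendation and the manual selection feed the training samples.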

In summary, compared with the prior art, the method for directionally monitoring a webpage provided by the present application obtains the related information of the selected content directly through frame selection on the webpage, so the operation is simple and fast. Content on the webpage that is of the same kind as the selected content but was not selected is obtained automatically, which avoids repeated frame selection by the user and improves the user's working efficiency. The method can also record historical frame-selection operations and judge whether different selections correspond to the same content; when they do, it provides the content obtained by the historical selection together with its related information, avoiding repeated crawling of the webpage and wasted resources. A manual supervision mechanism and a manual judgment mechanism are also added, improving the accuracy and reliability of the method.

The invention has been described above by way of example with reference to the accompanying drawings. Obviously, the specific implementation of the invention is not limited to the manner described above, and applying the inventive concept and technical solution to other situations without substantial modification falls within the scope of protection of the invention.

Claims

1. A method for directionally monitoring a webpage, characterized in that content on the webpage is frame-selected, each item of frame-selected content is captured, and related information of each item is provided, wherein the related information comprises a title, an abstract, a web address, and webpage text; all content in the frame selection area is collected to form a set B, a first element b1 and a last element bn are obtained from the set B, and the first element b1 and the last element bn are analyzed to obtain their common parent node; if the parent node levels of b1 and bn differ, the two elements are considered not to be of the same type, bn is discarded, an element bn-1 is obtained, and the common parent node of b1 and bn-1 is analyzed, and so on until an element bm having a common parent node with b1 is found; whether the styles of b1 and bm are the same is analyzed, and if they differ, bm is discarded, an element bm-1 is obtained, and the styles of b1 and bm-1 are analyzed, and so on until an element bz having the same style as b1 is found; all ancestor nodes of b1 and of bz are obtained as list1 and listz respectively, and the deepest node common to list1 and listz is taken as node1, node1 being the nearest common ancestor node of b1 and bz; and node1 is searched for elements having the same style as b1 to obtain a set Y = {b1, …, bz}, the set Y being the content the user needs to obtain.

2. The method of claim 1, wherein the frame selection is performed by a screen shot positioning method.

3. The method as claimed in claim 1, wherein the positioning information is obtained according to the user selection area, and then the positions of all elements in the web page are compared with the positions of the user selection content, so as to primarily screen out the matching content, which is the content that the user wants to know.

4. The method for directionally monitoring a webpage according to claim 3, wherein during frame selection the coordinates of the starting point of the frame selection are recorded as (X1, Y1) and the coordinates of the end point as (X2, Y2), the starting point and the end point enclosing a rectangular frame selection area; the horizontal and vertical displacements caused by the user dragging the webpage scroll bars are added to the coordinates of the frame selection area to obtain the absolute coordinates of the starting point and the end point, which are (X1 + ScrollLeft, Y1 + ScrollTop) and (X2 + ScrollLeft, Y2 + ScrollTop) respectively; the coordinates of each item of content in the webpage are acquired, the coordinates of any content A being recorded as (Xa, Ya), its horizontal extent (width) as W, and its vertical extent (height) as H; and an exclusion method is used to judge whether content A lies in the area frame-selected by the user, content A being judged not to be in the frame selection area when Xa + W < X1 + ScrollLeft, or X2 + ScrollLeft < Xa, or Ya + H < Y1 + ScrollTop, or Y2 + ScrollTop < Ya, or when Xa < X1 + ScrollLeft, Ya < Y1 + ScrollTop, Xa + W > X2 + ScrollLeft, and Ya + H > Y2 + ScrollTop all hold, and otherwise being judged to be in the frame selection area.

5. The method for directionally monitoring a webpage according to claim 1, wherein the webpage source code is classified and labeled through machine learning; and through machine learning, user operation simulation, intelligent benchmarking, and drill-down crawling, the trajectory of the user's frame selection is simulated for the webpage content the user is interested in, so that content that the user needs but did not frame-select is crawled.

6. The method for directionally monitoring the webpage according to claim 5, wherein the real-time frame selection area of the user is compared with the historical frame selection area or the historical frame selection areas of other users to judge whether the real-time frame selection areas belong to the same frame selection area; when the judgment result shows that the two items are the same, acquiring real-time related information of the framed content according to historical framing; and when the judgment result is that the content is not the same, crawling the content which is not selected by the user in the frame and is needed by the user and the related information of the content.

7. The method of claim 6, wherein the same frame region is determined by obtaining coordinate locations of a start point and an end point (X1, Y1, X2, Y2) of the user frame as input parameters to construct an SVM classifier, classifying the frame region according to the locations of all contents between the start point and the end point in the whole webpage, and determining the same region according to the classification result.

8. The method for directionally monitoring a webpage according to claim 7, wherein a user supervision mechanism is added to the classifier, the user judges whether the classification result is the content the user is interested in, and the judgment is added to a training set for the next round of training; the training set is cleaned and retrained periodically, noise classes produced by user misoperation are merged so that only correct judgments remain in the training set, and the stored training result is loaded when the classifier is used, avoiding the resource waste caused by repeated training.

9. The method for directionally monitoring the webpage according to the claim 1 or 5, wherein the machine learning, the supervised learning and the reinforcement learning are continuously performed on the directional frame selection behaviors of the user according to the identity characteristics of the user, so that the automatic recommendation frame selection is intelligently performed on the contents concerned by the user; the automatic recommendation frame selection is that a Bayesian classifier is used for carrying out classification training on user behavior data samples, when the recommendation frame selection meets the user requirements, the classifier automatically stores the user behaviors and recommendation results into a data sample library, when the recommendation frame selection does not meet the user requirements, a program automatically skips to a manual frame selection interface of a user, and simultaneously learns the user behaviors and the frame selection results, so that the accuracy of automatic recommendation frame selection of the user by the classifier is improved.