CN106612279B

CN106612279B - Network address processing method, equipment and system

Info

Publication number: CN106612279B
Application number: CN201611199113.4A
Authority: CN
Inventors: 柴斌
Original assignee: WUXI PUBLIC SECURITY BUREAU; Beijing Knownsec Information Technology Co Ltd
Current assignee: WUXI PUBLIC SECURITY BUREAU; Beijing Knownsec Information Technology Co Ltd
Priority date: 2016-12-22
Filing date: 2016-12-22
Publication date: 2020-04-17
Anticipated expiration: 2036-12-22
Also published as: CN106612279A

Abstract

The invention discloses a method for processing a network address, comprising the following steps: acquiring the network content pointed to by a network address; acquiring the text content contained in the network content; dividing the text content into at least one sentence; for each sentence, extracting at least one key word of the sentence; for each key word, generating a feature vector of the key word in the network content, so that the feature vectors of the at least one key word form a feature vector group of the network content; for each network address set in the data storage device, calculating a difference value between the network content and the network address set according to the feature vector group; if the difference value is less than or equal to the difference threshold of the network address set, determining that the network address and the network address set point to similar network content; and storing the network address, the network content pointed to by the network address, and the feature vector group of the network content into the network address set. The invention also discloses a network address analysis method, device, and system.

Description

Network address processing method, equipment and system

Technical Field

The present invention relates to the field of information security technologies, and in particular, to a method, a device, and a system for processing a network address.

Background

With the rapid development of network communication technology, the continuous deepening of internet applications, and the growing abundance of the information they carry, the internet has become an important infrastructure of human society. Meanwhile, events endangering network security emerge one after another, drawing great public attention to network security.

Some lawbreakers deceive users into trusting them and harm the users' interests by inducing them to access malicious network addresses that point to malicious network content. With the spread of electronic commerce and internet applications, the losses caused by such malicious activities are becoming increasingly serious.

At present, a browser or a network switching device can connect to a malicious network address storage device to provide safe browsing of network content. Before the browser acquires a webpage or the network switching device sends an access request, the network address is looked up in the malicious network address storage device; if the network address exists there, the browser stops acquiring the network content of the network address, or the network switching device stops sending the access request, and warning information is displayed.

Data in the malicious network address storage device comes mainly from two sources: direct reports by users and automatic analysis by a program. Automatic analysis generates malicious network address data as follows: the characteristics of malicious network content are induced by analyzing a certain number of malicious network content samples; other network content is then analyzed against these characteristics; if the characteristics match, the network content is classified as malicious network content and the corresponding network address is added to the malicious network address storage device.

However, the actual network environment is complex and changes rapidly. A considerable proportion of malicious network addresses become invalid soon after going online, so their malicious network content can no longer be acquired. The content that was used to commit fraud at the time therefore cannot be analyzed, the chance to capture samples is lost, the corresponding entries are missing from the malicious network address storage device, and data coverage is low. Improving the coverage of the malicious network address storage device and supplementing its sample range are therefore very important for guaranteeing network security.

Therefore, a solution that can supplement the samples of the malicious network address storage device is urgently needed.

Disclosure of Invention

To this end, the present invention provides a network address handling scheme in an attempt to solve or at least alleviate at least one of the problems presented above.

According to one aspect of the present invention, there is provided a method for processing a network address, adapted to be executed in a network address analysis device. The network address analysis device is connected to a data storage device that stores at least one network address set; each network address set includes at least one network address pointing to similar network content, together with the network content pointed to by each of those network addresses and the feature vector group of that network content. The method includes the steps of: acquiring the network content pointed to by a network address; acquiring the text content contained in the network content; dividing the text content into at least one sentence according to punctuation marks; for each sentence, extracting at least one key word of the sentence; for each key word, generating a feature vector of the key word in the network content according to the position of the key word in the text content, so that the feature vectors of the at least one key word form a feature vector group of the network content; for each network address set in the data storage device, calculating a difference value between the network content and the network address set according to the feature vector group; if the difference value is less than or equal to the difference threshold of the network address set, determining that the network address and the network address set point to similar network content; and storing the network address, the network content pointed to by the network address, and the feature vector group of the network content into the network address set.

According to another aspect of the present invention, there is provided a network address analysis method adapted to be executed in a network address analysis system. The network address analysis system includes a task assigning device, a data storage device, and a network address analysis device; the data storage device stores analysis records of network addresses and network address sets, each network address set including at least one network address pointing to similar network content. The method includes the steps of: receiving a network address to be analyzed via the task assigning device; and executing the method for processing a network address according to the present invention via the network address analysis device, so as to store the network address in the network address set of the data storage device that points to similar network content.

According to another aspect of the present invention, there is provided a network address analysis device connected to a data storage device. The data storage device stores at least one network address set; each network address set includes at least one network address pointing to similar network content, together with the network content pointed to by each of those network addresses and the feature vector group of that network content. The device includes: a content acquisition module adapted to acquire the network content pointed to by a received network address; a word extraction module adapted to acquire the text content contained in the network content, divide the text content into at least one sentence according to punctuation marks, and, for each sentence, extract at least one key word of the sentence; a feature generation module adapted to generate, for each key word, a feature vector of the key word in the network content according to the position of the key word in the text content, so that the feature vectors of the at least one key word form a feature vector group of the network content; and a set decision module adapted to calculate, for each network address set in the data storage device, a difference value between the network content and the network address set according to the feature vector group, to determine, if the difference value is less than or equal to the difference threshold of the network address set, that the network address and the network address set point to similar network content, and to store the network address, the network content pointed to by the network address, and the feature vector group of the network content into the network address set.

According to yet another aspect of the present invention, there is provided a network address analysis system comprising a data storage device, a task assigning device, and a network address analysis device according to the present invention, wherein the data storage device is adapted to store analysis records of network addresses and network address sets, each network address set comprising at least one network address pointing to similar network content; the task assigning device is adapted to receive a network address to be analyzed and send the network address to the network address analysis device; and the network address analysis device is adapted to process the network address and store it in the network address set in the data storage device that points to similar network content.

According to the network address processing scheme of the present invention, the network content pointed to by a network address is acquired and the feature vector group of the text content contained in the network content is extracted, so that the feature vector group uniquely identifies the network content. Whether two network contents are similar is then judged by calculating the difference value between their feature vector groups; this judgment is simple, effective, and highly accurate. Furthermore, network addresses pointing to similar network content are stored in the same network address set, so that, given one malicious network address, the other network addresses pointing to similar network content can be supplied to the malicious network address storage device. This greatly supplements the samples of the malicious network address storage device, markedly improves its coverage, and allows safer services to be provided to users.

Drawings

To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which are indicative of various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.

Fig. 1 shows a block diagram of a network system 100 according to an exemplary embodiment of the present invention;

Fig. 2 shows a block diagram of a network address analysis system 200 according to an exemplary embodiment of the present invention;

Fig. 3 shows a block diagram of a network address analysis device 300 according to an exemplary embodiment of the present invention;

Fig. 4 shows a schematic diagram of supplementing malicious network addresses according to an exemplary embodiment of the present invention;

Fig. 5 shows a flow diagram of a network address analysis method 500 according to an exemplary embodiment of the present invention; and

Fig. 6 shows a flow diagram of a method 600 for processing a network address according to an exemplary embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Fig. 1 shows a block diagram of a network system 100 according to an exemplary embodiment of the present invention. The network system 100 may include a client 110, a network switching device 120, and a malicious network address storage device 130. Through a browser residing on the client 110, a user may access network content pointed to by a network address on the internet via the network switching device 120.

Generally, to secure the network of the user, the network switching device 120 may connect to a malicious network address storage device 130, the malicious network address storage device 130 storing a known malicious network address. Before sending the access request of the user to the network server corresponding to the network address, the network switching device 120 may first determine whether the network address is a malicious network address, that is, query whether the network address exists in the malicious network address storage device 130. If the network address exists, the network address is a malicious network address, and the network switching device 120 may return a warning message to the user to prompt the user that the network address is a malicious network address. If not, the network switching device 120 sends the access request to the network server corresponding to the network address without intervening in the access.

Here, the determination and intervention of the malicious network address may also be implemented by connecting a browser on the client 110 to the malicious network address storage device 130, which is not limited by the present invention.

However, the real network environment is complex and changes very quickly, and a part of malicious network addresses are invalid soon after being online, so that malicious network contents cannot be acquired. Thus, the malicious network content that implements the fraud at that time cannot be analyzed, and the chance of capturing the sample is lost, resulting in the loss of the corresponding content in the malicious network address storage device 130, limited samples, and low data coverage. Therefore, the analysis processing is also required for the network addresses that do not exist in the malicious network address storage device 130.

Fig. 2 shows a block diagram of a network address analysis system 200 according to an exemplary embodiment of the present invention. The network switching device 120 (or the browser on the client 110) may be connected to the network address analysis system 200, and when the network switching device 120 queries that the network address does not exist in the malicious network address storage device 130, the network address may be sent to the network address analysis system 200 as the network address to be analyzed for analysis processing. Further, the network switching device 120 may also perform white list filtering on network addresses that do not exist in the malicious network address storage device 130, and then send network addresses that miss the white list to the network address analysis system 200.

As shown in Fig. 2, the network address analysis system 200 may include a task assigning device 210, a data storage device 220, and at least one network address analysis device 300. Here, each device may be implemented as a server, such as a file server, a database server, an application server, or a web server, or as a personal computer in a desktop or notebook configuration. It may also be implemented as part of a small portable (or mobile) electronic device, such as a cellular telephone, a personal digital assistant (PDA), a personal media player device, a wireless web-watch device, a personal headset device, an application-specific device, or a hybrid device that includes any of the above functions.

The task assigning device 210 is connected to the network switching device 120, and may receive the network address to be analyzed sent by the network switching device 120, and then send the network address to the network address analyzing device 300.

The data storage device 220 is connected to the task assigning device 210 and stores an analysis record of at least one network address and at least one network address set. Wherein the analysis record of the network address indicates that the network address was processed and includes a processing time for the network address, and the set of network addresses includes at least one network address pointing to similar network content.

According to an embodiment of the present invention, in order to save system resources and reduce device load, after receiving the network address to be analyzed, the task assigning device 210 may query whether an analysis record of the network address exists at the data storage device 220, i.e., determine whether the network address has been processed.

If no analysis record for the network address exists at the data storage device 220, indicating that the network address has not been processed, the network address may be sent to the network address analysis device 300 and an analysis record for the network address may be created in the data storage device 220.

If there is an analysis record for the network address at the data storage device 220, indicating that the network address has been processed, then a determination is made whether the difference between the processing time in the analysis record for the network address and the current time exceeds a time threshold (e.g., exceeds 1 hour or 24 hours).

If the time difference does not exceed the time threshold, the network address has already been processed and the network content to which it points can be considered unchanged and does not need to be processed again, so the network address is ignored. If the time difference exceeds the time threshold, the network address has been processed, but the network content to which it points may have changed, and therefore the network address is still sent to the network address analysis device 300, and the analysis record for the network address is updated in the data storage device 220.

As shown in fig. 2, the network address analysis system 200 may include a plurality of network address analysis devices 300, and at this time, when there is no analysis record of the network address at the data storage device 220, the task assigning device 210 may transmit the network address to the network address analysis device 300 whose current workload is the smallest, while recording the network address analysis device 300 in the analysis record. When there is an analysis record of the network address at the data storage device 220 and the time difference exceeds the time threshold, the task assigning device 210 may send the network address to the network address analysis device 300 that processed the network address in the analysis record.
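The dispatch logic described above can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the implementation of the invention: the names `AnalysisRecord`, `dispatch`, and `TIME_THRESHOLD_SECONDS` are hypothetical, in-memory dictionaries stand in for the data storage device 220, and the 24-hour value is only one of the examples mentioned in the text.

```python
import time
from dataclasses import dataclass

# Assumed value; the text mentions 1 hour or 24 hours as examples.
TIME_THRESHOLD_SECONDS = 24 * 60 * 60


@dataclass
class AnalysisRecord:
    processed_at: float  # time of the last processing, seconds since the epoch
    device_id: str       # analysis device that handled the address


def dispatch(address, records, device_loads, now=None):
    """Return the id of the device to send `address` to, or None if it is ignored.

    records: dict mapping address -> AnalysisRecord (stands in for storage)
    device_loads: dict mapping device id -> current workload
    """
    now = time.time() if now is None else now
    record = records.get(address)
    if record is None:
        # Never processed: pick the least-loaded device and create a record.
        device_id = min(device_loads, key=device_loads.get)
        records[address] = AnalysisRecord(processed_at=now, device_id=device_id)
        return device_id
    if now - record.processed_at <= TIME_THRESHOLD_SECONDS:
        # Processed recently: the content is treated as unchanged, so skip it.
        return None
    # Stale record: resend to the device recorded earlier and update the record.
    record.processed_at = now
    return record.device_id
```

Passing `now` explicitly keeps the sketch deterministic in tests; a real deployment would read the processing time from the stored analysis record.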

The network address analysis device 300 is connected to the task assigning device 210 and the data storage device 220, respectively, and is capable of receiving the network address sent by the task assigning device 210, processing the network address, and finally storing the network address in the data storage device 220 in a network address set pointing to similar network content.

Fig. 3 shows a block diagram of a network address analysis apparatus 300 according to an exemplary embodiment of the present invention. As shown in fig. 3, the network address analysis apparatus 300 may include a content acquisition module 310, a word extraction module 320, a feature generation module 330, and a set decision module 340.

The content obtaining module 310 may receive a network address from the task assigning apparatus 210 and obtain network content pointed to by the received network address.

The word extraction module 320 is connected to the content acquisition module 310 and obtains the text content included in the network content obtained by the content acquisition module 310. Because malicious network content is often displayed as a picture in order to evade analysis, according to an embodiment of the present invention the word extraction module 320 may further obtain a picture included in the network content, recognize the text content in the picture, and use it as part of the text content included in the network content.

After obtaining the text content, the word extraction module 320 divides the text content into at least one sentence according to the punctuation marks. Specifically, the content from the start of the text content to the first punctuation mark is taken as a sentence; from the first punctuation mark onward, the content between each punctuation mark and the next is taken as a sentence; and the content from the last punctuation mark to the end of the text content is taken as a sentence. The punctuation marks may include commas, periods, semicolons, and exclamation marks.
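The sentence division step can be illustrated with a short Python sketch. It assumes the punctuation set named in the text; including the full-width Chinese forms and discarding empty pieces are assumptions of this sketch, and `split_sentences` is a hypothetical name.

```python
import re

# Punctuation named in the text (comma, period, semicolon, exclamation mark).
# Including the full-width Chinese forms is an assumption of this sketch.
PUNCTUATION = ",.;!，。；！"
_SPLIT_RE = re.compile("[" + re.escape(PUNCTUATION) + "]")


def split_sentences(text):
    """Split text at each punctuation mark and drop empty pieces."""
    return [piece.strip() for piece in _SPLIT_RE.split(text) if piece.strip()]
```

For example, `split_sentences("Win a prize, click here! Hurry; offer ends soon.")` yields four sentences, one per span between punctuation marks.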

Next, the word extraction module 320 extracts, for each sentence, at least one key word of the sentence. Specifically, the word extraction module 320 first divides each sentence into at least one word. According to an embodiment of the present invention, the words may be divided by longest matching against a dictionary. For example, suppose the dictionary includes the words "black cloud", "black cloud capping", and "black cloud dense", and the sentence contains "... diffuse black cloud dense ...". The word extraction module 320 performs the longest match in the dictionary character by character: "diffuse" matches a word, and no longer dictionary word starts with "diffuse", so "diffuse" is divided off as one word. When matching starts at "black", "black cloud" matches a word, but "black cloud dense" matches a longer word, so "black cloud dense" is divided off as one word. The present invention does not limit the method of dividing words; any method that realizes the word-dividing function falls within the scope of the present invention.
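Longest matching against a dictionary can be sketched as a forward maximum-match loop. The Python sketch below is illustrative only: the function name `segment`, the fallback of emitting a single character when nothing matches, and the unspaced Latin example strings are assumptions chosen to mirror the example in the text.

```python
def segment(text, dictionary):
    """Forward longest-match segmentation over an unspaced string."""
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        match = text[i]  # fall back to a single character when nothing matches
        # Try the longest candidate first so that longer dictionary words win.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        words.append(match)
        i += len(match)
    return words
```

With the hypothetical dictionary `{"diffuse", "blackcloud", "blackcloudcapping", "blackclouddense"}`, the string `"diffuseblackclouddense"` is divided into `"diffuse"` and `"blackclouddense"`, because the longer entry is preferred over `"blackcloud"`.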

After the words are divided, the word extraction module 320 may extract the key words among them. In particular, the word extraction module 320 may be connected to a part-of-speech query device, through which it obtains the part of speech of each word. The part-of-speech query device may be implemented as most online dictionaries, and the queried parts of speech may include at least: n (noun), adv (adverb), prep (preposition), v (verb), pron (pronoun), abbr (abbreviation), and adj (adjective). Assume the following sentence: "the supervisors collect websites of malicious acts such as actual fraud on the user". The word extraction module 320 may divide it into the following words with their parts of speech: supervision (n), personnel (n), collect (v), on, user (n), conduct (v), actual (adj), fraud (v), etc. (aux), malicious (adj), behavior (n), of (prep), website (n).

Then, the word extraction module 320 may extract, from the divided words, the words corresponding to the subject, the predicate, and the object according to the part of speech of each word and preconfigured sentence structure rules, and use these words as key words. The preconfigured sentence structure rules may include at least the following: an adjective is an attributive modifier of a noun; a verb is a predicate, and in the absence of a passive structure the noun before the verb is the subject and the noun after the verb is the object; the noun after the particle "的" ("of") is taken as the head noun, and the head noun is the subject if it precedes the verb and the object if it follows the verb.

After the word extraction module 320 extracts the key words, the feature generation module 330 connected to the word extraction module 320 generates, for each key word extracted by the word extraction module 320, a feature vector of the key word in the network content according to the position of the key word in the text content, so that the feature vector of the at least one key word forms a feature vector group of the network content.

According to one embodiment of the invention, the feature vector may include a word identifier (WId) that uniquely identifies the key word, together with a word position identifier (WPos) and a sentence position identifier (SId) generated from the position of the key word in the text content. Specifically, the feature generation module 330 may obtain the at least one sentence that includes the key word and then, for each such sentence, generate the word position identifier and the sentence position identifier of a feature vector according to the position of the key word in the sentence and the position of the sentence in the text content, respectively.
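The construction of the feature vector group can be sketched in Python as follows. The text does not fix how WPos and SId are derived from positions, so this sketch assumes 1-based indices; assigning WId values from a shared dictionary in first-seen order is likewise an assumption, and `FeatureVector` and `build_feature_vectors` are hypothetical names.

```python
from typing import NamedTuple


class FeatureVector(NamedTuple):
    wid: int   # word identifier, unique per key word
    wpos: int  # position of the key word within its sentence (1-based, assumed)
    sid: int   # position of the sentence within the text (1-based, assumed)


def build_feature_vectors(sentences, word_ids):
    """Build the feature vector group of one piece of network content.

    sentences: list of (words, key_words) pairs in text order
    word_ids: shared dict mapping key word -> WId, extended as new words appear
    """
    vectors = []
    for sid, (words, key_words) in enumerate(sentences, start=1):
        for wpos, word in enumerate(words, start=1):
            if word in key_words:
                wid = word_ids.setdefault(word, len(word_ids) + 1)
                vectors.append(FeatureVector(wid, wpos, sid))
    return vectors
```

Sharing `word_ids` across all analyzed contents keeps WId values comparable between two feature vector groups, which the later difference calculation relies on.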

In summary, each occurrence of a key word in the network content is mapped to a three-dimensional vector (WId, WPos, SId). Generating the feature vectors of all key words therefore maps the network content pointed to by one network address to a group of three-dimensional vectors. In this way, a piece of network content can be expressed more intuitively: its key information forms a three-dimensional curved surface, and each point on the surface is the three-dimensional vector of a key word.

Meanwhile, judging whether two network addresses point to similar network content is converted into judging the similarity of two curved surfaces, which is a simple process with high accuracy, as explained below.

The set decision module 340 is connected to the feature generation module 330 and to the data storage device 220. Each network address set stored in the data storage device 220 includes at least one network address pointing to similar network content, together with the network content pointed to by each of those network addresses and the feature vector group of that network content.

The set decision module 340 receives the feature vector group of the network address to be analyzed generated by the feature generation module 330, and then calculates a difference value between the network content and the network address set according to the feature vector group for each network address set in the data storage device 220.

Specifically, the difference value between the network content and the network content pointed to by each network address in the network address set may be calculated first, and the difference value between the network content and the network address set may then be calculated from these per-address difference values. According to an embodiment of the present invention, the difference value between the network content and the network address set may equal the average of the difference values between the network content and the network content pointed to by the network addresses in the set, that is, the sum of those difference values divided by the number of network addresses in the set.
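The averaging described above can be written as a small Python sketch. The function name `set_difference` is hypothetical, and the pairwise difference between two network contents is passed in as a callable so that the sketch does not presume how that difference is computed.

```python
def set_difference(content, members, content_difference):
    """Mean of content_difference(content, member) over the members of a set.

    members: the network contents stored in one network address set
    content_difference: callable returning the difference of two contents
    """
    if not members:
        raise ValueError("a network address set holds at least one address")
    total = sum(content_difference(content, member) for member in members)
    return total / len(members)
```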

According to an embodiment of the present invention, the process of calculating the difference value between the two network contents by the set decision module 340 may be as follows:

First, the word identifiers are extracted from the feature vector group of the network content and from the feature vector group of the network content pointed to by the network address in the network address set. For each word identifier, the feature value of the corresponding key word is calculated in each of the two network contents.

Specifically, the set decision module 340 may calculate the feature value of the key word corresponding to a word identifier in a piece of network content as follows:

The feature vector group of the network content is searched for feature vectors containing the word identifier. If at least one is found, the feature value of the corresponding key word in the network content is calculated from the found feature vectors: for each found feature vector, a feature value of the key word in that feature vector is calculated from the word position identifier and the sentence position identifier in the feature vector, and the feature value of the key word in the network content is then calculated from its feature values in the individual feature vectors.

Here, the feature value of the key word in one feature vector may equal the sum of the squares of the word position identifier and the sentence position identifier in the feature vector, i.e., WPos² + SId². The feature value of the key word in a piece of network content may then equal the sum of its feature values in the individual feature vectors.

If the feature vector containing the word identifier does not exist in the feature vector group of the network content, the feature value of the key word corresponding to the word identifier in the network content is set to be 0.
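The feature value calculation can be sketched in Python as below, using the sum of squares WPos² + SId² given in the text. Representing each feature vector as a plain `(WId, WPos, SId)` tuple and the name `feature_value` are assumptions of this sketch.

```python
def feature_value(vectors, wid):
    """Feature value of the key word `wid` in one piece of network content.

    vectors: iterable of (WId, WPos, SId) tuples, the feature vector group
    Sums WPos**2 + SId**2 over the vectors carrying `wid`; 0 if none do.
    """
    return sum(wpos ** 2 + sid ** 2 for w, wpos, sid in vectors if w == wid)
```

Because `sum` over an empty sequence is 0, the case where no feature vector contains the word identifier needs no separate branch.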

Then, the difference value of the keyword in the two network contents can be calculated according to the feature value of the keyword in the two network contents, for example, the difference value of the keyword in the two network contents can be equal to the difference between the feature values of the keyword in the two network contents.

Finally, the difference value of the two network contents is calculated from the number of key words corresponding to the extracted word identifiers and the difference value of each key word. For example, the difference value of the two network contents may equal the average of the difference values of the extracted key words in the two network contents, that is, the sum of the difference values of the key words divided by the number of key words.
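The difference value of two network contents can be sketched in Python as follows. The inputs are assumed to be precomputed mappings from word identifier to feature value for each content. The text only says "the difference between the feature values"; taking the absolute difference, so that opposite-signed differences do not cancel in the average, is an assumption of this sketch, as are the name `content_difference` and the result for two contents without key words.

```python
def content_difference(values_a, values_b):
    """Average per-key-word difference between two network contents.

    values_a, values_b: dicts mapping WId -> feature value in each content.
    A WId missing from one content counts as feature value 0.
    """
    word_ids = set(values_a) | set(values_b)
    if not word_ids:
        return 0.0  # assumption: two contents without key words do not differ
    # Assumption: use the absolute difference of the two feature values.
    total = sum(abs(values_a.get(w, 0) - values_b.get(w, 0)) for w in word_ids)
    return total / len(word_ids)
```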

After calculating the difference between the network content and the network address set, the set determination module 340 may determine whether the network address and the network address set point to similar network content according to the difference between the network content and the network address set.

In particular, each set of network addresses stored by the data storage device 220 has a difference threshold. The set decision module 340 compares the difference between the network content and the network address set with the difference threshold of the network address set to determine whether the network address and the network address set point to similar network content.

If the difference value is greater than the difference threshold value of the network address set, it is determined that the network address and the network address set do not point to similar network content.

According to an embodiment of the present invention, if the set decision module 340 determines that the network address does not point to network content similar to that of any network address set in the data storage device 220, it may create a new network address set in the data storage device 220 and store the network address, the network content pointed to by the network address, and the feature vector group of the network content in the new network address set. The difference threshold of the new network address set is an initial difference threshold, which can generally be chosen empirically.

If the difference value is less than or equal to the difference threshold of the network address set, it is determined that the network address and the network address set point to similar network content, and the set determination module 340 stores the network address, the network content pointed to by the network address, and the feature vector group of the network content into the network address set. Note that one network address may be stored in multiple network address sets.
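
The decision logic described above may be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the set decision module 340: the names `AddressSet` and `assign_to_sets` are hypothetical, and the difference value between the new network content and each network address set is assumed to have been calculated beforehand.

```python
from dataclasses import dataclass, field


@dataclass
class AddressSet:
    """One network address set with its own difference threshold (hypothetical type)."""
    threshold: float
    members: list = field(default_factory=list)  # (address, content, feature vectors)


def assign_to_sets(entry, differences, sets, initial_threshold):
    """Store `entry` in every set whose difference threshold admits it.

    differences[i] is the precomputed difference value between the new
    network content and sets[i]. If no existing set admits the entry, a
    new set is created with the initial difference threshold.
    Returns the list of sets in which the entry was stored.
    """
    matched = [s for s, d in zip(sets, differences) if d <= s.threshold]
    if not matched:
        new_set = AddressSet(threshold=initial_threshold)
        sets.append(new_set)
        matched = [new_set]
    for s in matched:
        s.members.append(entry)
    return matched
```

Because every admitting set receives the entry, one network address can end up in several network address sets, as noted above.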

According to an embodiment of the present invention, the set decision module 340 may further update the difference threshold of the network address set after storing the network address, the network content pointed to by the network address, and the feature vector group of the network content into the network address set. Specifically, for every pair of network addresses in the network address set, the difference value of the network contents pointed to by the two network addresses is calculated. The difference threshold of the network address set is then calculated at least from these pairwise difference values.

Further, according to another embodiment of the present invention, the difference threshold of the network address set may be calculated as follows: a reference difference threshold of the network address set is calculated from the difference values of the network contents pointed to by every pair of network addresses, and the difference threshold of the network address set is then calculated from the initial difference threshold, the reference difference threshold, and their respective weights. The weight of the initial difference threshold decreases, and the weight of the reference difference threshold increases, as the number of network addresses in the network address set grows.

Therefore, when only a small number of network addresses have been added to the network address set, the calculated difference threshold does not deviate too far from the initial difference threshold. This avoids the situation in which, after a few highly similar network addresses are added, the difference threshold shifts excessively and the bar for similarity rises so much that the next similar network address can no longer be added to the network address set.

As the number of network addresses in the network address set increases, the initial difference threshold loses its value as a reference, so its weight is gradually reduced. Once the network address set contains a large number of network addresses, the newly calculated difference threshold approaches the average value for the network address set, which makes the determination of similar network content more accurate.
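
One possible weighting scheme is sketched below. The description does not fix the weight functions or the way the reference difference threshold is derived, so everything specific here is an assumption made for illustration: the reference threshold is taken as the mean of all pairwise difference values, the weight of the reference threshold is n / (n + k), and `k` is a hypothetical smoothing constant.

```python
from itertools import combinations


def updated_threshold(contents, difference, initial_threshold, k=10):
    """Blend the initial threshold with a reference threshold.

    contents: the network contents (in any representation) of one address set
    difference: callable returning the difference value of two contents
    k: hypothetical smoothing constant; a larger k keeps the initial
       threshold influential for longer
    """
    n = len(contents)
    pairs = list(combinations(contents, 2))
    if not pairs:
        return initial_threshold  # a single address has no pairwise difference
    # Assumed reference threshold: mean of all pairwise difference values.
    reference = sum(difference(a, b) for a, b in pairs) / len(pairs)
    w_reference = n / (n + k)      # grows with the number of addresses
    w_initial = 1.0 - w_reference  # shrinks correspondingly
    return w_initial * initial_threshold + w_reference * reference
```

With two addresses and k = 10 the initial threshold still carries a weight of 10/12, whereas with a thousand addresses the result is dominated by the reference threshold.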

The following example further explains a process in which the network address analysis device 300 calculates a difference value between two network contents.

Assume that there are a network address A: com and a network address B: com.

The network content pointed to by the network address A is acquired, and the text content contained in it is: "Congratulations, you get the second prize of the company's anniversary celebration. You get a surprise bonus of ￥98000 issued by the company, together with an Apple notebook computer sponsored by Apple."

The text content contained in the network content pointed to by the network address A is divided into the following sentences:

Sentence 1: Congratulations, you get the second prize of the company's anniversary celebration;

Sentence 2: You get a surprise bonus of ￥98000 issued by the company, together with an Apple notebook computer sponsored by Apple.

Key words corresponding to the subject (S), the predicate (P), and the object (O) are extracted from sentence 1:

you (S), get (P), prize (O);

Key words corresponding to the subject (S), the predicate (P), and the object (O) are extracted from sentence 2:

you (S), get (P), bonus (O), computer (O).

For the key word "you," the word identifier WId, which uniquely identifies the key word, is 1. "You" has a position order of 1 among the key words of sentence 1, so the word position identifier WPos is 1, and sentence 1 has a position order of 1 among the sentences of the network content, so the sentence position identifier SId is 1, and the feature vector (1, 1, 1) is generated. "You" also has a position order of 1 among the key words of sentence 2, so WPos is 1, and sentence 2 has a position order of 2 among the sentences of the network content, so SId is 2, and the feature vector (1, 1, 2) is generated. The key word "you" thus has two feature vectors, (1, 1, 1) and (1, 1, 2).

For the key word "get," the word identifier WId, which uniquely identifies the key word, is 2. "Get" has a position order of 2 among the key words of sentence 1, so WPos is 2, and sentence 1 has a position order of 1 among the sentences of the network content, so SId is 1, and the feature vector (2, 2, 1) is generated. "Get" also has a position order of 2 among the key words of sentence 2, so WPos is 2, and sentence 2 has a position order of 2 among the sentences of the network content, so SId is 2, and the feature vector (2, 2, 2) is generated. The key word "get" thus has two feature vectors, (2, 2, 1) and (2, 2, 2).

For the keyword "prize," the term identification WId, which may uniquely identify the keyword, is 3. The "prize" is 3 in the order of position in all key words of sentence 1, so the word position identity WPos is 3, and sentence 1 is 1 in the order of position in all sentences of the network content, so the sentence position identity SId is 1, and finally one feature vector (3, 3, 1) of the key word "prize" is generated.

For the key term "bonus," the term identification WId that can uniquely identify the key term is 4. The "bonus" is located in order 3 in all key words of sentence 2, so the word location identity WPos is 3, sentence 2 is located in order 2 in all sentences of the web content, so the sentence location identity SId is 2, and finally one feature vector (4, 3, 2) of the key word "bonus" is generated.

For the key word "computer," the word identifier WId, which uniquely identifies the key word, is 5. "Computer" has a position order of 4 among the key words of sentence 2, so the word position identifier WPos is 4, and sentence 2 has a position order of 2 among the sentences of the network content, so the sentence position identifier SId is 2, and finally one feature vector (5, 4, 2) of the key word "computer" is generated.

Therefore, the feature vector group of the network content pointed to by the network address a is { (1, 1, 1), (1, 1, 2), (2, 2, 1), (2, 2, 2), (3, 3, 1), (4, 3, 2), (5, 4, 2) }.
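
The generation of the feature vector group for the network address A can be reproduced with a short sketch. The function name `build_feature_vectors` and the dictionary-based allocation of word identifiers are assumptions made for illustration; the key words per sentence are taken as already extracted. The vectors are emitted in sentence order rather than grouped by key word, so the result is the same group listed in a different order.

```python
def build_feature_vectors(sentences, word_ids):
    """Return the feature vector group (WId, WPos, SId) of a text.

    sentences: list of key-word lists, one per sentence, in text order
    word_ids: dict mapping each key word to its unique WId; unseen
              key words receive the next free identifier
    """
    vectors = []
    for s_id, key_words in enumerate(sentences, start=1):
        for w_pos, word in enumerate(key_words, start=1):
            w_id = word_ids.setdefault(word, len(word_ids) + 1)
            vectors.append((w_id, w_pos, s_id))
    return vectors


word_ids = {}
content_a = [["you", "get", "prize"], ["you", "get", "bonus", "computer"]]
group_a = build_feature_vectors(content_a, word_ids)
print(sorted(group_a))
```

Passing the same `word_ids` dictionary when processing further network content keeps each key word's identifier stable across contents, which the later comparison by word identifier relies on.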

Similarly, the network content pointed to by the network address B is acquired, and the text content contained in it is: "Your account has been drawn by the system as a second-prize lucky user of the anniversary celebration event. You can get the surprise bonus issued by the company for you and one computer."

The text content contained in the network content pointed to by the network address B is divided into the following sentences:

Sentence 1: Your account has been drawn by the system as a second-prize lucky user of the anniversary celebration event;

Sentence 2: You can get the surprise bonus issued by the company for you and one computer.

Key words corresponding to the subject (S), the predicate (P), and the object (O) are extracted from sentence 1:

system (S), draw (P), user (O);

Key words corresponding to the subject (S), the predicate (P), and the object (O) are extracted from sentence 2:

you (S), get (P), bonus (O), computer (O).

For the keyword "system," the word identification WId that can uniquely identify the keyword is 6. The "system" has a position order of 1 in all the key words of the sentence 1, and thus the word position identification WPos is 1, and the sentence 1 has a position order of 1 in all the sentences of the network content, and thus the sentence position identification SId is 1, and finally one feature vector (6, 1, 1) of the key word "system" is generated.

For the key word "draw," the word identifier WId, which uniquely identifies the key word, is 7. "Draw" has a position order of 2 among the key words of sentence 1, so the word position identifier WPos is 2, and sentence 1 has a position order of 1 among the sentences of the network content, so the sentence position identifier SId is 1, and finally one feature vector (7, 2, 1) of the key word "draw" is generated.

For the key word "user," the word identifier WId, which uniquely identifies the key word, is 8. "User" has a position order of 3 among the key words of sentence 1, so the word position identifier WPos is 3, and sentence 1 has a position order of 1 among the sentences of the network content, so the sentence position identifier SId is 1, and finally one feature vector (8, 3, 1) of the key word "user" is generated.

For the key word "you," the word identifier WId, which uniquely identifies the key word, is 1. "You" has a position order of 1 among the key words of sentence 2, so the word position identifier WPos is 1, and sentence 2 has a position order of 2 among the sentences of the network content, so the sentence position identifier SId is 2, and finally one feature vector (1, 1, 2) of the key word "you" is generated.

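For the key word "get," the word identifier WId, which uniquely identifies the key word, is 2. "Get" has a position order of 2 among the key words of sentence 2, so the word position identifier WPos is 2, and sentence 2 has a position order of 2 among the sentences of the network content, so the sentence position identifier SId is 2, and finally one feature vector (2, 2, 2) of the key word "get" is generated.

For the key word "bonus," the word identifier WId, which uniquely identifies the key word, is 4. "Bonus" has a position order of 3 among the key words of sentence 2, so the word position identifier WPos is 3, and sentence 2 has a position order of 2 among the sentences of the network content, so the sentence position identifier SId is 2, and finally one feature vector (4, 3, 2) of the key word "bonus" is generated.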

For the key word "computer," the word identifier WId, which uniquely identifies the key word, is 5. "Computer" has a position order of 4 among the key words of sentence 2, so the word position identifier WPos is 4, and sentence 2 has a position order of 2 among the sentences of the network content, so the sentence position identifier SId is 2, and finally one feature vector (5, 4, 2) of the key word "computer" is generated.

Therefore, the feature vector group of the network content pointed to by the network address B is { (6, 1, 1), (7, 2, 1), (8, 3, 1), (1, 1, 2), (2, 2, 2), (4, 3, 2), (5, 4, 2) }.

The feature value of each key word in the network content pointed to by the network address A and in the network content pointed to by the network address B is calculated as follows:

In the network content pointed to by the network address A, the feature value of the key word "you" is 1×1 + 1×1 + 1×1 + 2×2 = 7, the feature value of the key word "get" is 2×2 + 1×1 + 2×2 + 2×2 = 13, the feature value of the key word "prize" is 3×3 + 1×1 = 10, the feature value of the key word "bonus" is 3×3 + 2×2 = 13, the feature value of the key word "computer" is 4×4 + 2×2 = 20, and the feature values of the key words "system", "draw" and "user" are all 0.

In the network content pointed to by the network address B, the feature value of the key word "you" is 1×1 + 2×2 = 5, the feature value of the key word "get" is 2×2 + 2×2 = 8, the feature value of the key word "prize" is 0, the feature value of the key word "bonus" is 3×3 + 2×2 = 13, the feature value of the key word "computer" is 4×4 + 2×2 = 20, the feature value of the key word "system" is 1×1 + 1×1 = 2, the feature value of the key word "draw" is 2×2 + 1×1 = 5, and the feature value of the key word "user" is 3×3 + 1×1 = 10.

The difference value of each key word between the network contents pointed to by the network address A and the network address B is then calculated as follows:

Between the network contents pointed to by the network address A and the network address B, the difference value of the key word "you" is 7 - 5 = 2, the difference value of the key word "get" is 13 - 8 = 5, the difference value of the key word "prize" is 10 - 0 = 10, the difference value of the key word "bonus" is 13 - 13 = 0, the difference value of the key word "computer" is 20 - 20 = 0, the difference value of the key word "system" is 2 - 0 = 2, the difference value of the key word "draw" is 5 - 0 = 5, and the difference value of the key word "user" is 10 - 0 = 10.

Therefore, the difference value between the network content pointed to by the network address A and the network content pointed to by the network address B is (2 + 5 + 10 + 0 + 0 + 2 + 5 + 10) / 8 = 4.25.
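
The arithmetic of this example can be checked with a short sketch that works directly on the two feature vector groups given above. The function names are hypothetical, and the use of the absolute difference and of the plain average follows the example rather than any mandated formula.

```python
from collections import defaultdict


def feature_values(vectors):
    """Sum WPos^2 + SId^2 over all feature vectors of each word identifier."""
    values = defaultdict(int)
    for w_id, w_pos, s_id in vectors:
        values[w_id] += w_pos ** 2 + s_id ** 2
    return values


def content_difference(vectors_a, vectors_b):
    """Average absolute feature-value difference over all word identifiers."""
    fa, fb = feature_values(vectors_a), feature_values(vectors_b)
    word_ids = set(fa) | set(fb)
    if not word_ids:
        return 0.0  # assumed convention for two empty feature vector groups
    # A word identifier absent from one content has a feature value of 0 there.
    return sum(abs(fa.get(w, 0) - fb.get(w, 0)) for w in word_ids) / len(word_ids)


group_a = [(1, 1, 1), (1, 1, 2), (2, 2, 1), (2, 2, 2), (3, 3, 1), (4, 3, 2), (5, 4, 2)]
group_b = [(6, 1, 1), (7, 2, 1), (8, 3, 1), (1, 1, 2), (2, 2, 2), (4, 3, 2), (5, 4, 2)]
print(content_difference(group_a, group_b))  # 4.25
```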

Suppose a network address set contains only the network address B, the network content pointed to by the network address B, and the feature vector group of that network content, and has a difference threshold of 5. The difference value between the network content pointed to by the network address A and this network address set is then the difference value between the network contents pointed to by the network address A and the network address B, namely 4.25. Since this difference value is smaller than the difference threshold of the network address set, it is determined that the network address A and the network address set point to similar network content, and the network address A, the network content pointed to by the network address A, and the feature vector group of that network content may be stored in the network address set.

In this way, the network address analysis device 300 stores network addresses that point to similar network content in the same network address set. Given one malicious network address, other malicious network addresses pointing to similar network content can therefore be supplied to the malicious network address storage device 130. This greatly supplements the samples of the malicious network address storage device 130, significantly improves its coverage, and allows a safer service to be provided to users.

Fig. 4 illustrates a schematic diagram of supplementing a malicious network address according to an exemplary embodiment of the present invention. As shown in Fig. 4, when a supervisor needs to collect the malicious network addresses actually used to defraud users or to carry out other malicious activities against them, the supervisor may first query the malicious network addresses in the malicious network address storage device 130. Then, the malicious network address storage device 130 sends any malicious network address that it does not cover to the network address analysis system 200 of the present invention for query.

The network address analysis system 200 may query whether the malicious network address exists in the data storage device 220, and if so, send all contents (including all contents of the malicious network address) of the network address set to which the queried malicious network address belongs to the malicious network address storage device 130, so that the malicious network address storage device 130 analyzes and supplements the sample, thereby improving the coverage rate of the malicious network address storage device 130.

Fig. 5 shows a flow diagram of a network address analysis method 500 according to an example embodiment of the present invention. The network address analysis method 500 is suitable for being executed in a network address analysis system 200, the network address analysis system 200 includes a task assigning device 210, a data storage device 220, and a network address analysis device 300, the data storage device 220 stores analysis records of network addresses, and a network address set, the network address set includes at least one network address pointing to similar network content, and the network address analysis method 500 starts with step S510.

In step S510, a network address to be analyzed is received via the task assigning device 210. According to an embodiment of the present invention, after the network address to be analyzed is received via the task assigning device 210, the data storage device 220 may be queried for an analysis record of the network address. If no analysis record exists, the network address is sent to the network address analysis device 300. If an analysis record exists but the difference between the processing time in the analysis record and the current time exceeds the time threshold, the network address is likewise sent to the network address analysis device 300.
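
The check described in step S510 may be sketched as follows. The function name `needs_analysis` and the representation of a missing analysis record as `None` are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def needs_analysis(record_time, now, time_threshold):
    """Decide whether a network address must be sent for analysis.

    record_time: processing time stored in the analysis record, or None
                 if the data storage device holds no record for the address
    now: the current time
    time_threshold: maximum allowed age of an analysis record (timedelta)
    """
    if record_time is None:
        return True  # the address has never been analyzed
    return now - record_time > time_threshold  # the record is too old


now = datetime(2016, 12, 22, 12, 0)
print(needs_analysis(None, now, timedelta(days=1)))                     # True
print(needs_analysis(now - timedelta(hours=1), now, timedelta(days=1)))  # False
```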

Then, in step S520, the network address is processed by the network address analysis device 300, so as to store the network address in the data storage device 220 in the network address set pointing to the similar network content.

Fig. 6 shows a flow diagram of a method 600 for processing a network address according to an example embodiment of the present invention. The network address processing method 600 is suitable for being executed in the network address analyzing device 300, the network address analyzing device 300 is connected to the data storage device 220, the data storage device 220 stores at least one network address set, the network address set comprises at least one network address pointing to similar network content, and the network content and the feature vector set of the network content pointed by each network address in the at least one network address, and the network address processing method 600 starts with step S610.

In step S610, the network content pointed to by the network address is acquired. Then, in step S620, the text content contained in the network content is acquired. According to an embodiment of the present invention, step S620 may include acquiring pictures contained in the network content and recognizing the text content contained in the pictures.

In step S630, the text content is divided into at least one sentence according to punctuation marks. Then, in step S640, for each sentence, at least one key word of the sentence is extracted.
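
Step S630 may be sketched as a split at sentence-ending punctuation. The set of punctuation marks below is an assumption made for illustration (the description does not enumerate them), and the name `split_sentences` is hypothetical.

```python
import re

# Assumed sentence-ending punctuation: Western marks and their full-width
# Chinese counterparts.
SENTENCE_END = re.compile(r"[.!?;。！？；]+")


def split_sentences(text):
    """Split text into non-empty sentences at sentence-ending punctuation."""
    return [part.strip() for part in SENTENCE_END.split(text) if part.strip()]


print(split_sentences("You get a prize. Claim it now!"))
# ['You get a prize', 'Claim it now']
```

A text without any of these marks is returned as a single sentence. Splitting on every period also breaks decimal numbers and abbreviations apart, so a practical implementation would need a more careful rule.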

According to an embodiment of the present invention, the network address analysis device 300 may be coupled to a part-of-speech query device, and step S640 may include: dividing the sentence into at least one word, acquiring the part of speech of each word through the part-of-speech query device, and finally extracting, as key words, the words corresponding to the subject, the predicate, and the object among the at least one word according to the part of speech of each word and a preset sentence structure rule.
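
A deliberately simplified sketch of step S640 is given below. It is not the sentence structure rule of the invention, which the description leaves open: the lookup table `POS` merely stands in for the part-of-speech query device, the sentence is assumed to be already divided into words, and the rule assumes plain subject-verb-object order.

```python
# Hypothetical part-of-speech table standing in for the query device.
POS = {"you": "pronoun", "get": "verb", "a": "article",
       "second-class": "adjective", "prize": "noun"}


def extract_key_words(words, pos_lookup):
    """Pick subject, predicate and object words from a segmented sentence.

    Assumed rule: the last noun or pronoun before the first verb is the
    subject, the first verb is the predicate, and every noun after the
    verb is an object.
    """
    tags = [pos_lookup.get(w, "unknown") for w in words]
    if "verb" not in tags:
        return []  # no predicate found, so no key words are extracted
    v = tags.index("verb")
    subjects = [w for w, t in zip(words[:v], tags[:v]) if t in ("noun", "pronoun")]
    objects = [w for w, t in zip(words[v + 1:], tags[v + 1:]) if t == "noun"]
    return subjects[-1:] + [words[v]] + objects


print(extract_key_words(["you", "get", "a", "second-class", "prize"], POS))
# ['you', 'get', 'prize']
```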

After extracting the key words, in step S650, for each key word, a feature vector of the key word in the network content is generated according to the position of the key word in the text content, so that a feature vector group of the network content is formed by the feature vector of at least one key word.

According to an embodiment of the present invention, the feature vector may include a word identifier, a word position identifier, and a sentence position identifier of the key word, where the word identifier uniquely identifies the key word, and step S650 may include: acquiring at least one sentence containing the key word, and, for each such sentence, generating the word position identifier and the sentence position identifier of a feature vector according to the position of the key word in the sentence and the position of the sentence in the text content.

After the feature vector group is generated, in step S660, for each network address set in the data storage device 220, a difference value between the network content and the network address set is calculated according to the feature vector group.

According to an embodiment of the present invention, the step of calculating a difference value between the network content and the network address set may include: calculating the difference value between the network content and the network content pointed to by each network address in the network address set, and then calculating the difference value between the network content and the network address set from these per-address difference values.

According to another embodiment of the present invention, the step of calculating the difference value between the network content and the network content pointed to by a network address in the network address set may include: extracting the word identifiers from the feature vector groups of the network content and of the network content pointed to by the network address in the network address set; for each word identifier, calculating the feature value of the corresponding key word in each of the two network contents; calculating the difference value of the key word in the two network contents from these feature values; and finally calculating the difference value of the two network contents according to the number of extracted key words and the difference value of each key word.

Specifically, according to another embodiment of the present invention, the step of calculating the feature value of the key word corresponding to the word identifier in the network content may include: searching the feature vector group of the network content for at least one feature vector containing the word identifier; if such a feature vector exists, calculating the feature value of the key word corresponding to the word identifier in the network content according to the at least one feature vector; and if not, setting the feature value of the key word corresponding to the word identifier in the network content to 0.

The step of calculating the feature value of the key word corresponding to the word identifier in the network content according to at least one feature vector may further include: for each of the at least one feature vector, calculating a feature value of the key word in the feature vector according to the word position identifier and the sentence position identifier, and calculating a feature value of the key word in the network content according to the feature value of the key word in each feature vector.

After the difference value between the network content and the network address set is calculated, in step S670, it is determined whether the difference value between the network content and the network address set is smaller than or equal to the difference threshold of the network address set.

In step S680, if the difference value is smaller than or equal to the difference threshold of the network address set, it is determined that the network address and the network address set point to similar network content, and the network address, the network content pointed by the network address, and the feature vector set of the network content are stored in the network address set.

According to an embodiment of the present invention, the method 600 may further include the following step: if the difference value between the network content and the network address set is larger than the difference threshold of the network address set, determining that the network address and the network address set do not point to similar network content. Further, if it is determined that the network address does not point to network content similar to that of any network address set in the data storage device 220, a new network address set may be created in the data storage device 220, the network address, the network content pointed to by the network address, and the feature vector group of the network content are stored in the new network address set, and the difference threshold of the new network address set is set to the initial difference threshold.

According to another embodiment of the present invention, the method 600 may further comprise the following step: after the network address, the network content pointed to by the network address, and the feature vector group of the network content are stored in the network address set, the difference threshold of the network address set may be updated. Specifically, for every pair of network addresses in the network address set, the difference value of the network contents pointed to by the two network addresses is calculated, and the difference threshold of the network address set is then calculated at least from these pairwise difference values.

According to another embodiment of the present invention, the step of calculating the difference threshold of the network address set at least from the difference values of the network contents pointed to by every pair of network addresses may further include: calculating a reference difference threshold of the network address set from these pairwise difference values, and calculating the difference threshold of the network address set from the initial difference threshold, the reference difference threshold, and their respective weights, wherein the weight of the initial difference threshold decreases, and the weight of the reference difference threshold increases, as the number of network addresses in the network address set grows.

The processing in each of these steps has already been explained in detail in the description of the network address analysis system 200 and the network address analysis device 300 with reference to Figs. 1 to 5, and is not repeated here.

It should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.

Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

The present invention may further comprise:

A5, the method as in A4, wherein the step of calculating the difference value between the network content and the network address set comprises: calculating the difference value between the network content and the network content pointed to by each network address in the network address set; and calculating the difference value between the network content and the network address set according to the difference value between the network content and the network content pointed to by each network address.

A6, the method as in A5, wherein the step of calculating the difference value between the network content and the network content pointed to by each network address in the network address set comprises: extracting the word identifiers in the feature vector groups of the network content and of the network content pointed to by the network address in the network address set; for each word identifier, respectively calculating a feature value of the key word corresponding to the word identifier in the network content and in the network content pointed to by the network address in the network address set; calculating the difference value of the key word in the two network contents according to the feature values of the key word in the two network contents; and calculating the difference value of the two network contents according to the number of the extracted key words and the difference value of each key word.

A7, the method as in A6, wherein the step of calculating the feature value of the key word corresponding to the word identifier in the network content comprises: searching whether at least one feature vector containing the word identifier exists in the feature vector group of the network content; if so, calculating the feature value of the key word corresponding to the word identifier in the network content according to the at least one feature vector; and if not, setting the feature value of the key word corresponding to the word identifier in the network content to 0.

A8, the method as in A7, wherein the step of calculating the feature value of the key word corresponding to the word identifier in the network content according to the at least one feature vector comprises: for each of the at least one feature vector, calculating a feature value of the key word in the feature vector according to the word position identifier and the sentence position identifier; and calculating the feature value of the key word in the network content according to the feature value of the key word in each feature vector.

A9, the method of any one of A1-A8, wherein the method further comprises the step of: if the difference value between the network content and the network address set is larger than the difference threshold of the network address set, determining that the network address and the network address set do not point to similar network content.

A10, the method of A9, wherein the method further comprises the steps of: if the network address and all network address sets in the data storage device do not point to similar network content, creating a new network address set in the data storage device; storing the network address, the network content pointed to by the network address, and the feature vector group of the network content in the new network address set; and setting the difference threshold of the new network address set to the initial difference threshold.

A11, the method of A10, wherein the method further comprises the step of: after storing the network address, the network content pointed to by the network address, and the feature vector group of the network content into a network address set, updating the difference threshold of the network address set, including: for every pair of network addresses in the network address set, calculating the difference value of the network contents pointed to by the two network addresses; and calculating the difference threshold of the network address set at least according to the difference values of the network contents pointed to by every pair of network addresses.

A12, the method as in A11, wherein the step of calculating the difference threshold of the network address set at least according to the difference values of the network contents pointed to by every pair of network addresses further comprises: calculating a reference difference threshold of the network address set according to the difference values of the network contents pointed to by every pair of network addresses; and calculating the difference threshold of the network address set according to the initial difference threshold, the reference difference threshold, and their respective weights; wherein the weights of the initial difference threshold and the reference difference threshold respectively decrease and increase as the number of network addresses in the network address set increases.

B14. The method according to B13, wherein the method further comprises the steps of: after receiving a network address to be analyzed via the task allocation device, querying whether an analysis record for the network address exists at the data storage device; if not, sending the network address to the network address analysis device; and if the difference between the processing time in the analysis record and the current time exceeds the time threshold, sending the network address to the network address analysis device.

C19. The apparatus as in C18, wherein the set decision module is adapted to: calculate the difference value between the network content and the network content pointed to by each network address in the network address set; and calculate the difference value between the network content and the network address set according to the difference value between the network content and the network content pointed to by each network address.

C20. The apparatus as in C19, wherein the set decision module is further adapted to: extract the word identifiers in the feature vector group of the network content and in the feature vector group of the network content pointed to by the network address in the network address set; for each word identifier, calculate, respectively, the feature value of the key word corresponding to the word identifier in the network content and in the network content pointed to by the network address in the network address set; calculate the difference value of the key word between the two network contents according to the feature values of the key word in the two network contents; and calculate the difference value of the two network contents according to the number of extracted key words and the difference value of each key word.

C21. The apparatus as in C20, wherein the set decision module is further adapted to: search the feature vector group of the network content for at least one feature vector containing the word identifier; if such a feature vector exists, calculate the feature value of the key word corresponding to the word identifier in the network content according to the at least one feature vector; and if not, set the feature value of the key word corresponding to the word identifier in the network content to 0.

C22. The apparatus as in C21, wherein the set decision module is further adapted to: for each of the at least one feature vector, calculate the feature value of the key word in the feature vector according to the word position identifier and the sentence position identifier; and calculate the feature value of the key word in the network content according to the feature value of the key word in each feature vector.

C23. The apparatus of any one of C15-C22, wherein the set decision module is further adapted to: if the difference value between the network content and the network address set is greater than the difference threshold of the network address set, determine that the network address and the network address set do not point to similar network content.

C24. The apparatus as in C23, wherein the set decision module is further adapted to: if the network address and every network address set in the data storage device do not point to similar network content, create a new network address set in the data storage device; store the network address, the network content pointed to by the network address and the feature vector group of the network content into the new network address set; and set the difference threshold of the new network address set to the initial difference threshold.

C25. The apparatus as in C24, wherein the set decision module is further adapted to: after storing the network address, the network content pointed to by the network address, and the feature vector group of the network content into a network address set, update the difference threshold of the network address set, including: for every pair of network addresses in the network address set, calculating the difference value of the network contents pointed to by the two network addresses; and calculating the difference threshold of the network address set according to the difference value of the network contents pointed to by every pair of network addresses.

C26. The apparatus as in C25, wherein the set decision module is further adapted to: calculate a reference difference threshold of the network address set according to the difference value of the network contents pointed to by every pair of network addresses; and calculate the difference threshold of the network address set according to the initial difference threshold, the reference difference threshold and their respective weights; wherein the weight of the initial difference threshold decreases and the weight of the reference difference threshold increases as the number of network addresses in the network address set increases.

D28. The system of D27, wherein the task allocation device is further adapted to: after receiving a network address to be analyzed, query whether an analysis record for the network address exists at the data storage device; if not, send the network address to the network address analysis device; and if the difference between the processing time of the network address in the analysis record and the current time exceeds the time threshold, send the network address to the network address analysis device.

Furthermore, those skilled in the art will appreciate that, while some embodiments described herein include some features but not others that are included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

Furthermore, some of the embodiments are described herein as a method or a combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out such a method or method element thus forms a means for carrying out the method or method element. Further, an element of an apparatus embodiment described herein is an example of a means for carrying out the function performed by that element for the purpose of carrying out the invention.

As used herein, unless otherwise specified, the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.

While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense, and the scope of the present invention is defined by the appended claims.

Claims

1. A network address processing method adapted to be executed in a network address analysis device connected to a data storage device, the data storage device storing at least one network address set, each network address set including at least one network address pointing to similar network content, together with the network content to which each of the at least one network address points and a feature vector group of that network content, the method comprising the steps of:

acquiring network content pointed by the network address;

acquiring text content contained in the network content;

dividing the text content into at least one sentence according to punctuation marks;

for each sentence, extracting at least one key word of the sentence;

for each key word, generating a feature vector of the key word in the network content so as to form a feature vector group of the network content by the feature vector of the at least one key word, wherein the feature vector comprises a word position identifier and a sentence position identifier of the key word, the word position identifier is generated according to the position of the key word in the sentence, and the sentence position identifier is generated according to the position of the sentence in the text content;

for each set of network addresses in the data storage device,

calculating the difference value between the network content and the network address set according to the feature vector group;

if the difference value is less than or equal to the difference threshold value of the network address set, determining that the network address and the network address set point to similar network content; and

storing the network address, the network content pointed to by the network address and the feature vector group of the network content into the network address set.
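As a non-limiting illustration, the sentence-splitting and feature-vector steps of claim 1 might be sketched in Python as follows. The names `FeatureVector`, `split_sentences` and `build_feature_vectors` are hypothetical, the punctuation set is an assumption, and the key words are passed in as a ready-made set (rather than extracted per sentence) purely to keep the sketch short.

```python
import re
from typing import NamedTuple

class FeatureVector(NamedTuple):
    word_id: str       # identifies the key word (assumption: the lower-cased word itself)
    word_pos: int      # 1-based position of the key word within its sentence
    sentence_pos: int  # 1-based position of the sentence within the text content

def split_sentences(text: str) -> list[str]:
    """Split text into sentences at an assumed set of Western and CJK punctuation marks."""
    parts = re.split(r"[.!?;。！？；]+", text)
    return [p.strip() for p in parts if p.strip()]

def build_feature_vectors(text: str, keywords: set[str]) -> list[FeatureVector]:
    """Emit one feature vector per occurrence of a key word in the text."""
    vectors = []
    for s_idx, sentence in enumerate(split_sentences(text), start=1):
        # Whitespace tokenisation is a simplification; real word segmentation is not shown.
        for w_idx, word in enumerate(sentence.split(), start=1):
            if word.lower() in keywords:
                vectors.append(FeatureVector(word.lower(), w_idx, s_idx))
    return vectors
```

A key word that occurs in several sentences yields several feature vectors, which together with the vectors of the other key words form the feature vector group of the network content.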

2. The method of claim 1, wherein the step of acquiring the text content contained in the network content comprises:

acquiring pictures contained in the network content; and

identifying text content included in the picture.

3. The method of claim 1, wherein the network address analysis device is coupled to a part-of-speech query device, and the step of extracting at least one key word of the sentence comprises:

dividing the sentence into at least one word;

acquiring the part of speech of each word via the part-of-speech query device; and

extracting, as key words, the words corresponding to the subject, the predicate and the object among the at least one word, according to the part of speech of each word and a preset sentence structure rule.
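As a non-limiting illustration, the key word extraction of claim 3 might be sketched as follows. The lookup table `POS_LOOKUP` is a hypothetical stand-in for the part-of-speech query device, and the sentence structure rule used here (first noun before the first verb is the subject, the first verb is the predicate, the first noun after it is the object) is an assumption, not a rule stated in this document.

```python
# Hypothetical stand-in for the part-of-speech query device: a fixed lookup table.
POS_LOOKUP = {
    "users": "noun", "download": "verb", "software": "noun",
    "the": "det", "free": "adj", "quickly": "adv",
}

def extract_keywords(sentence: str, pos_lookup: dict = POS_LOOKUP) -> list:
    """Return the subject, predicate and object of a sentence under the assumed rule."""
    words = sentence.lower().split()  # whitespace tokenisation, for illustration only
    subject = predicate = obj = None
    for word in words:
        tag = pos_lookup.get(word, "unknown")
        if tag == "noun" and predicate is None and subject is None:
            subject = word            # first noun seen before any verb
        elif tag == "verb" and predicate is None:
            predicate = word          # first verb
        elif tag == "noun" and predicate is not None and obj is None:
            obj = word                # first noun after the verb
    return [w for w in (subject, predicate, obj) if w is not None]
```

Words that play none of the three roles, such as adverbs or determiners, are simply skipped, so each sentence contributes at most three key words in this sketch.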

4. The method of claim 1, wherein the feature vector includes a word identifier, a word position identifier, and a sentence position identifier of a key word, the word identifier uniquely identifying the key word, and the step of generating at least one feature vector of the key word in the network content comprises:

acquiring at least one sentence containing the key words;

and for each sentence, generating a word position identifier and a sentence position identifier in a feature vector according to the position of the key word in the sentence and the position of the sentence in the text content.

5. The method of claim 4, wherein the step of calculating the difference value between the network content and the set of network addresses comprises:

calculating the difference value between the network content and the network content pointed by each network address in the network address set;

and calculating the difference value between the network content and the network address set according to the difference value between the network content and the network content pointed by each network address.

6. The method of claim 5, wherein the step of calculating the difference value between the network content and the network content pointed to by each network address in the set of network addresses comprises:

extracting the word identifiers in the feature vector group of the network content and in the feature vector group of the network content pointed to by the network address in the network address set;

for each of the word identifiers,

calculating, respectively, the feature value of the key word corresponding to the word identifier in the network content and in the network content pointed to by the network address in the network address set;

calculating the difference value of the key word between the two network contents according to the feature values of the key word in the two network contents; and

calculating the difference value of the two network contents according to the number of extracted key words and the difference value of each key word.
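As a non-limiting illustration, the comparison of two network contents in claim 6 might be sketched as follows. Each content is represented by a hypothetical mapping from word identifier to feature value. Taking the absolute difference per key word and averaging over the number of extracted key words are assumptions; the claim only requires that the result depend on the per-word differences and on the number of key words.

```python
def content_difference(values_a: dict, values_b: dict) -> float:
    """Difference value of two network contents, given per-word feature values."""
    # Word identifiers extracted from both feature vector groups.
    word_ids = set(values_a) | set(values_b)
    if not word_ids:
        return 0.0
    # A key word missing from one content has feature value 0 there.
    total = sum(abs(values_a.get(w, 0.0) - values_b.get(w, 0.0)) for w in word_ids)
    # Assumed combination: mean of the per-word differences.
    return total / len(word_ids)
```

Under these assumptions, identical contents give a difference value of 0, and contents that share no key words give the mean of all their feature values.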

7. The method of claim 6, wherein the step of calculating the feature value of the key word corresponding to the word identifier in the network content comprises:

searching the feature vector group of the network content for at least one feature vector containing the word identifier;

if such a feature vector exists, calculating the feature value of the key word corresponding to the word identifier in the network content according to the at least one feature vector; and

if not, setting the feature value of the key word corresponding to the word identifier in the network content to 0.

8. The method of claim 7, wherein the step of calculating the feature value of the key word corresponding to the word identifier in the network content according to the at least one feature vector comprises:

for each of the at least one feature vector, calculating the feature value of the key word in the feature vector according to the word position identifier and the sentence position identifier; and

calculating the feature value of the key word in the network content according to the feature value of the key word in each feature vector.
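As a non-limiting illustration, the feature value calculation of claims 7 and 8 might be sketched as follows. Feature vectors are modelled as hypothetical `(word_id, word_pos, sentence_pos)` tuples. The per-vector formula `1 / (word_pos * sentence_pos)`, which weights early occurrences more heavily, and the summation over vectors are both assumptions chosen for the sketch.

```python
def vector_value(word_pos: int, sentence_pos: int) -> float:
    """Assumed per-vector feature value: earlier positions count for more."""
    return 1.0 / (word_pos * sentence_pos)

def feature_value(word_id: str, vectors: list) -> float:
    """Feature value of a key word in a network content, from its feature vector group."""
    # Search the feature vector group for vectors containing the word identifier.
    matches = [(wp, sp) for wid, wp, sp in vectors if wid == word_id]
    if not matches:
        return 0.0  # no such feature vector: the feature value is 0
    # Assumed combination: sum of the per-vector values.
    return sum(vector_value(wp, sp) for wp, sp in matches)
```

For example, a key word appearing as the third word of both the first and the second sentence would score 1/3 + 1/6 = 0.5 under these assumptions.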

9. The method according to any one of claims 1-8, wherein the method further comprises the step of:

if the difference value between the network content and the network address set is greater than the difference threshold of the network address set, determining that the network address and the network address set do not point to similar network content.

10. The method of claim 9, wherein the method further comprises the steps of:

if the network address and every network address set in the data storage device do not point to similar network content, creating a new network address set in the data storage device;

storing the network address, the network content pointed to by the network address and the feature vector group of the network content into the new network address set; and

setting the difference threshold of the new network address set to the initial difference threshold.
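As a non-limiting illustration, the set decision of claims 5, 9 and 10 might be sketched as follows. `AddressSet`, `assign` and the value of `INITIAL_THRESHOLD` are hypothetical. Each content is represented by a mapping from word identifier to feature value, the difference between a content and a set is assumed to be the mean of its differences to the set's members, and the sketch stops at the first matching set for brevity.

```python
from dataclasses import dataclass, field

INITIAL_THRESHOLD = 0.3  # assumed initial difference threshold

@dataclass
class AddressSet:
    threshold: float = INITIAL_THRESHOLD
    # Maps each network address to the feature values of the content it points to.
    members: dict = field(default_factory=dict)

def pair_difference(a: dict, b: dict) -> float:
    """Assumed difference of two contents: mean absolute difference per key word."""
    ids = set(a) | set(b)
    if not ids:
        return 0.0
    return sum(abs(a.get(w, 0.0) - b.get(w, 0.0)) for w in ids) / len(ids)

def set_difference(values: dict, address_set: AddressSet) -> float:
    """Assumed difference of a content and a set: mean difference to every member."""
    diffs = [pair_difference(values, m) for m in address_set.members.values()]
    return sum(diffs) / len(diffs) if diffs else float("inf")

def assign(address: str, values: dict, sets: list) -> AddressSet:
    """Store the address in a similar set, or create a new set if none is similar."""
    for candidate in sets:
        if set_difference(values, candidate) <= candidate.threshold:
            candidate.members[address] = values
            return candidate
    new_set = AddressSet()  # starts with the initial difference threshold
    new_set.members[address] = values
    sets.append(new_set)
    return new_set
```

Calling `assign` repeatedly on an initially empty list therefore grows the collection of sets only when an address points to content that is dissimilar to every existing set.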

11. The method of claim 10, wherein the method further comprises the steps of:

after storing the network address, the network content pointed to by the network address, and the set of feature vectors for the network content to a set of network addresses,

updating a difference threshold for the set of network addresses, comprising:

for every pair of network addresses in the network address set, calculating the difference value of the network contents pointed to by the two network addresses; and

calculating the difference threshold of the network address set according to the difference value of the network contents pointed to by every pair of network addresses.

12. The method of claim 11, wherein the step of calculating the difference threshold of the set of network addresses according to at least the difference value of the network contents pointed to by every two network addresses further comprises:

calculating a reference difference threshold value of the network address set according to the difference value of the network contents pointed by every two network addresses;

calculating a difference threshold of the network address set according to the initial difference threshold, the reference difference threshold and the respective weights; wherein

the weight of the initial difference threshold decreases and the weight of the reference difference threshold increases as the number of network addresses in the network address set increases.
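As a non-limiting illustration, the threshold update of claims 11 and 12 might be sketched as follows. The function name `updated_threshold`, the choice of the largest pairwise difference as the reference difference threshold, and the weights `1/n` and `1 - 1/n` are all assumptions; the claims only require that the first weight fall and the second rise as the number of network addresses `n` grows.

```python
from itertools import combinations

def updated_threshold(contents: list, diff_fn, initial_threshold: float = 0.3) -> float:
    """Blend the initial threshold with a reference threshold derived from the set."""
    n = len(contents)
    if n < 2:
        return initial_threshold  # no pairs yet: keep the initial threshold
    # Difference value for every pair of contents in the set.
    pair_diffs = [diff_fn(a, b) for a, b in combinations(contents, 2)]
    reference = max(pair_diffs)   # assumed reference difference threshold
    w_initial = 1.0 / n           # decreases as the set grows
    w_reference = 1.0 - w_initial # increases as the set grows
    return w_initial * initial_threshold + w_reference * reference
```

With these weights a newly created set relies entirely on the initial threshold, while a large set is governed almost entirely by the spread actually observed among its members.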

13. A network address analysis method adapted to be executed in a network address analysis system comprising a task allocation device, a data storage device and a network address analysis device, the data storage device storing an analysis record of network addresses, and a set of network addresses comprising at least one network address pointing to similar network content, the method comprising the steps of:

receiving a network address to be analyzed via a task allocation device;

via a network address analysis device, performing the method of any one of claims 1-12 to store the network address in a set of network addresses of a data storage device that point to similar network content.

14. The method of claim 13, wherein the method further comprises the steps of:

after receiving the network address to be analyzed via the task allocation device,

querying whether an analysis record for the network address exists at the data storage device;

if not, sending the network address to the network address analysis device; and

if the difference between the processing time in the analysis record and the current time exceeds the time threshold, sending the network address to the network address analysis device.
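As a non-limiting illustration, the check performed before dispatching a network address in claim 14 might be sketched as follows. The function name `needs_analysis` is hypothetical, and the analysis record is reduced to its processing time, with `None` standing for a missing record.

```python
from datetime import datetime, timedelta
from typing import Optional

def needs_analysis(last_processed: Optional[datetime],
                   now: datetime,
                   time_threshold: timedelta) -> bool:
    """Decide whether a network address should be sent to the analysis device."""
    if last_processed is None:
        return True  # no analysis record exists for this address
    # A record exists: re-analyse only if it is older than the time threshold.
    return now - last_processed > time_threshold
```

An address analysed recently is thus skipped, which avoids repeating work for addresses that are submitted frequently.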

15. A network address analysis device connected to a data storage device, the data storage device storing at least one network address set, each network address set including at least one network address pointing to similar network content and a feature vector group of the network content to which each of the at least one network address points, the device comprising:

a content acquisition module adapted to acquire the network content pointed to by a received network address;

a word extraction module adapted to:

acquire text content contained in the network content;

for each sentence, extract at least one key word of the sentence;

a feature generation module adapted to:

for each key word, generate a feature vector of the key word in the network content according to the position of the key word in the text content, so as to form a feature vector group of the network content from the feature vector of the at least one key word, wherein the feature vector comprises a word position identifier and a sentence position identifier of the key word, the word position identifier is generated according to the position of the key word in the sentence, and the sentence position identifier is generated according to the position of the sentence in the text content; and

a set decision module adapted to:

for each set of network addresses in the data storage device,

16. The apparatus of claim 15, wherein the word extraction module is adapted to:

acquiring pictures contained in the network content; and

identifying text content included in the picture.

17. The device of claim 15, wherein the word extraction module is coupled to a part-of-speech query device, the word extraction module further adapted to:

dividing the sentence into at least one word;

18. The apparatus of claim 15, wherein the feature vector comprises a word identifier, a word position identifier, and a sentence position identifier of a key word, the word identifier uniquely identifying the key word, the feature generation module adapted to:

acquiring at least one sentence containing the key words;

19. The apparatus of claim 18, wherein the set decision module is adapted to:

20. The apparatus of claim 19, wherein the set decision module is further adapted to:

for each of the word identifiers,

21. The apparatus of claim 20, wherein the set decision module is further adapted to:

22. The apparatus of claim 21, wherein the set decision module is further adapted to:

23. The apparatus of any one of claims 15-22, wherein the set decision module is further adapted to:

24. The apparatus of claim 23, wherein the set decision module is further adapted to:

25. The apparatus of claim 24, wherein the set decision module is further adapted to:

updating a difference threshold for the set of network addresses, comprising:

26. The apparatus of claim 25, wherein the set decision module is further adapted to:

27. A network address analysis system comprising a data storage device, a task allocation device, and a network address analysis device as claimed in any one of claims 15 to 26, wherein

the data storage device is adapted to store an analysis record of network addresses, and a set of network addresses comprising at least one network address pointing to similar network content;

the task allocation device is adapted to receive a network address to be analyzed and send the network address to the network address analysis device; and

the network address analysis device is adapted to process the network address and store the network address in a set of network addresses in the data storage device that point to similar network content.

28. The system of claim 27, wherein the task allocation device is further adapted to:

after receiving the network address to be analyzed,

if not, send the network address to the network address analysis device; and

if the difference between the processing time of the network address in the analysis record and the current time exceeds the time threshold, send the network address to the network address analysis device.