CN113946365A

CN113946365A - Page identification method and device, computer equipment and storage medium

Info

Publication number: CN113946365A
Application number: CN202010689776.4A
Authority: CN
Inventors: 白帆
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2020-07-17
Filing date: 2020-07-17
Publication date: 2022-01-18

Abstract

The application relates to a page identification method and apparatus, a computer device, and a storage medium. The method comprises: acquiring target page original features extracted from the page code of a target page; performing feature dimension reduction on the target page original features to obtain dimension-reduced page features; querying the page identifiers corresponding to the page features to obtain suspected similar page identifiers; acquiring the page original features corresponding to the suspected similar page identifiers; matching each of these page original features against the target page original features; and determining the pages whose page original features pass the matching to be pages similar to the target page. The method can improve the efficiency of page identification.

Description

Page identification method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a page identification method and apparatus, a computer device, and a storage medium.

Background

With the rapid development of front-end technology, every platform hosts a wide variety of pages; for example, an applet (mini program) platform has a large number of applet pages. In many scenarios, pages similar to a target page need to be screened out from a large number of pages. For example, a massive collection of pages often contains illegally copied pages, and the platform generally needs to identify, among them, the copies that are similar to a target page.

In the traditional method, similar pages are identified by sequential matching; that is, the page code of the target page is matched one by one against the massive number of page codes in a code library. Each single match is relatively time-consuming on average, and the number of page codes in a code library is very large, usually in the hundreds of millions, so identifying pages by sequential matching is inefficient.

Disclosure of Invention

In view of the above, it is necessary to provide a page identification method, apparatus, computer device and storage medium capable of improving efficiency.

A page identification method, characterized in that the method comprises:

acquiring target page original features extracted from the page code of a target page;

performing feature dimension reduction on the target page original features to obtain dimension-reduced page features;

querying the page identifiers corresponding to the page features to obtain suspected similar page identifiers;

acquiring the page original features corresponding to the suspected similar page identifiers;

matching each of the page original features against the target page original features;

and determining the pages whose page original features pass the matching to be pages similar to the target page.

A page identification apparatus, the apparatus comprising:

the feature acquisition module is used for acquiring target page original features extracted from the page code of a target page;

the feature dimension reduction module is used for performing feature dimension reduction on the target page original features to obtain dimension-reduced page features;

the index query module is used for querying the page identifiers corresponding to the page features to obtain suspected similar page identifiers;

the matching module is used for acquiring the page original features corresponding to the suspected similar page identifiers; matching each of the page original features against the target page original features; and determining the pages whose page original features pass the matching to be pages similar to the target page.

In an embodiment, the index query module is further configured to query, by using the page feature as an index item, at least one page identifier corresponding to the page feature after the dimensionality reduction according to a preset mapping relationship between the page feature and the page identifier, so as to obtain a suspected similar page identifier.

In one embodiment, the index query module is further configured to locate a position corresponding to the page feature in a preset bitmap; each position in the bitmap uniquely records a page feature; based on a pre-constructed index, performing reverse index query by taking the positioned position as an index item to obtain at least one page identifier corresponding to the positioned position and obtain a suspected similar page identifier; wherein the index comprises a mapping relation between positions in the bitmap and page identifications; the page identifier having a mapping relation with the position refers to a page identifier corresponding to a page code having the page feature recorded in the position.

In one embodiment, the apparatus further comprises:

the index building module is used for extracting the features of the page codes in the page code library; performing feature dimension reduction processing on the extracted features of each page code; recording each page feature after dimensionality reduction in a corresponding position in a bitmap; and aiming at the position in the bitmap, determining a page code with the page characteristics recorded by the position, establishing a mapping relation between the position and a page identifier corresponding to the determined page code, and generating an index.

In one embodiment, the index building module is further configured to block the page identifier corresponding to the determined page code to obtain a plurality of page identifier blocks; connecting a plurality of page identification blocks to generate a page identification chain; and establishing a mapping relation between the position and the page identification chain, and generating an index.

In one embodiment, the apparatus further comprises:

the updating module is used for extracting the features of the updated page codes and performing feature dimension reduction processing when the page codes in the page code library are updated; updating the page features subjected to dimensionality reduction on the updated page code to the corresponding position in the bitmap; and in the index, establishing a mapping relation between the updated position and the page identifier corresponding to the updated page code.

In an embodiment, the index query module is further configured to perform reverse index query with the located position as an index item based on a pre-constructed index, and determine a page identifier chain corresponding to the located position; reading the page identifier in the page identifier block from the first page identifier block in the page identifier chain, and iteratively executing the step of reading the page identifier in the page identifier block for the next page identifier block in the page identifier chain until the iteration is stopped after the page identifier in the last page identifier block in the page identifier chain is read; and the page identifier in each page identifier block in the page identifier chain is a page identifier corresponding to a page code having the page feature recorded in the located position.
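The page identifier chain described above can be sketched as a singly linked list of fixed-size blocks. This is only a minimal illustration under stated assumptions: the class name PageIdBlock, the functions build_chain and read_chain, and the block size of 3 are hypothetical and do not appear in the application.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class PageIdBlock:
    """One block of page identifiers plus a link to the next block (hypothetical layout)."""
    page_ids: List[str]
    next: Optional["PageIdBlock"] = None


def build_chain(page_ids: List[str], block_size: int = 3) -> Optional[PageIdBlock]:
    """Split the identifiers into blocks and connect the blocks into a chain."""
    head: Optional[PageIdBlock] = None
    tail: Optional[PageIdBlock] = None
    for start in range(0, len(page_ids), block_size):
        block = PageIdBlock(page_ids[start:start + block_size])
        if head is None:
            head = block          # first block becomes the head of the chain
        else:
            tail.next = block     # link the previous block to the new one
        tail = block
    return head


def read_chain(head: Optional[PageIdBlock]) -> Iterator[str]:
    """Read identifiers block by block, stopping after the last block."""
    block = head
    while block is not None:
        yield from block.page_ids
        block = block.next


ids = [f"page_{i}" for i in range(7)]
chain = build_chain(ids)
recovered = list(read_chain(chain))
```

In this sketch the index would map a bitmap position to the head block of the chain, so appending identifiers for a position only requires linking a new block rather than rewriting one large list.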

In an embodiment, the feature dimension reduction module is further configured to convert the feature dimension of the feature value and the range of the feature value into a preset dimension range and a preset feature value range, respectively, to obtain the page feature after dimension reduction.

In one embodiment, the feature dimension reduction module is further configured to perform a hash operation on each feature value in the original features of the target page according to at least one hash function, so as to obtain a hash value of each feature value in each hash operation; respectively selecting a characteristic value corresponding to the minimum hash value obtained by each hash operation from the characteristic values of multiple dimensions; and determining the page features of the original features of the target page after dimension reduction according to the selected feature values.

In one embodiment, the target page is a child application page; the page identifier is an identifier of a sub application page; the sub-application page is a page provided by a sub-application; the child application is a lightweight application that runs in the environment provided by the original parent application.

A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:

A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:

With the above page identification method, apparatus, computer device and storage medium, feature dimension reduction is performed on the target page original features, and suspected similar page identifiers are queried according to the dimension-reduced page features. Because dimension reduction greatly reduces the complexity of the page features, an inverted index query can be performed directly on the dimension-reduced page features, so suspected similar page identifiers can be found quickly; this realizes coarse matching screening. The page original features corresponding to the suspected similar page identifiers are then matched against the target page original features; that is, the suspected similar pages screened out by coarse matching undergo high-precision screening through fine matching on the original features, and the pages whose page original features pass the matching are determined to be pages similar to the target page. By combining coarse matching through dimension reduction and inverted index query with fine matching on page original features, pages similar to the target page can be identified quickly and accurately from a massive number of pages.

Drawings

FIG. 1 is a diagram of an application environment of a page identification method in one embodiment;

FIG. 2 is a flowchart illustrating a page identification method according to an embodiment;

FIG. 3 is a diagram illustrating feature dimension reduction in one embodiment;

FIG. 4 is a diagram of building an index in one embodiment;

FIG. 5 is a diagram of a system architecture in one embodiment;

FIG. 6 is a simplified diagram of a feature update and matching process in one embodiment;

FIG. 7 is a block diagram of a page identification apparatus in one embodiment;

FIG. 8 is a block diagram of a page identification apparatus in another embodiment;

FIG. 9 is a block diagram of a page identification apparatus in still another embodiment;

FIG. 10 is a diagram showing an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The page identification method provided by the application can be applied to the application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 104 may be implemented by an independent server or a server cluster formed by a plurality of servers.

A technician may select a target page via the terminal 102, and the terminal 102 notifies the server 104 of the selected target page. The server 104 may acquire the target page original features extracted from the page code of the target page; perform feature dimension reduction on the target page original features to obtain dimension-reduced page features; query the page identifiers corresponding to the page features to obtain suspected similar page identifiers; acquire the page original features corresponding to the suspected similar page identifiers; match each of the page original features against the target page original features; and determine the pages whose page original features pass the matching to be pages similar to the target page. Further, the server 104 may notify the terminal 102 of the identified pages.

In one embodiment, as shown in fig. 2, a page identification method is provided, which is described by taking the application of the method to the server in fig. 1 as an example, and includes the following steps:

Step 202, acquiring the target page original features extracted from the page code of the target page.

The target page is a reference page to be judged whether a similar page exists or not. That is, a page similar to the target page is searched from the page code library with the target page as a reference.

In one embodiment, the target page may be a page for which it is to be determined whether it has been plagiarized. It can be understood that, in the application scenario of identifying plagiarized pages, if pages similar to the target page exist, the target page has been plagiarized, and the pages similar to the target page are abnormal pages suspected of plagiarism.

In one embodiment, the target page may be a sub-application page. The sub-application page is a page provided by the sub-application. The child application is a lightweight application that can be implemented in an environment provided by the parent application and can be used without downloading and installing. The parent application is an application program bearing the child application and provides an environment for implementing the child application. The parent application is a native application. A native application is an application that can run directly on an operating system.

The target page original characteristic is the page original characteristic of the target page. The page original features refer to page features which are directly obtained after feature extraction is carried out on page codes and are not subjected to feature dimension reduction processing. It can be understood that the original features of the target page are page features which are extracted from the page code of the target page and are not subjected to feature dimension reduction processing.

In one embodiment, the server may obtain the target page original features directly from memory or from a storage system. Specifically, the server may perform feature extraction on each page code in the code library in advance to obtain the page original features of each page code, and either store them in local memory or send them to the storage system for storage. The server can then obtain the target page original features corresponding to the target page directly from the memory or the storage system.

In one embodiment, the server may also perform a feature extraction process on the page code of the target page to extract the original features of the target page.

Step 204, performing feature dimension reduction on the target page original features to obtain dimension-reduced page features.

The page feature after dimensionality reduction is the page feature after the feature dimensionality of the original feature of the target page is reduced to a preset dimensionality range.

In one embodiment, the dimension-reduced page feature is a page feature in which the feature dimension of the original feature of the target page is reduced to a preset dimension range, and the size of the feature value is reduced to a preset feature value range. That is, compared with the original feature of the target page, the feature dimension of the page feature after dimensionality reduction is reduced, and the range of the feature value is reduced.

In one embodiment, each page feature after dimensionality reduction is a fixed-length page feature. Namely, the page features after dimensionality reduction meet the preset length.

Specifically, the server may perform feature dimension reduction processing on the original features of the target page using a linear or non-linear function.

In one embodiment, the target page original features have feature values of a plurality of feature dimensions. In this embodiment, the performing feature dimension reduction processing on the original feature of the target page to obtain the page feature after dimension reduction includes: and respectively converting the characteristic dimension of the characteristic value and the range of the characteristic value into a preset dimension range and a preset characteristic value range to obtain the page characteristic after dimension reduction.

In one embodiment, the server may perform feature dimension reduction on the original features of the target page using a minimum hash value algorithm.

In one embodiment, the converting the characteristic dimension of the characteristic value and the range of the characteristic value into a preset dimension range and a preset characteristic value range, respectively, and obtaining the page characteristic after the dimension reduction includes: performing hash operation on each characteristic value in the original characteristics of the target page according to at least one hash function respectively to obtain the hash value of each characteristic value in each hash operation; respectively selecting a characteristic value corresponding to the minimum hash value obtained by each hash operation from the characteristic values of multiple dimensions; and determining the page features of the original features of the target page after dimension reduction according to the selected feature values.

It can be understood that, when there are multiple hash functions, the feature values corresponding to the minimum hash values obtained under each hash function together serve as the dimension-reduced page features of the target page original features.

When there is a single hash function, a hash operation is performed on each feature value in the target page original features using that hash function to obtain hash values, and a preset number of feature values are selected, in ascending order of hash value, as the dimension-reduced page features of the target page original features.
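The two minimum hash variants described above can be sketched as follows. This is a minimal illustration, not the implementation of the application: the function names, the use of seeded BLAKE2b as the family of hash functions, the string-valued features, and the default of four output values are all assumptions.

```python
import hashlib
from typing import Set, Tuple


def _hash(value: str, seed: int) -> int:
    """Deterministic 64-bit hash of a feature value; each seed acts as one hash function."""
    digest = hashlib.blake2b(
        value.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_multi(features: Set[str], num_hashes: int = 4) -> Tuple[str, ...]:
    """Multiple hash functions: for each one, keep the feature value with the minimum hash.

    Raises ValueError if `features` is empty.
    """
    return tuple(
        min(features, key=lambda f, s=seed: _hash(f, s))
        for seed in range(num_hashes)
    )


def minhash_single(features: Set[str], k: int = 4) -> Tuple[str, ...]:
    """Single hash function: keep the k feature values with the smallest hashes.

    Returns fewer than k values when the page has fewer than k features.
    """
    return tuple(sorted(features, key=lambda f: _hash(f, 0))[:k])


# Hypothetical original features of a page (variable number of values).
original = {"view", "button", "swiper", "text", "image", "navigator"}
reduced_multi = minhash_multi(original)
reduced_single = minhash_single(original, k=3)
```

In both variants the output length is bounded by a preset number regardless of how many feature values the page code yields, which is what makes the result usable as an index item.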

Step 206, querying the page identifiers corresponding to the page features to obtain suspected similar page identifiers.

In an embodiment, the server may use the reduced-dimension page feature as an index entry, and query a page identifier corresponding to the reduced-dimension page feature to obtain a suspected similar page identifier.

In one embodiment, step 206 includes: and inquiring at least one page identifier corresponding to the page features after dimension reduction according to a preset mapping relation between the page features and the page identifiers by taking the page features as index items to obtain suspected similar page identifiers.

The suspected similar page identifier is an identifier of a suspected similar page. It can be understood that the suspected similar pages are candidate pages, and then the final similar pages are identified from the candidate pages through steps 208-212.

In one embodiment, the page identification may be an identification of a sub-application page. The suspected similar page identification may be an identification of a sub-application page that is suspected similar.

Specifically, a mapping relationship between the dimensionality reduced page feature and the page identifier is pre-constructed in the server, and based on the mapping relationship, the server may query the page identifier corresponding to the dimensionality reduced page feature with the dimensionality reduced page feature as an index item to obtain a suspected similar page identifier. It should be noted that the page feature and the page identifier may be in a direct mapping relationship or an indirect mapping relationship (for example, a mapping relationship is established between the position of the page feature in the bitmap and the page identifier).

The same dimension-reduced page feature has a mapping relationship with one or more page identifiers. It can be understood that identical or similar page codes have identical or similar original features, and hence identical or similar dimension-reduced page features, so the page codes corresponding to page identifiers that map to the same page feature are similar to a large degree.

It can be understood that, because the content length of each page code may differ (for example, the content lengths of applet page codes vary widely), neither the number of feature dimensions in the extracted page original features nor the range of the feature values is controllable. Such raw data is poorly suited to querying (for example, to index construction) and would cost performance and a great deal of query time. Therefore, the server may perform feature dimension reduction on the page original features of each page code in the code library, so as to convert the number of feature dimensions and the feature value range into a controllable range (namely, the preset dimension range and the preset feature value range). A mapping relationship is then established between the dimension-reduced page features and the page identifiers corresponding to the page codes in the code library; querying based on this mapping relationship reduces search complexity and improves search efficiency.
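A direct mapping relationship of this kind can be sketched with an ordinary dictionary keyed by the dimension-reduced page feature. The sketch assumes the reduced features are already available as fixed-length tuples; the function names and the sample page identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Feature = Tuple[str, ...]


def build_mapping(reduced: Dict[str, Feature]) -> Dict[Feature, List[str]]:
    """Map each dimension-reduced page feature to every page identifier that has it."""
    mapping: Dict[Feature, List[str]] = defaultdict(list)
    for page_id, feature in reduced.items():
        mapping[feature].append(page_id)
    return dict(mapping)


def query_suspected(mapping: Dict[Feature, List[str]], target: Feature) -> List[str]:
    """Use the reduced feature as the index item; an unknown feature yields no candidates."""
    return list(mapping.get(target, []))


# Hypothetical reduced features for three pages in the code library.
reduced_features = {
    "page_1": ("button", "view"),
    "page_2": ("button", "view"),
    "page_3": ("map", "video"),
}
mapping = build_mapping(reduced_features)
suspected = query_suspected(mapping, ("button", "view"))
```

Because the key has a bounded length, a single dictionary lookup replaces a scan over the whole code library.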

In other embodiments, the server may also match the dimensionality-reduced page features corresponding to the target page with the dimensionality-reduced page features corresponding to each page in the page code library, and determine the page identifier corresponding to the matched dimensionality-reduced page features as the suspected similar page identifier. It can be understood that, because the feature dimension and the feature value range of the page features after the dimension reduction both conform to a controllable uniform range, matching the page features after the dimension reduction can also improve the matching efficiency compared with the conventional method in which matching is directly performed according to different page original features with different lengths, and thus the page recognition efficiency can be improved.

Step 208, acquiring the page original features corresponding to the suspected similar page identifiers.

It can be understood that the original page features corresponding to the suspected similar page identifiers refer to original page features of page codes corresponding to the suspected similar page identifiers.

In one embodiment, the server may directly obtain the page original features corresponding to each suspected similar page identifier from the memory. It can be understood that the server extracts the page original features of each page code in the code base, and then performs feature dimension reduction processing on each page original feature, thereby constructing a mapping relationship between the page features and the page identifiers after dimension reduction. Therefore, in the process of constructing the mapping relationship between the page features and the page identifiers after the dimension reduction, the server already stores the page original features of each page code in the memory, so that the page original features corresponding to each suspected similar page identifier can be directly obtained from the memory.

In another embodiment, the server may retrieve from the storage system the page original features corresponding to each suspected similar page identifier. It can be understood that the server may store the page original features of each page code in the storage system after extracting them. In this case, the server may obtain the page original features corresponding to each suspected similar page identifier from the storage system.

In other embodiments, the server may also perform feature extraction processing on the page code corresponding to each suspected similar page identifier, so as to extract the original feature of the page corresponding to each suspected similar page identifier. Here, the obtaining manner of the original features of the page corresponding to the suspected similar page identifier is not limited.

Step 210, matching each of the page original features against the target page original features.

It should be noted that the original features of each page in step 210 refer to original features of pages corresponding to the suspected similar page identifiers that are screened and queried.

Compared with the dimension-reduced page features, the page original features have more dimensions and a wider range of feature values, that is, more comprehensive feature information. Matching the page original features corresponding to the suspected similar page identifiers against the target page original features is therefore a high-precision, fine matching process. Fine matching means that the complete page features are matched in full.

It can be understood that querying suspected similar page identifiers according to the dimension-reduced page features is coarse matching. That is, coarse matching is first performed through steps 202 to 206 to preliminarily screen out suspected similar pages; fine matching is then performed through steps 208 to 212 to screen out, from the suspected similar pages, the pages finally determined to be similar to the target page.

Specifically, the server may match the page original features corresponding to each suspected similar page identifier against the target page original features one by one. Because the suspected similar page identifiers result from the preliminary screening query, i.e., one round of coarse-matching screening has already been performed, matching their page original features against the target page original features one by one avoids the problems of excessive data volume and excessive time consumption.

Step 212, determining the pages whose page original features pass the matching to be pages similar to the target page.

Specifically, each suspected similar page identifier obtained in step 206 has corresponding page original features, and the result of matching these page original features against the target page original features is either a pass or a fail. When the matching passes, the page corresponding to the page original features is similar to the target page, and the page is determined to be a page similar to the target page. When the matching fails, the page corresponding to the page original features is not similar to the target page, and the page is determined not to be a page similar to the target page.

According to the page identification method above, feature dimension reduction is performed on the target page original features, and suspected similar page identifiers are queried according to the dimension-reduced page features. Because dimension reduction greatly reduces the complexity of the page features, an inverted index query can be performed directly on the dimension-reduced page features, so suspected similar page identifiers can be found quickly; this realizes coarse matching screening. The page original features corresponding to the suspected similar page identifiers are then matched against the target page original features; that is, the suspected similar pages screened out by coarse matching undergo high-precision screening through fine matching on the original features, and the pages whose page original features pass the matching are determined to be pages similar to the target page. By combining coarse matching through dimension reduction and inverted index query with fine matching on page original features, pages similar to the target page can be identified quickly and accurately from a massive number of pages.
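Steps 202 to 212 can be sketched end to end as a coarse query followed by fine matching. The sketch rests on several assumptions that the application does not fix: original features are modelled as sets of strings, dimension reduction keeps the two feature values with the smallest MD5 hashes, fine matching uses Jaccard similarity, and a match passes at a hypothetical threshold of 0.8.

```python
import hashlib
from typing import Dict, List, Set, Tuple


def reduce_features(features: Set[str], k: int = 2) -> Tuple[str, ...]:
    """Stand-in dimension reduction: the k feature values with the smallest MD5 hashes."""
    ordered = sorted(features, key=lambda f: hashlib.md5(f.encode("utf-8")).hexdigest())
    return tuple(ordered[:k])


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two feature sets (two empty sets count as identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def build_index(library: Dict[str, Set[str]]) -> Dict[Tuple[str, ...], List[str]]:
    """Pre-build the mapping from reduced feature to page identifiers."""
    index: Dict[Tuple[str, ...], List[str]] = {}
    for page_id, features in library.items():
        index.setdefault(reduce_features(features), []).append(page_id)
    return index


def find_similar_pages(target: Set[str], library, index, threshold: float = 0.8) -> List[str]:
    key = reduce_features(target)                        # step 204: dimension reduction
    suspected = index.get(key, [])                       # step 206: coarse query
    similar = []
    for page_id in suspected:
        original = library[page_id]                      # step 208: fetch original features
        if jaccard(original, target) >= threshold:       # step 210: fine matching
            similar.append(page_id)                      # step 212: judged similar
    return similar


# Hypothetical code library: page identifier -> page original features.
library = {
    "page_copy": {"view", "button", "swiper", "text"},
    "page_other": {"canvas", "map", "video", "audio"},
}
index = build_index(library)
target_features = {"view", "button", "swiper", "text"}
similar = find_similar_pages(target_features, library, index)
```

Only the candidates returned by the coarse query reach the Jaccard comparison, so the expensive comparison on full features runs on a small subset of the library.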

In one embodiment, the step of querying at least one page identifier corresponding to the dimensionality-reduced page feature according to a preset mapping relationship between the page feature and the page identifier by using the page feature as an index entry to obtain the suspected similar page identifier includes: positioning the corresponding position of the page feature in a preset bitmap; each position in the bitmap uniquely records a page feature; based on the pre-constructed index, the positioned position is used as an index item to perform reverse index query, at least one page identifier corresponding to the positioned position is obtained, and suspected similar page identifiers are obtained.

A bitmap is a data structure that represents a dense set over a finite domain, in which each element appears at least once and no other data is associated with the elements. In the embodiments of the present application, each position in the bitmap uniquely records one dimension-reduced page feature corresponding to the page codes in the code library, and different dimension-reduced page features are recorded at different positions in the bitmap.

It can be understood that the server establishes an index in advance, and the index includes a mapping relationship between positions in the bitmap and page identifiers. The page identifiers that have a mapping relationship with a position are those corresponding to the page codes having the page feature recorded at that position. That is, the page identifiers corresponding to page codes that share the same page feature are mapped to the position of that page feature in the bitmap.

Then, the server may locate, in a pre-established bitmap, a position corresponding to the reduced-dimension page feature. The server may perform reverse index query in the pre-constructed index by using the located position as an index entry, and find out at least one page identifier corresponding to the located position to obtain a suspected similar page identifier.
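The bitmap lookup followed by the reverse index query can be sketched as below. How a page feature is assigned its position is not specified here, so the sketch assumes the position is derived by hashing the reduced feature modulo a hypothetical bitmap size; with that assumption two different features can occasionally share a position, which only adds extra candidates for fine matching to reject, whereas the uniqueness described above would require a collision-free assignment. The class and function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple

BITMAP_BITS = 1 << 16  # hypothetical bitmap size


def position_of(feature: Tuple[str, ...]) -> int:
    """Assumed position function: hash of the reduced feature, modulo the bitmap size."""
    digest = hashlib.sha1("\x1f".join(feature).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % BITMAP_BITS


class BitmapIndex:
    def __init__(self) -> None:
        self.bitmap = bytearray(BITMAP_BITS // 8)                # one bit per position
        self.index: Dict[int, List[str]] = defaultdict(list)    # position -> page identifiers

    def add(self, page_id: str, feature: Tuple[str, ...]) -> None:
        """Record the feature in the bitmap and map its position to the page identifier."""
        pos = position_of(feature)
        self.bitmap[pos // 8] |= 1 << (pos % 8)
        self.index[pos].append(page_id)

    def query(self, feature: Tuple[str, ...]) -> List[str]:
        """Locate the position, then look up the page identifiers mapped to it."""
        pos = position_of(feature)
        if not self.bitmap[pos // 8] & (1 << (pos % 8)):
            return []  # no page in the library has this feature
        return list(self.index[pos])


idx = BitmapIndex()
shared = ("button", "view")
idx.add("page_1", shared)
idx.add("page_2", shared)
idx.add("page_3", ("map", "video"))
suspected = idx.query(shared)
```

The bit test lets a query for a feature absent from the library return immediately without touching the index.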

In an embodiment, the server may compare the page features recorded at each position in the bitmap with the dimensionality-reduced page features corresponding to the target page, determine a collision rate between the page features recorded at each position and the dimensionality-reduced page features corresponding to the target page, and select a position corresponding to the page feature with the largest collision rate as a position corresponding to the dimensionality-reduced page features corresponding to the target page in the bitmap.

This is illustrated with reference to Fig. 3, a schematic diagram of dimension reduction by the minimum hash algorithm. Region 302 of Fig. 3 shows the conventional method of directly comparing the similarity of page original features, in which the page original features are all variable-length fingerprints. To compare page A with page B, the conventional method computes the similarity between the variable-length fingerprint a of page A and the variable-length fingerprint b of page B with the Jaccard similarity algorithm. In the scheme of the present application, by contrast, the page original features are first reduced in dimension with the minimum hash algorithm (Min-hash), and the collision rate is then calculated on the dimension-reduced page features. In Fig. 3, the fixed-length minimum hash fingerprint a' is the dimension-reduced page feature of the variable-length fingerprint a, and the fixed-length minimum hash fingerprint b' is the dimension-reduced page feature of the variable-length fingerprint b. Assuming that the target page is page A and that the bitmap records the dimension-reduced page feature of page B, the fixed-length minimum hash fingerprint a' of page A may be compared with the fixed-length minimum hash fingerprint b' of page B recorded in the bitmap, and the collision rate between a' and b' is calculated.
The minimum hash fingerprint with the largest collision rate is the one most similar to the fixed-length minimum hash fingerprint a'; the position at which it is recorded is therefore the position in the bitmap corresponding to the dimension-reduced page feature of the target page A.
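The two comparisons discussed with reference to Fig. 3 can be sketched in Python as follows. The sketch assumes that a variable-length fingerprint is a set of feature values and that a fixed-length minimum hash fingerprint is a list of equal length; the function names `jaccard`, `collision_rate`, and `best_position` and the sample values are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two variable-length fingerprints (as sets)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def collision_rate(sig_a, sig_b):
    """Fraction of positions at which two fixed-length fingerprints agree."""
    if len(sig_a) != len(sig_b) or not sig_a:
        raise ValueError("fingerprints must be non-empty and of equal length")
    hits = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return hits / len(sig_a)


def best_position(target_sig, recorded):
    """Return the bitmap position whose recorded fingerprint collides most."""
    return max(recorded, key=lambda pos: collision_rate(target_sig, recorded[pos]))


# Hypothetical fingerprints recorded at three bitmap positions.
recorded = {0: [3, 9, 5, 7], 1: [3, 1, 4, 2], 2: [8, 8, 8, 8]}
target = [3, 1, 4, 7]
print(best_position(target, recorded))
```

The collision rate compares a fixed number of slots regardless of how long the original fingerprints are, which is why the coarse comparison is cheaper than computing the Jaccard similarity on the variable-length originals.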

In this embodiment, the lookup is performed through the bitmap structure, so the time complexity of the lookup can be kept at O(1), which improves query efficiency.

In one embodiment, the method further includes: extracting features from the page codes in the page code library; performing feature dimension reduction on the features extracted from each page code; recording each dimension-reduced page feature at the corresponding position in the bitmap; and, for each position in the bitmap, determining the page codes that have the page feature recorded at that position, establishing a mapping relationship between the position and the page identifiers corresponding to the determined page codes, and generating the index.

The page identifier corresponding to the page code refers to a unique identifier of a page represented by the page code.

Specifically, for each position in the bitmap, the server may determine the page codes that have the dimension-reduced page feature recorded at that position, that is, the page codes from which the page original features underlying the recorded page feature were extracted. It will be appreciated that there may be multiple sets of page code having the page feature recorded at a given position, where one set of page code characterizes one page. The server may establish a mapping relationship between the position and the page identifiers corresponding to the determined page codes, and generate the index. That is, the same position may correspond to one or more page identifiers.
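The index construction steps can be sketched in Python as follows. The feature extractor and the dimension reduction used here are deliberately simple placeholders (whitespace tokens and a single minimum hash folded into the bitmap range), chosen only to make the sketch runnable; they are assumptions, not the extraction or reduction methods of the application. All names are hypothetical.

```python
import hashlib

BITMAP_SIZE = 64


def extract_features(page_code):
    """Placeholder extractor: the set of whitespace-separated tokens."""
    return set(page_code.split())


def reduce_feature(features):
    """Placeholder reduction: smallest stable token hash, folded into range."""
    hashes = [int(hashlib.md5(t.encode("utf-8")).hexdigest(), 16) for t in features]
    return min(hashes) % BITMAP_SIZE


def build_index(code_library):
    """Build (bitmap, index) from a dict mapping page identifier to page code."""
    bitmap = [False] * BITMAP_SIZE
    index = {}  # bitmap position -> page identifiers sharing that feature
    for page_id, code in code_library.items():
        features = extract_features(code)
        if not features:
            continue  # nothing to record for an empty page code
        pos = reduce_feature(features)
        bitmap[pos] = True
        index.setdefault(pos, []).append(page_id)
    return bitmap, index


library = {
    "p1": "view text button",
    "p2": "button view text",
    "p3": "canvas image slider",
}
bitmap, index = build_index(library)
print(index)
```

Pages `p1` and `p2` contain the same tokens, so they reduce to the same feature and share one bitmap position, which illustrates how a single position can map to several page identifiers.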

It should be noted that the page identifiers corresponding to the same position in the bitmap may be located in a single block, or may be located in different blocks that are linked together in a chain structure. That is, a position in the bitmap may correspond to one block that holds all of its page identifiers. A position in the bitmap may alternatively correspond to a page identifier chain, in which each block holds some of the page identifiers corresponding to the position, and the page identifiers of all blocks in the chain together make up all page identifiers corresponding to the position.

In the embodiment, the bitmap structure is used for recording the reduced-dimension page features corresponding to each page code in the page code library, and then rapid search can be performed based on the bitmap structure, so that the search and query efficiency is improved.

In one embodiment, establishing a mapping relationship between the location and a page identifier corresponding to the determined page code, and generating the index includes: partitioning the page identification corresponding to the determined page code to obtain a plurality of page identification blocks; connecting a plurality of page identification blocks to generate a page identification chain; and establishing a mapping relation between the position and the page identification chain, and generating an index.

A page identifier block is a block structure for storing page identifiers. A page identifier chain is a chain structure comprising a plurality of page identifier blocks. The union of the page identifiers in all page identifier blocks on the chain is the complete set of page identifiers corresponding to the position. A page identifier corresponding to a position is the page identifier of a page code that has the page feature recorded at that position. That is, every page identifier in every page identifier block corresponds to a page code having the page feature recorded at the located position.

In one embodiment, different page identifier blocks may contain the same or different numbers of page identifiers.

In one embodiment, the server may establish a mapping relationship between information headers corresponding to the positions in the bitmap and corresponding page identifier chains, and generate the index.

Fig. 4 is a schematic diagram of the construction of the index in one embodiment. Referring to Fig. 4, different positions in the bitmap record different dimension-reduced page features. The information header of each position has a corresponding page identifier chain. Each page identifier chain has a plurality of page identifier blocks, and each page identifier block contains a plurality of page identifiers. Taking the first position 402 as an example, assume that it corresponds to 100 page identifiers, which are divided into 2 page identifier blocks of 50 page identifiers each. It should be noted that different information headers in Fig. 4 correspond to different page identifiers, and the numbers in "page 1", ..., "page N+50" and so on only indicate how many pages there are. For example, the "page 1" entries under the information headers 8 and 6 in the bitmap do not represent the same page.
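The partitioning in the Fig. 4 example can be sketched in Python as follows. The sketch models a page identifier chain as an ordered list of blocks and keys the index by a header name; `partition`, `header-402`, and the block size of 50 are illustrative assumptions taken from or modeled on the example, not a prescribed layout.

```python
def partition(page_ids, block_size):
    """Split page identifiers into consecutive blocks of at most block_size."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [page_ids[i:i + block_size] for i in range(0, len(page_ids), block_size)]


# 100 hypothetical page identifiers mapped to the first position (402).
ids_at_402 = [f"page-{n}" for n in range(1, 101)]

# The index maps the information header of a position to its chain of blocks.
index = {"header-402": partition(ids_at_402, 50)}

chain = index["header-402"]
print(len(chain), [len(block) for block in chain])
```

Because the last block simply takes whatever identifiers remain, blocks in one chain may end up with different sizes when the total is not a multiple of the block size.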

In an embodiment, performing, based on the pre-constructed index, the inverted index query with the located position as the index entry to obtain at least one page identifier corresponding to the located position and thereby obtain the suspected similar page identifier includes: performing, based on the pre-constructed index, the inverted index query with the located position as the index entry, and determining the page identifier chain corresponding to the located position; and reading the page identifiers in the first page identifier block of the page identifier chain, then iteratively performing the step of reading the page identifiers for the next page identifier block in the chain, and stopping the iteration once the page identifiers in the last page identifier block of the chain have been read.

Specifically, the server may perform reverse index query with the located position as an index item based on a pre-constructed index, and determine a page identification chain corresponding to the located position.

In one embodiment, the server may query the header corresponding to the located location based on a pre-constructed index and determine the body of information corresponding to the header. The information body comprises a page identification chain.

The server may read the page identifier in the page identifier block from the first page identifier block in the page identifier chain, and after reading the page identifier in one page identifier block, continue to read the page identifier in the next page identifier block in the page identifier chain, so as to iteratively read the page identifier in each page identifier block in the page identifier chain, thereby obtaining all page identifiers corresponding to the position.

This is again illustrated with reference to Fig. 4. Assuming that position 402 is located, the information header corresponding to position 402 may be queried first, and the information body corresponding to that header is then determined; that is, the page identifier chain L corresponding to position 402 is determined. The 50 page identifiers in the first page identifier block on the chain L are read first, followed by the 50 page identifiers in the second page identifier block.
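The block-by-block read can be sketched in Python as follows. Here each block holds a reference to the next block, which is one possible way to realize the chain; the class `Block`, the function `read_chain`, and the sample identifiers are hypothetical.

```python
class Block:
    """A page identifier block that links to the next block in the chain."""

    def __init__(self, page_ids, next_block=None):
        self.page_ids = page_ids
        self.next_block = next_block


def read_chain(first_block):
    """Read every block in order, copying each block's identifiers in bulk."""
    result = []
    block = first_block
    while block is not None:       # stop after the last block has been read
        result.extend(block.page_ids)
        block = block.next_block
    return result


# Chain L for position 402: two blocks of 50 page identifiers each.
second = Block([f"page-{n}" for n in range(51, 101)])
first = Block([f"page-{n}" for n in range(1, 51)], second)
index = {402: first}

page_ids = read_chain(index[402])
print(len(page_ids))
```

Reading 100 identifiers this way follows only one link between blocks, whereas a list with one node per identifier would follow 99, which is the kind of saving the chain of blocks is meant to provide.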

In the above embodiment, the constructed index is a chain of blocks, which avoids excessive jumps during a query, so that contiguous data can be copied more efficiently; that is, the page identifiers are obtained more efficiently.

In one embodiment, the method further comprises: when the page code in the page code base is updated, performing feature extraction and feature dimension reduction processing on the updated page code; updating the page features subjected to dimensionality reduction on the updated page codes to corresponding positions in the bitmap; and in the index, establishing a mapping relation between the updated position and the page identifier corresponding to the updated page code.

It can be understood that updating to the corresponding position in the bitmap means that the page features of the updated page code after dimensionality reduction are recorded in the bitmap. It can be understood that when the page features after dimension reduction are changed, the corresponding positions of the page features in the bitmap are also changed.
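The update step can be sketched in Python as follows. The sketch assumes that the new bitmap position of the updated page has already been computed by feature extraction and dimension reduction. Removing the page identifier from its previous position, and the auxiliary `page_positions` map used to find that previous position, are assumptions added for the sketch; the text above only states that a mapping is established for the updated position.

```python
def update_page(bitmap, index, page_positions, page_id, new_pos):
    """Move a page identifier to the position of its updated reduced feature."""
    old_pos = page_positions.get(page_id)
    if old_pos is not None and old_pos != new_pos:
        # Assumption: the stale mapping at the old position is dropped.
        ids = index.get(old_pos, [])
        if page_id in ids:
            ids.remove(page_id)
        if not ids:
            index.pop(old_pos, None)
            bitmap[old_pos] = False
    bitmap[new_pos] = True
    ids_at_new = index.setdefault(new_pos, [])
    if page_id not in ids_at_new:
        ids_at_new.append(page_id)
    page_positions[page_id] = new_pos


# Hypothetical state: pages p1 and p2 both recorded at position 6.
bitmap = [False] * 16
bitmap[6] = True
index = {6: ["p1", "p2"]}
page_positions = {"p1": 6, "p2": 6}

# The code of p1 changed and its reduced feature now maps to position 9.
update_page(bitmap, index, page_positions, "p1", 9)
print(index)
```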

This is now illustrated schematically with reference to the architecture diagram of Fig. 5. Referring to Fig. 5, the server includes a fine matching unit and a coarse matching unit. The fine matching unit is used for feature extraction (that is, extracting page original features), feature storage (that is, storing page original features in memory), and fine matching (that is, matching between page original features). The coarse matching unit is used for feature dimension reduction and for index construction based on the dimension-reduced page features (that is, constructing an index between dimension-reduced page features and page identifiers). The coarse matching unit is also used for feature updating (that is, updating the dimension-reduced page features) and coarse matching (that is, finding suspected similar page identifiers according to the dimension-reduced page features). The fine matching unit and the coarse matching unit in the server can each interact with the storage system.

The entire page identification process is now briefly described with reference to the architecture diagram of Fig. 5. The fine matching unit can extract features from each page code in the page code library and store the extracted page original features in its own memory. The fine matching unit can pass the extracted page original features to the coarse matching unit, so that the coarse matching unit reduces their dimension and constructs an index from the dimension-reduced page features, establishing an index mapping relationship between the dimension-reduced page features and the corresponding page identifiers. In addition, the fine matching unit can send the extracted page original features to the storage system for persistent storage.

When a page code in the page code library is updated, the fine matching unit can update the page original features of the updated page code in its own memory, send the updated page original features to the storage system, and transmit an update request to the coarse matching unit, so that the coarse matching unit reduces the dimension of the updated page original features and obtains the updated dimension-reduced page features. The coarse matching unit may further update the constructed index according to the updated dimension-reduced page features (for example, by re-determining the position in the bitmap corresponding to the updated dimension-reduced page features and establishing a mapping relationship between the re-determined position and the page identifier).

After the terminal notifies the server of the target page, the fine matching unit in the server can obtain the target page original features of the target page from its own memory and send a matching request to the coarse matching unit, so that the coarse matching unit reduces the dimension of the target page original features and obtains the dimension-reduced page features corresponding to the target page. The coarse matching unit may then perform coarse matching according to these dimension-reduced page features, that is, perform an inverted index query on the constructed index to find the suspected similar page identifiers corresponding to the dimension-reduced page features. The coarse matching unit may transmit the suspected similar page identifiers found by coarse matching to the fine matching unit, so that the fine matching unit looks up the page original features corresponding to the suspected similar page identifiers in memory and matches them one by one against the target page original features, thereby implementing fine matching. According to the fine matching result, the pages that are finally determined to be similar to the target page are then screened out from the suspected similar pages corresponding to the suspected similar page identifiers.

It can be understood that, in the case that the index needs to be reconstructed, such as when the entire server is restarted or the version is changed greatly, the rough matching unit may pull the stored original features of the page from the storage system, and reconstruct the index by performing dimension reduction processing on the original features. When the data is not changed greatly, the rough matching unit may update the index only according to the update request of the fine matching unit.

In the embodiment, the index structure is updated according to the feature update of the page code, so that the accuracy of the constructed index is improved, and the subsequent query processing based on the index is more accurate.

Fig. 6 is a simplified diagram of the feature update and matching processes in one embodiment. Referring to Fig. 6, in the update process, the fine matching unit extracts page original features from the updated page codes in the page code library (feature extraction) and updates and stores these page original features (feature storage). The fine matching unit transmits the extracted page original features to the coarse matching unit, which reduces their dimension and then updates and stores the index. In the matching process, the fine matching unit extracts features from the page code of the target page and transmits the extracted target page original features to the coarse matching unit, which reduces their dimension and then performs inverted index matching (that is, coarse matching) to find the suspected similar page identifiers corresponding to the dimension-reduced page features. The coarse matching unit transmits the suspected similar page identifiers it has found to the fine matching unit, which pulls the corresponding page original features from memory and matches them against the target page original features of the target page, that is, performs full feature matching (which is the fine matching), and finally identifies the pages similar to the target page according to the matching result.
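The matching flow of Fig. 6 can be sketched in Python as follows. The sketch assumes that the bitmap position of the target page has already been determined, that page original features are sets, and that fine matching means a Jaccard similarity at or above a threshold; the threshold value, the function names, and the sample data are all hypothetical, since the matching criterion is not fixed by the passage above.

```python
def jaccard(a, b):
    """Jaccard similarity of two original feature sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def coarse_match(target_pos, index):
    """Coarse matching: inverted index query by bitmap position."""
    return list(index.get(target_pos, []))


def fine_match(target_features, suspected_ids, feature_store, threshold):
    """Fine matching: compare original features of suspected pages only."""
    return [
        pid for pid in suspected_ids
        if jaccard(target_features, feature_store[pid]) >= threshold
    ]


# Hypothetical index (coarse matching unit) and feature store (fine matching unit).
index = {6: ["p1", "p2"], 8: ["p3"]}
feature_store = {
    "p1": {"a", "b", "c", "d"},
    "p2": {"a", "x", "y", "z"},
    "p3": {"a", "b", "c", "d"},
}

target_features = {"a", "b", "c", "e"}
suspected = coarse_match(6, index)
similar = fine_match(target_features, suspected, feature_store, threshold=0.5)
print(suspected, similar)
```

Only the two suspected pages at position 6 are compared in full; `p3` is never examined, which shows how coarse matching limits the number of expensive original feature comparisons.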

The application also provides an application scene, and the application scene applies the page identification method. Specifically, the application of the page identification method in the application scenario is as follows:

The server performs feature extraction in advance on the applet pages in the applet code library, and performs feature dimension reduction on the extracted page original features to generate dimension-reduced applet page features. The server records the dimension-reduced applet page features at the corresponding positions in the bitmap and establishes a mapping relationship between each position in the bitmap and applet page identifiers, thereby constructing the index. An applet page identifier that has a mapping relationship with a position is the applet page identifier corresponding to an applet page code that has the applet page feature recorded at that position.

A technician selects the target applet page (an applet page being a sub-application page) that is to be checked for plagiarism, and the terminal notifies the server of the selected target applet page. The server extracts features from the target applet page to obtain the target page original features and reduces their dimension. The server can determine the position in the bitmap corresponding to the dimension-reduced applet page feature and then, using the determined position as the index entry, query the pre-constructed index for the applet page identifiers corresponding to that position to obtain the suspected similar applet page identifiers. The server can obtain the page original features corresponding to each suspected similar applet page identifier and match them against the target page original features of the target applet page. The server may determine the applet pages whose page original features pass the matching as applet pages similar to the target applet page. Thus, through coarse matching (that is, dimension reduction and inverted index query) and fine matching (that is, matching between page original features), abnormal applet pages suspected of plagiarism (that is, applet pages similar to the target applet page) are quickly and accurately identified from the applet code library.

In the above embodiment, the number of applet pages in the applet code library is on the order of billions and the applet page codes differ in length, so matching the target one by one against the applet codes in the library with the conventional method takes a long time, is inefficient, and consumes a large amount of system resources. With the page identification method of the embodiments of the present application, however, suspected similar applet pages can be quickly screened out through dimension reduction and inverted index query, and the page original features of those suspected similar applet pages are then finely matched against the target page original features of the target applet page. The abnormal applet pages suspected of plagiarism can thus be screened out quickly and accurately, and the consumption of system resources is greatly reduced.

It can be understood that the page identification method in the embodiment of the present application can also be applied in a scenario of a non-applet page. Namely, the method can be applied to any application scene for identifying similar pages. For example, web-side similar pages are identified from a code library.

It should be understood that, although the steps in the flowcharts of the present application are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time or in sequence, but may be performed at different times, in turn, or alternately with other steps or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in fig. 7, there is provided a page identification apparatus, which may be a part of a computer device using a software module or a hardware module, or a combination of the two, and specifically includes: a feature obtaining module 702, a feature dimension reducing module 704, an index query module 706, and a matching module 708, wherein:

a feature obtaining module 702, configured to obtain the target page original features extracted from the page code of the target page;

a feature dimension reduction module 704, configured to perform feature dimension reduction on the target page original features to obtain the dimension-reduced page features;

an index query module 706, configured to query the page identifiers corresponding to the page features to obtain suspected similar page identifiers; and

a matching module 708, configured to obtain the page original features corresponding to each suspected similar page identifier, match each of the page original features against the target page original features, and determine the pages corresponding to the page original features that pass the matching as pages similar to the target page.

In an embodiment, the index query module 706 is further configured to query, by using the page feature as an index entry, at least one page identifier corresponding to the page feature after the dimension reduction according to a preset mapping relationship between the page feature and the page identifier, so as to obtain a suspected similar page identifier.

In one embodiment, the index query module 706 is further configured to locate a position corresponding to the page feature in a preset bitmap; each position in the bitmap uniquely records a page feature; based on a pre-constructed index, performing reverse index query by taking the positioned position as an index item to obtain at least one page identifier corresponding to the positioned position and obtain a suspected similar page identifier; wherein, the index comprises the mapping relation between the position in the bitmap and the page identification; the page identifier having a mapping relation with the position refers to a page identifier corresponding to a page code having a page feature recorded in the position.

As shown in fig. 8, in one embodiment, the apparatus further comprises:

an index building module 701, configured to perform feature extraction on a page code in a page code library; performing feature dimension reduction processing on the extracted features of each page code; recording each page feature after dimensionality reduction in a corresponding position in a bitmap; and determining a page code with the page characteristics recorded by the position according to the position in the bitmap, establishing a mapping relation between the position and a page identifier corresponding to the determined page code, and generating an index.

In an embodiment, the index building module 701 is further configured to block the page identifier corresponding to the determined page code to obtain a plurality of page identifier blocks; connecting a plurality of page identification blocks to generate a page identification chain; and establishing a mapping relation between the position and the page identification chain, and generating an index.

As shown in fig. 9, in one embodiment, the apparatus further comprises:

the updating module 710 is configured to, when a page code in the page code library is updated, perform feature extraction and feature dimension reduction processing on the updated page code; updating the page features subjected to dimensionality reduction on the updated page codes to corresponding positions in the bitmap; and in the index, establishing a mapping relation between the updated position and the page identifier corresponding to the updated page code.

In an embodiment, the index query module 706 is further configured to perform reverse index query with the located position as an index item based on a pre-constructed index, and determine a page identifier chain corresponding to the located position; reading the page identifier in the page identifier block from the first page identifier block in the page identifier chain, and iteratively executing the step of reading the page identifier in the page identifier block for the next page identifier block in the page identifier chain until the iteration is stopped after the page identifier in the last page identifier block in the page identifier chain is read; the page identifier in each page identifier block in the page identifier chain is a page identifier corresponding to a page code having the page feature recorded in the located position.

In an embodiment, the feature dimension reduction module 704 is further configured to convert the feature dimensions and the feature value range of the target page original features into a preset dimension range and a preset feature value range, respectively, to obtain the dimension-reduced page features.

In an embodiment, the feature dimension reduction module 704 is further configured to perform a hash operation on each feature value in the original features of the target page according to at least one hash function, respectively, to obtain a hash value of each feature value in each hash operation; respectively selecting a characteristic value corresponding to the minimum hash value obtained by each hash operation from the characteristic values of multiple dimensions; and determining the page features of the original features of the target page after dimension reduction according to the selected feature values.
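The minimum hash operation described above can be sketched in Python as follows. The sketch assumes that feature values are strings and simulates several hash functions by prefixing a seed before a SHA-256 hash; the seeding scheme, the number of hash functions, and the function names are assumptions made for illustration only.

```python
import hashlib


def seeded_hash(seed, value):
    """One member of a family of hash functions, selected by an integer seed."""
    digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_features(feature_values, num_hashes=4):
    """For each hash function, keep the feature value with the smallest hash."""
    values = sorted(set(feature_values))
    if not values:
        raise ValueError("at least one feature value is required")
    return [
        min(values, key=lambda v: seeded_hash(seed, v))
        for seed in range(num_hashes)
    ]


original = ["view", "text", "button", "image", "slider", "canvas"]
reduced = minhash_features(original, num_hashes=4)
print(reduced)
```

The result always has `num_hashes` entries, however many feature values the page has, which gives the fixed-length fingerprint that the coarse matching relies on.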

In one embodiment, the target page is a child application page; the page identifier is an identifier of a sub application page; the sub-application page is a page provided by the sub-application; a child application is a lightweight application that runs in the environment provided by the original parent application.

For the specific definition of the page identification device, reference may be made to the above definition of the page identification method, which is not described herein again. The various modules in the page identification device can be wholly or partially implemented by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is for storing page identification data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a page recognition method.

Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.

In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.

In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps in the above-mentioned method embodiments.

It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).

The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to be within the scope of the present specification.

The above-mentioned embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A page identification method, characterized in that the method comprises:

2. The method of claim 1, wherein the querying the page identifier corresponding to the page feature to obtain a suspected similar page identifier comprises:

using the page feature after dimension reduction as an index entry, querying, according to a preset mapping relationship between page features and page identifiers, at least one page identifier corresponding to the page feature after dimension reduction, to obtain the suspected similar page identifier.

3. The method according to claim 2, wherein the querying at least one page identifier corresponding to the page feature after the dimensionality reduction according to a preset mapping relationship between the page feature and the page identifier by using the page feature as an index entry to obtain the suspected similar page identifier comprises:

positioning the corresponding position of the page feature in a preset bitmap; each position in the bitmap uniquely records a page feature;

based on a pre-constructed index, performing reverse index query by taking the positioned position as an index item to obtain at least one page identifier corresponding to the positioned position and obtain a suspected similar page identifier;

wherein the index comprises a mapping relation between positions in the bitmap and page identifications; the page identifier having a mapping relation with the position refers to a page identifier corresponding to a page code having the page feature recorded in the position.

4. The method of claim 3, further comprising:

extracting the characteristics of the page codes in the page code library;

performing feature dimension reduction processing on the extracted features of each page code;

recording each page feature after dimensionality reduction in a corresponding position in a bitmap;

and, for each position in the bitmap, determining the page code having the page feature recorded at the position, establishing a mapping relationship between the position and the page identifier corresponding to the determined page code, and generating an index.
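For illustration only, the index construction of claim 4 and the lookup of claim 3 might be sketched as follows. The sketch assumes that each dimension-reduced page feature is an integer in a preset range and that the feature value itself serves as its bitmap position. The names `PageIndex`, `BITMAP_SIZE`, `add_page` and `query` are hypothetical and do not appear in the application.

```python
from typing import Dict, Iterable, List, Set

BITMAP_SIZE = 1 << 16  # assumed preset feature value range (hypothetical)


class PageIndex:
    """Bitmap of recorded page features plus a position -> page-identifier index."""

    def __init__(self) -> None:
        self.bitmap = bytearray(BITMAP_SIZE // 8)
        self.index: Dict[int, List[str]] = {}

    def _set(self, pos: int) -> None:
        self.bitmap[pos >> 3] |= 1 << (pos & 7)

    def _is_set(self, pos: int) -> bool:
        return bool(self.bitmap[pos >> 3] & (1 << (pos & 7)))

    def add_page(self, page_id: str, reduced_features: Iterable[int]) -> None:
        """Record each reduced feature in the bitmap and map its position to page_id."""
        for pos in reduced_features:
            if not 0 <= pos < BITMAP_SIZE:
                raise ValueError(f"feature {pos} is outside the preset range")
            self._set(pos)
            ids = self.index.setdefault(pos, [])
            if page_id not in ids:
                ids.append(page_id)

    def query(self, reduced_features: Iterable[int]) -> Set[str]:
        """Return suspected similar page identifiers for the target page's features."""
        suspects: Set[str] = set()
        for pos in reduced_features:
            # Only positions whose bit is set have an entry in the index.
            if 0 <= pos < BITMAP_SIZE and self._is_set(pos):
                suspects.update(self.index.get(pos, []))
        return suspects


idx = PageIndex()
idx.add_page("page_a", [3, 70, 1024])
idx.add_page("page_b", [70, 5000])
print(sorted(idx.query([70])))
```

In this sketch the bitmap acts as a cheap membership check before the index is consulted; the suspected identifiers returned by `query` would still need to be confirmed by matching the original page features.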

5. The method of claim 4, wherein the establishing a mapping relationship between the position and a page identifier corresponding to the determined page code, and generating an index comprises:

partitioning the page identification corresponding to the determined page code to obtain a plurality of page identification blocks;

connecting a plurality of page identification blocks to generate a page identification chain;

and establishing a mapping relation between the position and the page identification chain, and generating an index.

6. The method according to claim 5, wherein the performing reverse index query with the located position as an index item based on a pre-constructed index to obtain at least one page identifier corresponding to the located position, and obtaining the suspected similar page identifier comprises:

based on a pre-constructed index, performing reverse index query by taking the positioned position as an index item, and determining a page identification chain corresponding to the positioned position;

reading the page identifier in the page identifier block from the first page identifier block in the page identifier chain, and iteratively executing the step of reading the page identifier in the page identifier block for the next page identifier block in the page identifier chain until the iteration is stopped after the page identifier in the last page identifier block in the page identifier chain is read;

and the page identifier in each page identifier block in the page identifier chain is a page identifier corresponding to a page code having the page feature recorded in the located position.
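As a rough sketch of the page identification chain described in claims 5 and 6, the page identifiers mapped to one bitmap position can be split into fixed-capacity blocks that are linked together and then read block by block. The block capacity and the names `PageIdBlock`, `build_chain` and `read_chain` are assumptions made for this example; the application does not specify them.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional

BLOCK_CAPACITY = 4  # hypothetical number of page identifiers per block


@dataclass
class PageIdBlock:
    page_ids: List[str] = field(default_factory=list)
    next: Optional["PageIdBlock"] = None


def build_chain(page_ids: List[str],
                capacity: int = BLOCK_CAPACITY) -> Optional[PageIdBlock]:
    """Partition page_ids into blocks and link the blocks into a chain."""
    head: Optional[PageIdBlock] = None
    tail: Optional[PageIdBlock] = None
    for start in range(0, len(page_ids), capacity):
        block = PageIdBlock(page_ids[start:start + capacity])
        if tail is None:
            head = block
        else:
            tail.next = block
        tail = block
    return head


def read_chain(head: Optional[PageIdBlock]) -> Iterator[str]:
    """Read identifiers from the first block, then iterate to each next block."""
    block = head
    while block is not None:
        yield from block.page_ids
        block = block.next


chain = build_chain([f"p{i}" for i in range(10)])
print(list(read_chain(chain)))
```

One possible motivation for such a layout is that a new block can be appended to the chain without relocating the identifiers already stored, although the claims do not state this.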

7. The method of claim 4, further comprising:

when page code in the page code library is updated, performing feature extraction and feature dimension reduction processing on the updated page code;

updating the page features subjected to dimensionality reduction on the updated page code to the corresponding position in the bitmap;

and in the index, establishing a mapping relation between the updated position and the page identifier corresponding to the updated page code.
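The incremental update of claim 7 could look roughly like the sketch below. It assumes the bitmap is a `bytearray` and the index is a dictionary from bitmap position to a list of page identifiers. The functions `apply_page_update` and `toy_features` are hypothetical; `toy_features` is only a stand-in for real feature extraction and dimension reduction. The sketch only adds mappings; removing mappings that have become stale is not described in the claim and is left out.

```python
from typing import Callable, Dict, Iterable, List


def toy_features(page_code: str) -> List[int]:
    """Stand-in for feature extraction plus dimension reduction (illustrative only)."""
    return [len(token) % 64 for token in page_code.split()]


def apply_page_update(bitmap: bytearray,
                      index: Dict[int, List[str]],
                      page_id: str,
                      updated_code: str,
                      extract_and_reduce: Callable[[str], Iterable[int]]) -> List[int]:
    """Re-extract features of the updated page code and refresh bitmap and index."""
    touched: List[int] = []
    for pos in extract_and_reduce(updated_code):
        bitmap[pos >> 3] |= 1 << (pos & 7)   # record the feature in the bitmap
        ids = index.setdefault(pos, [])      # map the position to the page identifier
        if page_id not in ids:
            ids.append(page_id)
        touched.append(pos)
    return touched


bitmap = bytearray(8)  # 64 positions, matching the toy feature range
index: Dict[int, List[str]] = {}
print(apply_page_update(bitmap, index, "p1", "ab cde", toy_features))
```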

8. The method according to claim 1, wherein the target page original features have feature values of a plurality of feature dimensions;

the step of performing feature dimensionality reduction on the original features of the target page to obtain the page features subjected to dimensionality reduction comprises the following steps:

and respectively converting the characteristic dimension of the characteristic value and the range of the characteristic value into a preset dimension range and a preset characteristic value range to obtain the page characteristic after dimension reduction.

9. The method according to claim 8, wherein the converting the characteristic dimension of the characteristic value and the range of the characteristic value into a preset dimension range and a preset characteristic value range respectively to obtain the page characteristic after the dimension reduction comprises:

performing hash operation on each characteristic value in the original characteristics of the target page according to at least one hash function respectively to obtain the hash value of each characteristic value in each hash operation;

respectively selecting a characteristic value corresponding to the minimum hash value obtained by each hash operation from the characteristic values of multiple dimensions;

and determining the page features of the original features of the target page after dimension reduction according to the selected feature values.
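The selection step of claim 9 resembles a MinHash-style reduction, and a minimal sketch under stated assumptions is given below. It assumes that original feature values are strings, that the "at least one hash function" is simulated by salting SHA-256 with a seed, and that each selected value is mapped into a preset integer range. The names `min_hash_select` and `reduce_features`, the number of hash functions and the value range are all hypothetical.

```python
import hashlib
from typing import Iterable, List


def _seeded_hash(value: str, seed: int) -> int:
    """One member of a family of hash functions, obtained by salting SHA-256."""
    digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def min_hash_select(features: Iterable[str], num_hashes: int = 4) -> List[str]:
    """For each hash function, pick the feature value with the minimum hash value."""
    values = list(features)
    if not values:
        return []
    return [min(values, key=lambda v, s=seed: _seeded_hash(v, s))
            for seed in range(num_hashes)]


def reduce_features(features: Iterable[str],
                    num_hashes: int = 4,
                    value_range: int = 1 << 16) -> List[int]:
    """Map each selected feature value into the preset value range (an assumption)."""
    selected = min_hash_select(features, num_hashes)
    return [_seeded_hash(value, seed) % value_range
            for seed, value in enumerate(selected)]


original = ["div.header", "ul.nav", "img.logo", "button.buy"]
print(min_hash_select(original))
print(reduce_features(original))
```

Whatever the number of original feature values, the output has exactly `num_hashes` entries, each inside the preset range, which matches the conversion to a preset dimension range and a preset feature value range described in claim 8.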

10. The method according to any one of claims 1 to 9, wherein the target page is a sub-application page; the page identifier is an identifier of a sub-application page; the sub-application page is a page provided by a sub-application; the sub-application is a lightweight application that runs in the environment provided by the original parent application.

11. An apparatus for page identification, the apparatus comprising:

12. The apparatus according to claim 11, wherein the index query module is further configured to query, by using the page feature as an index entry, at least one page identifier corresponding to the page feature after the dimensionality reduction according to a preset mapping relationship between the page feature and the page identifier, so as to obtain a suspected similar page identifier.

13. The apparatus according to claim 12, wherein the index query module is further configured to locate a corresponding position of the page feature in a preset bitmap; each position in the bitmap uniquely records a page feature; based on a pre-constructed index, performing reverse index query by taking the positioned position as an index item to obtain at least one page identifier corresponding to the positioned position and obtain a suspected similar page identifier; wherein the index comprises a mapping relation between positions in the bitmap and page identifications; the page identifier having a mapping relation with the position refers to a page identifier corresponding to a page code having the page feature recorded in the position.

14. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 10 when executing the computer program.

15. A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 10.