CN117591119B

CN117591119B - Mass APK source code feature extraction and similarity analysis method

Info

Publication number: CN117591119B
Application number: CN202311441226.0A
Authority: CN
Inventors: 段东圣; 侯炜; 张露晨; 佟玲玲; 段运强; 秦韬; 李美燕; 任博雅; 鲁睿; 张林波; 孙旷怡; 陈新兴; 张绪川; 王鹏
Original assignee: National Computer Network and Information Security Management Center
Current assignee: National Computer Network and Information Security Management Center
Priority date: 2023-11-01
Filing date: 2023-11-01
Publication date: 2024-05-31
Anticipated expiration: 2043-11-01
Also published as: CN117591119A

Abstract

The invention relates to the technical field of software detection and discloses a method for feature extraction and similarity analysis of massive APK source code. The method comprises: first inputting two APK files, extracting the AndroidManifest file and the localized language configuration files of each APK package by source code decompilation, and extracting the SMALI or JAVA source code; identifying the APK core source code directory, the third-party package directories and the system resource directories by package name index, launcher class index and fixed directory identification, and generating a source code tree; analyzing the files in the core source code directory, calculating a file hash for each, and extracting the string declaration feature representation of each source code file as a weighting feature; and calculating the similarity of the two source code tree structures under analysis, applying different similarity weights according to the type of source code directory. The method reduces the resources and time required for analysis, improves the accuracy of source code similarity analysis, and enables high-performance analysis in large-scale APK data analysis scenarios.

Description

Mass APK source code feature extraction and similarity analysis method

Technical Field

The invention relates to the technical field of software detection, and in particular to a method for feature extraction and similarity analysis of massive APK source code.

Background

The field of APK (Android application package) source code similarity analysis has developed markedly in recent years, notably in the following areas:

1. Code comparison algorithms: more efficient and accurate code comparison algorithms have been developed for comparing and analyzing similarities between APK source codes. These algorithms can identify differences between versions of an application and identify reused code segments.

2. Code clone detection: clone detection techniques can identify cloned (duplicated) code segments in APK source code. This is important for code maintenance and refactoring, and helps developers reduce repetitive work and improve code quality.

3. Feature extraction and representation: researchers have proposed various feature extraction and representation methods for capturing similar features in APK source code. For example, an AST (abstract syntax tree) is used to represent code structure, and TF-IDF (term frequency-inverse document frequency) is used to represent keywords in the code.

4. Machine learning and deep learning: machine learning and deep learning techniques are applied to APK source code similarity analysis to improve the accuracy of similarity matching and detection. For example, convolutional neural networks (CNNs) or recurrent neural networks (RNNs) may be used to learn the representation and similarity of APK source code.

Related prior art includes the following:

In one existing approach, the final step calculates, from the space vectors and a cosine similarity calculation method, a cosine similarity that represents the similarity of the source code, thereby helping a development team identify source code with repeated or similar logic and providing a basis for decisions such as code refactoring and service merging.

1) At present, mainstream APP source code similarity analysis algorithms generally assess the similarity of APP packages and source code using the contents of the AndroidManifest file and a source code diff algorithm. All source code files of the APP are obtained by decompilation, and each file is traversed and compared line by line with the diff algorithm, which also correlates the surrounding context. Similarity analysis of this kind is suited to content management scenarios such as Git and SVN, but not to the analysis of massive numbers of APPs.

2) Current mainstream source code similarity analysis techniques mainly compare the content of two source code files and are not readily applicable to massive numbers of APPs in real-world services. Most such techniques operate on a pair of source code files, whereas an APP is a bundle of a large number of source code and asset files, so they are difficult to apply directly in this scenario. In addition, packing (shelling) and obfuscation change the file names, variable names and business logic of APP source code, so the content of the same source code differs after such processing, and reverse engineering and unpacking can rarely restore it to its original state. As a result, source code similarity techniques struggle to produce stable APP similarity results. In view of these problems, a method for feature extraction and similarity analysis of massive APK source code is needed.

Disclosure of Invention

The invention aims to provide a method for feature extraction and similarity analysis of massive APK source code. By extracting the management files of the APK, building a map of the APK package's directory structure and source code files, optimizing the analysis process and adding weights for specific items, the invention reduces the resources and time required for analysis, improves the accuracy of source code similarity analysis, and enables high-performance analysis in large-scale APK data analysis scenarios.

The invention is realized as follows:

The invention provides a method for feature extraction and similarity analysis of massive APK source code, comprising the following steps:

S1: first, two APK files are input; the AndroidManifest file and the localized language configuration files of each APK package are extracted by source code decompilation, and the SMALI or JAVA source code is extracted. To extract the AndroidManifest file, the APK is first decompiled with the existing APK analysis tools apktool and jadx; if an exception occurs during decompilation, the APK information is instead extracted by unpacking the archive and parsing it according to the Android package structure specification. The decompilation finally outputs smali source code and the AndroidManifest file.
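The decompile-then-fall-back logic of this step could be sketched as follows. This is a minimal illustration, assuming apktool is installed and invoked as `apktool d <apk> -o <dir> -f`; the function name `decompile_apk` and its return labels are hypothetical and not part of the patent. jadx could be tried in the same way before falling back. The fallback shown here only unpacks the ZIP container, so the manifest it yields is still binary XML and would need further parsing.

```python
import shutil
import subprocess
import zipfile
from pathlib import Path


def decompile_apk(apk_path, out_dir, apktool_cmd="apktool"):
    """Decode an APK with apktool; if that fails, fall back to plain unzip.

    Returns "apktool" or "unzip" to show which path produced out_dir.
    """
    apk_path, out_dir = Path(apk_path), Path(out_dir)
    try:
        # `apktool d` decodes resources and emits smali; -f overwrites out_dir.
        subprocess.run(
            [apktool_cmd, "d", str(apk_path), "-o", str(out_dir), "-f"],
            check=True, capture_output=True, timeout=600,
        )
        return "apktool"
    except (OSError, subprocess.SubprocessError):
        # Tool missing, non-zero exit, or timeout: an APK is a ZIP archive,
        # so unpack it directly and leave further parsing to later stages.
        shutil.rmtree(out_dir, ignore_errors=True)
        out_dir.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(apk_path) as archive:
            archive.extractall(out_dir)
        return "unzip"
```

Returning the path taken lets a batch pipeline record which APKs needed the fallback and may therefore have less reliable source trees.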

S2: the APK core source code directory, the third-party package directories and the system resource directories are identified by package name index, launcher class index and fixed directory identification, and a source code tree is generated. A directory feature set is compiled from the default source code file organization of the mainstream IDE Android Studio and from community conventions. The core code directory is resolved layer by layer from the package name and the launcher class structure, that is, from the naming pattern of the package name and the location of the launcher class.
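One way to picture the directory classification is the sketch below. It assumes directories are given as slash-separated paths relative to the smali root; the prefix list `SYSTEM_PREFIXES` is a hypothetical stand-in for the curated directory feature set, and treating every remaining directory as third-party is a simplifying assumption rather than the patent's rule.

```python
from pathlib import PurePosixPath

# Hypothetical "fixed directory" feature set; a real list would be curated.
SYSTEM_PREFIXES = ("android/", "androidx/", "kotlin/", "java/", "javax/")


def classify_directory(rel_dir, package_name, launcher_class=None):
    """Label a source directory as 'core', 'system', or 'third_party'."""
    path = rel_dir.strip("/") + "/"
    # Package name index: com.example.app -> com/example/app/
    core_roots = [package_name.replace(".", "/") + "/"]
    if launcher_class:
        # Launcher class index: the launcher's own package also counts as core.
        parent = PurePosixPath(launcher_class.replace(".", "/")).parent
        core_roots.append(str(parent) + "/")
    if any(path.startswith(root) for root in core_roots):
        return "core"
    if path.startswith(SYSTEM_PREFIXES):
        return "system"
    return "third_party"
```

For example, `classify_directory("okhttp3/internal", "com.example.app")` yields `"third_party"` under these assumptions, while any directory under `com/example/app/` is labeled `"core"`.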

S3: the files in the core source code directory are analyzed, a file hash is calculated for each, and the string declaration feature representation of each source code file is extracted as a weighting feature. The AndroidManifest file includes the APP name, package name, permissions, attributes and service declarations. Step S3 specifically proceeds as follows:

S3.1: word segmentation: the input configuration file is split into individual vocabulary units by attribute or name, and symbols or characters with no specific meaning are filtered out;

S3.2: feature extraction: for each vocabulary unit, a hash value is calculated and multiplied by a weight, the weight being set according to the importance or frequency of the vocabulary unit;

S3.3: feature combination: the feature vectors of all vocabulary units are combined, so that a fixed-length vector represents the whole text;

S3.4: SimHash calculation: the combined feature vectors are weighted and summed; each position is set to 1 if the value at that position is greater than 0 and to -1 otherwise, finally yielding a binary SimHash value;

S3.5: SimHash comparison: the SimHash values of different texts are compared, using the Hamming distance to measure the similarity of two SimHash values.
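The SimHash sub-steps above can be sketched as follows. This is an illustration under stated assumptions: MD5 is used as the per-token hash, token frequency is used as the weight, and the tokenizer regex is arbitrary; none of these choices is specified in the text. The sketch also encodes non-positive positions as 0 bits, as standard SimHash does, where the text says -1; the Hamming distance is the same under either encoding.

```python
import hashlib
import re
from collections import Counter


def simhash(text, bits=64):
    """Compute a SimHash fingerprint (bits must be a multiple of 8, <= 128)."""
    tokens = re.findall(r"[A-Za-z0-9_.]+", text)        # tokenize, drop symbols
    totals = [0] * bits
    for token, weight in Counter(tokens).items():        # weight = frequency
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # A set hash bit adds the weight, a clear bit subtracts it.
            totals[i] += weight if (value >> i) & 1 else -weight
    # Positions with a positive total become 1 bits, all others 0 bits.
    return sum(1 << i for i in range(bits) if totals[i] > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def simhash_similarity(a, b, bits=64):
    """Map a Hamming distance to a similarity ratio in [0, 1]."""
    return 1.0 - hamming_distance(a, b) / bits
```

Because each fingerprint is a fixed-size integer, comparing two configuration files costs a single XOR and bit count, which is what makes this step practical at large scale.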

The features of the source code are obtained from the variables and attributes in the smali or java source code; the name, type and occurrence frequency of each variable are aggregated to form the feature representation word set of the current source code file.

The similarity of source code feature representations is calculated by comparing the coverage of the variable intersection and checking the consistency of the variables' occurrence frequencies and types; if the variable intersection exceeds the threshold of 70%, the two source code feature representations are considered similar.
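A minimal sketch of this feature comparison follows. The text fixes the 70% threshold but not how coverage is normalized, so two choices here are assumptions: a variable counts as shared only when its type and occurrence count both match, and the shared count is divided by the size of the larger feature set. The function name `features_similar` is hypothetical.

```python
def features_similar(features_a, features_b, threshold=0.70):
    """Compare two feature sets of the form {variable name: (type, count)}."""
    if not features_a or not features_b:
        return False
    # Shared variables: same name, and matching type and occurrence count.
    shared = [
        name for name, info in features_a.items()
        if features_b.get(name) == info
    ]
    # Assumption: normalize by the larger set so extra variables lower coverage.
    coverage = len(shared) / max(len(features_a), len(features_b))
    return coverage > threshold
```

With four variables per file, three exact matches give a coverage of 0.75 and the files count as similar, while a single match gives 0.25 and they do not.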

S4: the similarity of the two source code tree structures under analysis is calculated, with different similarity weights applied according to the type of source code directory: +2 for the core source code directory, +1 for third-party package directories, and 0 for system resource directories;

S5: the source code file at the end node of each tree is evaluated, with weight +2 if the source code files are identical and weight +1 if their feature representations are similar. Specifically:

S5.1: first, the node similarity weight of the source code file at the end node of each tree is calculated;

S5.2: whether the file hashes are the same is judged; if so, the weight is +2, otherwise +0;

S5.3: whether the file feature representations are similar is judged; if so, the weight is +1, otherwise +0;

S5.4: the weight is output.
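The directory and end-node weights can be expressed compactly, as in the sketch below. It reads the two end-node checks as additive (so an identical file with similar features scores 3), which is one interpretation of the text; SHA-256 as the file hash and the names `DIRECTORY_WEIGHT`, `file_hash` and `leaf_weight` are assumptions.

```python
import hashlib

# Directory-type weights: core +2, third-party +1, system resources 0.
DIRECTORY_WEIGHT = {"core": 2, "third_party": 1, "system": 0}


def file_hash(data):
    """Hash a file's bytes; SHA-256 is an assumed choice of algorithm."""
    return hashlib.sha256(data).hexdigest()


def leaf_weight(hash_a, hash_b, features_are_similar):
    """Weight for one pair of end-node source files."""
    weight = 0
    if hash_a == hash_b:
        weight += 2          # identical file content
    if features_are_similar:
        weight += 1          # similar feature representation
    return weight
```

The +1 branch is what allows an obfuscated or lightly modified file, whose hash no longer matches, to still contribute to the similarity score.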

S6: the similarity of the two trees is calculated by averaging a bidirectional comparison to produce the source code tree similarity. Specifically, s1 is obtained as the coverage of tree A in tree B, s2 is obtained as the coverage of tree B in tree A, and the tree structure similarity S_T is output as (s1 + s2)/2, as shown in formula (1):

$$S_T = \frac{s_1 + s_2}{2} \tag{1}$$
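The bidirectional comparison can be illustrated with the sketch below. The text does not define coverage precisely, so this example assumes each tree is a mapping from relative file path to a (directory weight, file hash) pair, and that a file is covered when the other tree holds the same path with the same hash; feature-based partial matches are left out for brevity.

```python
def coverage(tree_a, tree_b):
    """Weighted share of tree_a's files that also appear in tree_b."""
    total = sum(weight for weight, _ in tree_a.values())
    if total == 0:
        return 0.0
    matched = sum(
        weight
        for path, (weight, digest) in tree_a.items()
        if path in tree_b and tree_b[path][1] == digest
    )
    return matched / total


def tree_similarity(tree_a, tree_b):
    """Average of the two directional coverages, (s1 + s2) / 2."""
    s1 = coverage(tree_a, tree_b)   # coverage of tree A in tree B
    s2 = coverage(tree_b, tree_a)   # coverage of tree B in tree A
    return (s1 + s2) / 2
```

Averaging both directions keeps a small APP that is fully contained in a much larger one from scoring as a perfect match: with the weighted files below, s1 is 0.6, s2 is 1.0, and the result is 0.8.

```python
a = {"core/A.smali": (2, "h1"), "core/B.smali": (2, "h2"), "lib/C.smali": (1, "h3")}
b = {"core/A.smali": (2, "h1"), "lib/C.smali": (1, "h3")}
score = tree_similarity(a, b)
```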

S7: the similarity of the AndroidManifest files and of the localized language configurations of the two APPs is analyzed with the SimHash algorithm, and a similarity ratio is output by calculating the Hamming distance. Let the two input APPs be A and B, let the AndroidManifest files be denoted C (Config), namely C_a and C_b, and let the localized language configurations be denoted L (Language), namely L_a and L_b. The similarity attributes S_C and S_L are then output from the AndroidManifest files and the language configuration files of the two APPs, as shown in formulas (2) and (3):

$$S_C = \mathrm{similarity}(\mathrm{simhash}(C_a), \mathrm{simhash}(C_b)) \tag{2}$$

$$S_L = \mathrm{similarity}(\mathrm{simhash}(L_a), \mathrm{simhash}(L_b)) \tag{3}$$

S8: finally, the three values, namely the tree structure similarity, the AndroidManifest similarity and the localized language configuration similarity, are combined by weighted summation in the ratio x:y:z, where x, y and z are the weight coefficients of the tree structure similarity, the AndroidManifest similarity and the localized language configuration similarity respectively, set according to the importance and contribution of each similarity result in the final similarity calculation. The final APP similarity S is calculated by this weighted summation, as shown in formula (4):

$$S = x S_T + y S_C + z S_L \tag{4}$$
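The final weighted summation might look like the sketch below. The default coefficients are purely illustrative, since the text leaves x, y and z to be tuned, and dividing by x + y + z is an added assumption so that any x:y:z ratio keeps the result in [0, 1] when the three inputs are in [0, 1].

```python
def app_similarity(s_t, s_c, s_l, x=0.6, y=0.3, z=0.1):
    """Combine tree, manifest and language-config similarity in ratio x:y:z."""
    total = x + y + z
    if total <= 0:
        raise ValueError("weight coefficients must sum to a positive value")
    # Normalizing by the total lets callers pass a plain ratio such as 2:1:1.
    return (x * s_t + y * s_c + z * s_l) / total
```

For instance, `app_similarity(0.8, 0.5, 1.0, x=2, y=1, z=1)` evaluates (1.6 + 0.5 + 1.0) / 4, which is approximately 0.775.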

Further, the present invention provides a computer readable storage medium storing a computer program which when executed by a main controller implements a method as described in any one of the above.

Compared with the prior art, the invention has the beneficial effects that:

1. By extracting the management files of the APK, building a map of the APK package's directory structure and source code files, optimizing the analysis process and adding weights for specific items, the invention reduces the resources and time required for analysis, improves the accuracy of source code similarity analysis, and enables high-performance analysis in large-scale APK data analysis scenarios.

Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some examples of the present invention and should therefore not be considered as limiting its scope; a person skilled in the art can obtain other related drawings from these drawings without inventive effort.

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a flow chart of the present invention for computing end nodes of each tree;

FIG. 3 is a code operation diagram of the present invention for obtaining variable names, types, and occurrence frequencies of a current source code file.

Detailed Description

To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, embodiments of the present invention. The following detailed description of the embodiments, as presented in the figures, is therefore not intended to limit the scope of the invention as claimed, but merely represents selected embodiments. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without inventive effort fall within the scope of the invention.

Referring to FIGS. 1-3, a method for feature extraction and similarity analysis of massive APK source code comprises: inputting two APK files, extracting the AndroidManifest file and the localized language configuration files of each APK package by source code decompilation, and extracting the SMALI or JAVA source code. To extract the AndroidManifest file, the APK is first decompiled with the existing APK analysis tools apktool and jadx; if an exception occurs during decompilation, the APK information is instead extracted by unpacking the archive and parsing it according to the Android package structure specification. The decompilation finally outputs smali source code and the AndroidManifest file.

The features of the source code are obtained from the variables and attributes in the smali or java source code; the name, type and occurrence frequency of each variable are aggregated to form the feature representation word set of the current source code file. Sample feature representation data are shown in Table 1:

Table 1 Feature representation data samples

In this embodiment, the present invention provides a computer-readable storage medium storing a computer program which, when executed by a main controller, implements a method as described in any one of the above.

The above describes only preferred embodiments of the present invention and is not intended to limit it; various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.

Claims

1. A method for feature extraction and similarity analysis of massive APK source code, characterized by comprising the following steps:

S1: first, two APK files are input; the AndroidManifest file and the localized language configuration files of each APK package are extracted by source code decompilation, and the SMALI or JAVA source code is extracted;

S2: the APK core source code directory, the third-party package directories and the system resource directories are identified by package name index, launcher class index and fixed directory identification, and a source code tree is generated;

S3: the files in the core source code directory are analyzed, a file hash is calculated for each, and the string declaration feature representation of each source code file is extracted as a weighting feature;

S4: the similarity of the two source code tree structures under analysis is calculated, with different similarity weights applied according to the type of source code directory: +2 for the core source code directory, +1 for third-party package directories, and 0 for system resource directories;

S5: the source code file at the end node of each tree is evaluated, with weight +2 if the source code files are identical and weight +1 if their feature representations are similar;

S6: the similarity of the two trees is calculated by averaging a bidirectional comparison to produce the source code tree similarity; specifically, s1 is obtained as the coverage of tree A in tree B, s2 is obtained as the coverage of tree B in tree A, and the tree structure similarity S_T is output as (s1 + s2)/2, as shown in formula (1);

S7: the similarity of the AndroidManifest files and of the localized language configurations of the two APPs is analyzed with the SimHash algorithm, a similarity ratio is output by calculating the Hamming distance, and the similarity attributes S_C and S_L are output, as shown in formulas (2) and (3):

$$S_C = \mathrm{similarity}(\mathrm{simhash}(C_a), \mathrm{simhash}(C_b)) \tag{2}$$

$$S_L = \mathrm{similarity}(\mathrm{simhash}(L_a), \mathrm{simhash}(L_b)) \tag{3}$$

S8: finally, the three values, namely the tree structure similarity, the AndroidManifest similarity and the localized language configuration similarity, are combined by weighted summation in the ratio x:y:z, where x, y and z are the weight coefficients of the tree structure similarity, the AndroidManifest similarity and the localized language configuration similarity respectively, and the final APP similarity S is calculated by this weighted summation, as shown in formula (4).

2. The method for feature extraction and similarity analysis of massive APK source code according to claim 1, wherein step S5 is specifically performed as follows:

S5.1: first, the node similarity weight of the source code file at the end node of each tree is calculated;

S5.2: whether the file hashes are the same is judged; if so, the weight is +2, otherwise +0;

S5.3: whether the file feature representations are similar is judged; if so, the weight is +1, otherwise +0;

S5.4: the weight is output.

3. The method for feature extraction and similarity analysis of massive APK source code according to claim 1, wherein in step S1, to extract the AndroidManifest file of the APK package by source code decompilation, the APK is first decompiled with the existing APK analysis tools apktool and jadx; if an exception occurs during decompilation, the APK information is extracted by unpacking the archive and parsing it according to the Android package structure specification; and the decompilation finally outputs smali source code and the AndroidManifest file.

4. The method for feature extraction and similarity analysis of massive APK source code according to claim 1, wherein in step S2, a directory feature set is compiled from the default source code file organization of the mainstream IDE Android Studio and from community conventions; and the core code directory is resolved layer by layer from the package name and the launcher class structure, that is, from the naming pattern of the package name and the location of the launcher class.

5. The method for feature extraction and similarity analysis of massive APK source code according to claim 1, wherein the AndroidManifest file includes the APP name, package name, permissions, attributes and service declarations, and step S3 is specifically performed as follows:

6. The method for feature extraction and similarity analysis of massive APK source code according to claim 1, wherein the features of the source code are obtained from the variables and attributes in the smali or java source code, and the name, type and occurrence frequency of each variable are aggregated to form the feature representation word set of the current source code file.

7. The method for feature extraction and similarity analysis of massive APK source code according to claim 6, wherein the similarity of source code feature representations is calculated by comparing the coverage of the variable intersection and checking the consistency of the variables' occurrence frequencies and types, and if the variable intersection exceeds the threshold of 70%, the two source code feature representations are considered similar.

8. A computer readable storage medium storing a computer program, which when executed by a main controller implements the method of any of the preceding claims 1-7.