CN117591119A

CN117591119A - Mass APK source code feature extraction and similarity analysis method

Info

Publication number: CN117591119A
Application number: CN202311441226.0A
Authority: CN
Inventors: 段东圣; 侯炜; 张露晨; 佟玲玲; 段运强; 秦韬; 李美燕; 任博雅; 鲁睿; 张林波; 孙旷怡; 陈新兴; 张绪川; 王鹏
Original assignee: National Computer Network and Information Security Management Center
Current assignee: National Computer Network and Information Security Management Center
Priority date: 2023-11-01
Filing date: 2023-11-01
Publication date: 2024-02-23
Anticipated expiration: 2043-11-01
Also published as: CN117591119B

Abstract

The invention relates to the technical field of software detection and discloses a method for mass APK source code feature extraction and similarity analysis. Two APK files are first input; the AndroidManifest file and the localized language configuration files of each APK package are extracted by source code analysis and decompilation, and the SMALI or JAVA source code is extracted. The APK core source code directory, the third-party package directory and the system resource directory are identified by a package name index, a startup class index and fixed directory identification, and a source code tree is generated. The files in the core source code directory are analyzed, a file hash is calculated, and the string declaration feature representation in each source code file is extracted as a weighting feature. The similarity of the two source code tree structures to be analyzed is then calculated, with the similarity weighted differently according to the type of source code directory. The method reduces the resources and time required for analysis, improves the accuracy of source code similarity analysis, and enables high-performance analysis in large-scale APK data analysis scenarios.

Description

Mass APK source code feature extraction and similarity analysis method

Technical Field

The invention relates to the technical field of software detection, and in particular to a method for mass APK source code feature extraction and similarity analysis.

Background

In the technical field of APK (Android application package) source code similarity analysis, remarkable progress has been made in recent years, mainly in the following areas:

1. Code comparison algorithms: more efficient and accurate code comparison algorithms have been developed for comparing and analyzing similarities between APK source codes. These algorithms can identify differences between versions of an application and identify reused code segments.

2. Code clone detection: clone detection techniques can identify cloned code segments, i.e. duplicated code, in APK source code. This is important for code maintenance and refactoring, and helps developers reduce repetitive work and improve code quality.

3. Feature extraction and representation: researchers have proposed various feature extraction and representation methods for capturing similar features in APK source code. For example, an AST (abstract syntax tree) is used to represent code structure, and TF-IDF (term frequency-inverse document frequency) is used to represent keywords in code.

4. Machine learning and deep learning: machine learning and deep learning techniques are applied to APK source code similarity analysis to improve the accuracy of similarity matching and detection. For example, convolutional neural networks (CNNs) or recurrent neural networks (RNNs) may be used to learn the representation and similarity of APK source code.

The related prior art includes the following:

and finally, the cosine similarity representing the similarity of the source codes is calculated from the space vectors by a cosine similarity calculation method, thereby helping a development team identify source code with repeated or similar logic and providing a basis for decisions in scenarios such as code refactoring and service merging.

1) At present, mainstream APP source code similarity analysis algorithms generally analyze the similarity of APP packages and source code through the AndroidManifest file content and a source code diff algorithm. All source code files of the APP are obtained by decompilation, each source code file is traversed and compared line by line with the diff algorithm, and the context content is also correlated. Such algorithms therefore consume considerable computing resources and run slowly; they are generally suited to content management scenarios such as Git and SVN, and are not suited to mass APP analysis scenarios.

2) Current mainstream source code similarity analysis technology mainly compares the content similarity of two source code files and is not conveniently applicable to mass APPs in practical business applications. Most source code similarity analysis technologies operate on two source code files, whereas an APP is a combined package of a large number of source code and material files, so these technologies are difficult to apply directly and conveniently in this scenario. In addition, during similarity analysis of APK packages, techniques such as packing (shell adding) and obfuscation change the file naming, variable naming and business logic of the APP source code. The content of the same source code changes after packing or obfuscation, and the source code is difficult to restore to its original state after reverse engineering and unpacking, so source code similarity analysis technology can hardly guarantee stable APP similarity analysis results. In view of these problems, a method for mass APK source code feature extraction and similarity analysis is needed.

Disclosure of Invention

The invention aims to provide a method for mass APK source code feature extraction and similarity analysis. By extracting the AndroidManifest file of the APK, constructing the directory structure and source code file map of the APK package, optimizing the analysis process and increasing the weight of specific items, the invention reduces the resources and time required for analysis, improves the accuracy of source code similarity analysis, and achieves high-performance analysis in large-scale APK data analysis scenarios.

The invention is realized as follows:

The invention provides a method for mass APK source code feature extraction and similarity analysis, comprising the following steps:

S1: First, two APK files are input; the AndroidManifest file and the localized language configuration files of each APK package are extracted by source code analysis and decompilation, and the SMALI or JAVA source code is extracted. To extract the AndroidManifest file, the APK is decompiled with existing APK analysis tools such as apktool and jadx; if an exception occurs during decompilation, the APK information is extracted by decompressing the package and parsing it according to the Android package structure specification. The decompiled smali source code is finally output and the AndroidManifest file is obtained.
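The fallback logic of step S1 can be illustrated with a minimal Python sketch. It assumes that `apktool` and `jadx` are installed on the `PATH`; the function name `decompile_apk` and its `tools` parameter are hypothetical names introduced for illustration only. The final fallback treats the APK as an ordinary ZIP archive, which yields the raw package body rather than decoded sources.

```python
import shutil
import subprocess
import zipfile
from pathlib import Path


def decompile_apk(apk_path, out_dir, tools=("apktool", "jadx")):
    """Try each decompiler in turn; fall back to plain unzip if all fail.

    Returns the name of the strategy that succeeded.
    """
    out = Path(out_dir)
    commands = {
        # `apktool d <apk> -o <dir> -f` decodes resources and emits smali.
        "apktool": ["apktool", "d", str(apk_path), "-o", str(out), "-f"],
        # `jadx -d <dir> <apk>` emits Java sources and decoded resources.
        "jadx": ["jadx", "-d", str(out), str(apk_path)],
    }
    for name in tools:
        cmd = commands[name]
        if shutil.which(cmd[0]) is None:
            continue  # tool not installed; try the next one
        try:
            subprocess.run(cmd, check=True, capture_output=True, timeout=600)
            return name
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            shutil.rmtree(out, ignore_errors=True)  # discard partial output

    # Fallback: an APK is a ZIP archive, so extract the raw package body.
    # The manifest is still binary XML at this point and the classes*.dex
    # files still need a separate disassembly pass to become smali.
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(apk_path) as archive:
        archive.extractall(out)
    return "unzip"
```

The return value records which path was taken, so a batch pipeline could log how often the decompilers fail and the structural fallback is used.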

S2: The APK core source code directory, the third-party package directory and the system resource directory are identified by a package name index, a startup class index and fixed directory identification, and a source code tree is generated. A directory feature set is summarized and constructed from the default source code file organization of mainstream IDEs such as Android Studio and from community conventions. The core code file directory is resolved layer by layer through the package name and the startup class structure, that is, through the naming pattern of the package name and the location of the startup class;
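One possible reading of the directory classification in step S2 is sketched below in Python. The prefix list `SYSTEM_PREFIXES` and the helper names `classify_dir` and `build_source_tree` are assumptions made for illustration; the text does not enumerate the fixed directory feature set, so a real one would have to be curated separately.

```python
from pathlib import PurePosixPath

# Hypothetical fixed-directory features (an assumption, not from the text).
SYSTEM_PREFIXES = ("android/", "androidx/", "kotlin/", "kotlinx/", "java/", "javax/")


def classify_dir(rel_dir, package_name, launcher_class):
    """Label a source directory as 'core', 'system' or 'third_party'."""
    path = rel_dir.strip("/") + "/"
    # Package name index: com.example.app -> com/example/app/
    package_dir = package_name.replace(".", "/") + "/"
    # Startup class index: directory that holds the startup class.
    launcher_dir = launcher_class.rsplit(".", 1)[0].replace(".", "/") + "/"
    if path.startswith(package_dir) or path.startswith(launcher_dir):
        return "core"
    if path.startswith(SYSTEM_PREFIXES):
        return "system"
    return "third_party"


def build_source_tree(rel_files, package_name, launcher_class):
    """Map each directory to its label and the files directly inside it."""
    tree = {}
    for rel_file in rel_files:
        file_path = PurePosixPath(rel_file)
        parent = str(file_path.parent)
        node = tree.setdefault(
            parent,
            {"kind": classify_dir(parent, package_name, launcher_class), "files": []},
        )
        node["files"].append(file_path.name)
    return tree
```

Anything that matches neither the application package nor a fixed prefix is treated as a third-party package here, which is a simplification chosen for brevity.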

S3: The files in the core source code directory are analyzed, a file hash is calculated, and the string declaration feature representation in each source code file is extracted as a weighting feature. The AndroidManifest file includes the APP name, package name, permissions, attributes and service declarations. Step S3 specifically comprises the following steps:

S3.1: Word segmentation: the input configuration file is segmented into individual vocabulary units according to attributes or names, and symbols or characters without specific meaning are filtered out;

S3.2: Feature extraction: for each vocabulary unit, its hash value is calculated and multiplied by a weight, the weight being set according to the importance or frequency of the vocabulary unit;

S3.3: Feature combination: the feature vectors of all vocabulary units are combined so that the whole text is represented by a vector of fixed length;

S3.4: SimHash computation: the combined feature vectors are weighted and summed; each position of the result is set to 1 if the value at that position is greater than 0, and to -1 otherwise, finally yielding a binary SimHash value;

S3.5: Comparison: the SimHash values of different texts are compared, and the similarity of two SimHash values is measured by the Hamming distance.
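Steps S3.1 to S3.5 follow the usual SimHash construction, which can be sketched in Python as below. Several details are assumptions of this sketch rather than statements of the text: MD5 is used as the per-token hash, the fingerprint is 64 bits wide, tokens without an explicit weight default to 1, non-positive positions are stored as 0 bits, and the similarity ratio is taken as one minus the normalized Hamming distance.

```python
import hashlib
import re


def tokenize(text):
    """S3.1: split into vocabulary units, dropping meaningless symbols."""
    return re.findall(r"[A-Za-z0-9_.]+", text)


def simhash(text, weights=None, bits=64):
    """S3.2 to S3.4: weighted per-bit voting over token hashes.

    `bits` must be a multiple of 8 and at most 128 (MD5 digest size).
    """
    weights = weights or {}
    totals = [0] * bits
    for token in tokenize(text):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        weight = weights.get(token, 1)
        for i in range(bits):
            # A set bit votes +weight, a clear bit votes -weight.
            totals[i] += weight if (value >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """S3.5: number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def simhash_similarity(a, b, bits=64):
    """Similarity ratio in [0, 1]; 1.0 means identical fingerprints."""
    return 1.0 - hamming_distance(a, b) / bits
```

Because each fingerprint is a single integer, comparing two manifests costs one XOR and a bit count, which is what makes this step cheap at large scale.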

The feature representation of the source code is obtained by collecting the variables and attributes in the smali or java source code and summarizing three elements of each collected variable, namely its name, type and occurrence frequency, to form the feature representation word set of the current source code file.

The similarity of source code feature representations is calculated by comparing the coverage of the variable intersection and the consistency of the occurrence frequencies and types of the variables; if the variable intersection exceeds a threshold of 70%, the current source code feature representations are considered similar.

S4: The similarity of the two source code tree structures to be analyzed is calculated, and the similarity is weighted differently according to the type of source code directory: a similar core source code directory has weight +2, a third-party package directory +1, and a system resource directory 0;

S5: The source code file at each end node of each tree is evaluated, wherein an identical source code file has weight +2 and a similar source code file feature representation has weight +1. Specifically:

S5.1: First, the node similarity weight of the source code file at each end node of each tree is calculated;

S5.2: Whether the file hashes are the same is judged; if so, the weight is +2, otherwise +0;

S5.3: Whether the file feature representations are similar is judged; if so, the weight is +1, otherwise +0;

S5.4: The weight is output.
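The end node weighting of steps S5.1 to S5.4 can be sketched in a few lines of Python. SHA-256 is an assumed choice of file hash, and the function names are hypothetical. The text does not say whether the two contributions are cumulative; this sketch simply adds them, so an identical file whose features also match scores 3.

```python
import hashlib


def file_hash(data):
    """Hash of a source file's bytes (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def leaf_weight(hash_a, hash_b, features_are_similar):
    """S5.2 and S5.3: +2 for identical hashes, +1 for similar features."""
    weight = 0
    if hash_a == hash_b:
        weight += 2
    if features_are_similar:
        weight += 1
    return weight  # S5.4: output the weight
```

For example, two files with different bytes but a similar feature representation would score 1, which lets lightly obfuscated files still contribute to the tree similarity.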

S6: The similarity of the two trees is calculated by bidirectional comparison and averaging to generate the source code tree similarity. Specifically, s1 is obtained as the coverage of tree A in tree B, s2 is obtained as the coverage of tree B in tree A, and the directory structure similarity S is finally calculated and output as (s1 + s2)/2, as shown in formula (1);

$S = \frac{s_1 + s_2}{2}$    formula (1)
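The bidirectional comparison of step S6 can be sketched as follows in Python. The text does not define coverage precisely, so this sketch assumes it is the weighted share of one tree's directories that also appear in the other tree, using the directory weights of step S4; end node file weights are left out for brevity. Each tree is represented here as a plain mapping from directory path to directory type, which is a hypothetical data layout.

```python
# Directory weights from step S4.
DIR_WEIGHT = {"core": 2, "third_party": 1, "system": 0}


def coverage(tree_a, tree_b):
    """Weighted share of tree_a's directories that also appear in tree_b."""
    total = sum(DIR_WEIGHT[kind] for kind in tree_a.values())
    if total == 0:
        return 0.0
    hit = sum(DIR_WEIGHT[kind] for path, kind in tree_a.items() if path in tree_b)
    return hit / total


def tree_similarity(tree_a, tree_b):
    """Average of the two directional coverages, (s1 + s2) / 2."""
    s1 = coverage(tree_a, tree_b)  # coverage of tree A in tree B
    s2 = coverage(tree_b, tree_a)  # coverage of tree B in tree A
    return (s1 + s2) / 2
```

Averaging both directions keeps a small application that is fully contained in a much larger one from scoring as a perfect match.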

S7: The similarity of the AndroidManifest files and of the localized language configurations of the two APPs is analyzed with the SimHash algorithm, and the similarity ratio is output by calculating the Hamming distance. If the two input APPs are A and B, the AndroidManifest files are denoted C (Config), i.e. C_a and C_b, and the localized language configurations are denoted L (Language), i.e. L_a and L_b. The similarity attributes S_C and S_L are output by comparing the AndroidManifest files and the language configuration files of the two APPs, as shown in formulas (2) and (3);

$S_C = \mathrm{similarity}(\mathrm{simhash}(C_a), \mathrm{simhash}(C_b))$    formula (2)

$S_L = \mathrm{similarity}(\mathrm{simhash}(L_a), \mathrm{simhash}(L_b))$    formula (3)

S8: Finally, the tree structure similarity, the AndroidManifest similarity and the localized language configuration similarity are combined in the ratio x : y : z, where x, y and z are the weight coefficients of the three similarities and are set according to the importance and contribution of each similarity result in the final similarity calculation. The final APP similarity S is calculated by weighted summation, as shown in formula (4);
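The weighted summation of step S8 might look like the following Python sketch. Formula (4) is not reproduced in the text, so the normalization by x + y + z is an assumption made here so that the result stays in the range 0 to 1 whenever the three inputs do; the function and parameter names are hypothetical.

```python
def app_similarity(s_tree, s_c, s_l, x, y, z):
    """Combine the three similarities in the ratio x : y : z."""
    total = x + y + z
    if total <= 0:
        raise ValueError("weight coefficients must sum to a positive value")
    return (x * s_tree + y * s_c + z * s_l) / total
```

With the ratio 2 : 1 : 1, for instance, a tree similarity of 0.5, a manifest similarity of 1.0 and a language configuration similarity of 0.0 combine to 0.5.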

Further, the present invention provides a computer-readable storage medium storing a computer program which, when executed by a main controller, implements the method described in any one of the above.

Compared with the prior art, the invention has the beneficial effects that:

1. By extracting the AndroidManifest file of the APK, constructing the directory structure and source code file map of the APK package, optimizing the analysis process and increasing the weight of specific items, the invention reduces the resources and time required for analysis, improves the accuracy of source code similarity analysis, and achieves high-performance analysis in large-scale APK data analysis scenarios.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some examples of the present invention and should therefore not be regarded as limiting its scope; a person skilled in the art may obtain other related drawings from these drawings without inventive effort.

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a flow chart of the present invention for computing end nodes of each tree;

FIG. 3 is a code operation diagram of the present invention for obtaining variable names, types, and occurrence frequencies of a current source code file.

Detailed Description

To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, embodiments of the present invention. Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention as claimed, but is merely representative of selected embodiments. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without inventive effort fall within the scope of protection of the invention.

Referring to FIGS. 1-3, a method for mass APK source code feature extraction and similarity analysis comprises the following steps. S1: First, two APK files are input; the AndroidManifest file and the localized language configuration files of each APK package are extracted by source code analysis and decompilation, and the SMALI or JAVA source code is extracted. To extract the AndroidManifest file, the APK is decompiled with existing APK analysis tools such as apktool and jadx; if an exception occurs during decompilation, the APK information is extracted by decompressing the package and parsing it according to the Android package structure specification. The decompiled smali source code is finally output and the AndroidManifest file is obtained.

S2: The APK core source code directory, the third-party package directory and the system resource directory are identified by a package name index, a startup class index and fixed directory identification, and a source code tree is generated. A directory feature set is summarized and constructed from the default source code file organization of mainstream IDEs such as Android Studio and from community conventions. The core code file directory is resolved layer by layer through the package name and the startup class structure, that is, through the naming pattern of the package name and the location of the startup class;

The feature representation of the source code is obtained by collecting the variables and attributes in the smali or java source code and summarizing three elements of each collected variable, namely its name, type and occurrence frequency, to form the feature representation word set of the current source code file. Feature representation data samples are shown in Table 1:

Table 1: Feature representation data samples

S5.4: The weight is output.

S7: The similarity of the AndroidManifest files and of the localized language configurations of the two APPs is analyzed with the SimHash algorithm, and the similarity ratio is output by calculating the Hamming distance. If the two input APPs are A and B, the AndroidManifest files are denoted C (Config), i.e. C_a and C_b, and the localized language configurations are denoted L (Language), i.e. L_a and L_b. The similarity attributes S_C and S_L are output by comparing the AndroidManifest files and the language configuration files of the two APPs, as shown in formulas (2) and (3);

$S_C = \mathrm{similarity}(\mathrm{simhash}(C_a), \mathrm{simhash}(C_b))$    formula (2)

$S_L = \mathrm{similarity}(\mathrm{simhash}(L_a), \mathrm{simhash}(L_b))$    formula (3)

In this embodiment, the present invention provides a computer-readable storage medium storing a computer program which, when executed by a main controller, implements the method described in any one of the above.

The above description covers only the preferred embodiments of the present invention and is not intended to limit the present invention; various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall be included in the scope of protection of the present invention.

Claims

1. A method for mass APK source code feature extraction and similarity analysis, characterized by comprising the following steps:

S1: first, two APK files are input; the AndroidManifest file and the localized language configuration files of each APK package are extracted by source code analysis and decompilation, and the SMALI or JAVA source code is extracted;

S2: the APK core source code directory, the third-party package directory and the system resource directory are identified by a package name index, a startup class index and fixed directory identification, and a source code tree is generated;

S3: the files in the core source code directory are analyzed, a file hash is calculated, and the string declaration feature representation in each source code file is extracted as a weighting feature;

S5: the source code file at each end node of each tree is evaluated, wherein an identical source code file has weight +2 and a similar source code file feature representation has weight +1;

S7: the similarity of the AndroidManifest files and of the localized language configurations of the two APPs is analyzed with the SimHash algorithm, the similarity ratio is output by calculating the Hamming distance, and the similarity attributes S_C and S_L are output, as shown in formulas (2) and (3);

$S_C = \mathrm{similarity}(\mathrm{simhash}(C_a), \mathrm{simhash}(C_b))$    formula (2)

$S_L = \mathrm{similarity}(\mathrm{simhash}(L_a), \mathrm{simhash}(L_b))$    formula (3)

S8: finally, the tree structure similarity, the AndroidManifest similarity and the localized language configuration similarity are weighted and summed in the ratio x : y : z to calculate the output similarity, where x, y and z are the weight coefficients of the tree structure similarity, the AndroidManifest similarity and the localized language configuration similarity respectively; the final APP similarity S is calculated by the weighted summation, as shown in formula (4);

2. The method for mass APK source code feature extraction and similarity analysis according to claim 1, wherein step S5 specifically comprises the following steps:

S5.1: first, the node similarity weight of the source code file at each end node of each tree is calculated;

S5.2: whether the file hashes are the same is judged; if so, the weight is +2, otherwise +0;

S5.3: whether the file feature representations are similar is judged; if so, the weight is +1, otherwise +0;

S5.4: the weight is output.

3. The method for mass APK source code feature extraction and similarity analysis according to claim 1, wherein in step S1, to extract the AndroidManifest file of the APK package by source code analysis and decompilation, the APK is decompiled with existing APK analysis tools such as apktool and jadx; if an exception occurs during decompilation, the APK information is extracted by decompressing the package and parsing it according to the Android package structure specification; and the decompiled smali source code is finally output and the AndroidManifest file is obtained.

4. The method for mass APK source code feature extraction and similarity analysis according to claim 1, wherein in step S2, a directory feature set is summarized and constructed from the default source code file organization of mainstream IDEs such as Android Studio and from community conventions; and the core code file directory is resolved layer by layer through the package name and the startup class structure, that is, through the naming pattern of the package name and the location of the startup class.

5. The method for mass APK source code feature extraction and similarity analysis according to claim 1, wherein the AndroidManifest file includes the APP name, package name, permissions, attributes and service declarations, and step S3 specifically comprises the following steps:

6. The method for mass APK source code feature extraction and similarity analysis according to claim 1, wherein the feature representation of the source code is obtained by collecting the variables and attributes in the smali or java source code and summarizing three elements of each collected variable, namely its name, type and occurrence frequency, to form the feature representation word set of the current source code file.

7. The method for mass APK source code feature extraction and similarity analysis according to claim 6, wherein the similarity of source code feature representations is calculated by comparing the coverage of the variable intersection and the consistency of the occurrence frequencies and types of the variables, and if the variable intersection exceeds a threshold of 70%, the current source code feature representations are considered similar.

8. A computer-readable storage medium storing a computer program which, when executed by a main controller, implements the method of any one of claims 1-7.