CN110610084A - Dex file-based sample maliciousness determination method and related device - Google Patents

Dex file-based sample maliciousness determination method and related device

Info

Publication number
CN110610084A
Authority
CN
China
Prior art keywords
sample
character string
determining
dex file
length
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810617936.7A
Other languages
Chinese (zh)
Other versions
CN110610084B (en)
Inventor
严丽芳
高坤
邰靖宇
潘宣辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Antian Information Technology Co Ltd
Original Assignee
Wuhan Antian Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Antian Information Technology Co Ltd
Priority to CN201810617936.7A
Publication of CN110610084A
Application granted
Publication of CN110610084B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Dex file-based sample maliciousness determination method and apparatus, a computer device, and a computer storage medium, and relates to the technical field of computer network security. The method comprises: determining a string length vector of the Dex file of a first sample, where the elements of the vector are, in order, the lengths of the strings in the Dex file, the length of the vector is the number of strings contained in the Dex file, and the first sample is an unpacked sample or a sample processed by a packing tool that is not well known; determining candidate reference samples; determining the similarity between each candidate reference sample and the sample to be determined; and determining the maliciousness of the sample to be determined with a k-nearest neighbor algorithm. The invention exploits the fact that obfuscation changes package names and class names but leaves the string length vector unchanged, so detection resists obfuscation and is more accurate, and the k-nearest neighbor algorithm further supports the accuracy of the result.

Description

Dex file-based sample maliciousness determination method and related device
Technical Field
The invention relates to the technical field of computer network security, and in particular to a Dex file-based sample maliciousness determination method and apparatus, a computer device, and a computer storage medium.
Background
With the spread of mobile devices, the public pays increasing attention to mobile information security. Mobile malicious code is a major mobile threat that seriously endangers the assets and data of mobile users, so there is a practical need to determine whether a sample is malicious.
Existing sample maliciousness determination methods compare only the class names and method names of a sample against a library of known malicious samples, and judge maliciousness from the resulting similarity. However, much malicious code is built to evade detection: authors typically protect their code with obfuscation tools, and commonly use batch sample generation tools to produce large numbers of applications whose structure and function are unchanged but whose MD5 values (a file signature, comparable to an identity card number) differ. The generated samples are in fact of the same type, but because of obfuscation their class names, method names, and so on no longer match, which interferes with the determination. For example, suppose that before obfuscation the package name is happy and the class name is time, and the malicious code author obfuscates the package name happy into aaaaa and the class name time into cccc. Existing detection matches on the class name and the package name, concludes that the sample does not belong to the malicious family whose package name is happy, and so reaches a wrong determination.
Disclosure of Invention
Embodiments of the invention provide a Dex file-based sample maliciousness determination method and apparatus, a computer device, and a computer storage medium, so that the determination is not affected by obfuscation and the result is more accurate.
In a first aspect, an embodiment of the present invention provides a sample maliciousness determination method based on a Dex file.
Specifically, the sample maliciousness determination method includes:
determining a string length vector of the Dex file of a first sample, wherein the elements of the string length vector are, in order, the lengths of the strings in the Dex file, and the length of the string length vector is the number of strings contained in the Dex file, and wherein the first sample is an unpacked sample or a sample processed by a packing tool that is not well known;
determining candidate reference samples, wherein a candidate reference sample is a sample in a known sample library whose string length vector has the same length as the string length vector of the sample to be determined;
determining the similarity between each candidate reference sample and the sample to be determined according to the string length vector of each candidate reference sample and the string length vector of the sample to be determined;
and determining the maliciousness of the sample to be determined by a k-nearest neighbor algorithm according to the similarity between each candidate reference sample and the sample to be determined.
In a second aspect, an embodiment of the present invention provides a sample maliciousness determination apparatus based on a Dex file.
Specifically, the sample maliciousness determination apparatus includes:
a string length vector determining module, configured to determine a string length vector of the Dex file of a first sample, wherein the elements of the string length vector are, in order, the lengths of the strings in the Dex file, and the length of the string length vector is the number of strings contained in the Dex file, and wherein the first sample is an unpacked sample or a sample processed by a packing tool that is not well known;
a candidate reference sample determining module, configured to determine candidate reference samples, wherein a candidate reference sample is a sample in a known sample library whose string length vector has the same length as the string length vector of the sample to be determined;
a similarity determining module, configured to determine the similarity between each candidate reference sample and the sample to be determined according to the string length vector of each candidate reference sample and the string length vector of the sample to be determined;
and a maliciousness determining module, configured to determine the maliciousness of the sample to be determined by a k-nearest neighbor algorithm according to the similarity between each candidate reference sample and the sample to be determined.
In a third aspect, an embodiment of the present invention provides a computer device.
Specifically, the computer device includes:
a processor; and
a memory for storing a computer program,
wherein the processor is configured to execute the computer program stored in the memory to implement the Dex file-based sample maliciousness determination method according to the first aspect.
In a fourth aspect, embodiments of the present invention provide a computer storage medium.
Specifically, the computer storage medium stores a computer program which, when executed by a processor, implements the Dex file-based sample maliciousness determination method according to the first aspect.
When determining sample maliciousness, the sample to be determined is first matched against a known sample library, and known samples whose string length vector differs in length from that of the sample to be determined are filtered out. A k-nearest neighbor algorithm then finds the K nearest samples among the known samples whose string length vectors have the same length, and the sample to be determined is assigned to whichever category most of those K samples belong to. Compared with existing schemes, which rely on matching class names and method names, this scheme resists obfuscation better and improves the accuracy of the determination. Take the example from the background: before obfuscation the package name is happy and the class name is time, and the malicious code author obfuscates happy into aaaaa and time into cccc. Existing detection matches on the class name and the package name, concludes that the sample does not belong to the malicious family with package name happy and class name time, and reaches a wrong determination. The present invention instead compares string length vectors: the vector of the sample is [5,4] before obfuscation, and after obfuscation, although the package name has become aaaaa and the class name cccc, the vector is still [5,4]. The result is therefore unaffected by obfuscation, and accuracy is higher.
In addition, the string length vector of the sample is matched for similarity against the string length vectors of known samples using the k-nearest neighbor algorithm, which further supports the accuracy of the determination and can effectively reduce the manual analysis workload.
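The happy/time example can be reproduced in a few lines. This is a minimal sketch for illustration only; the helper name `string_length_vector` is hypothetical and the two-element lists stand in for the full string list of a Dex file.

```python
def string_length_vector(strings):
    """Return the length of each string, in the order given."""
    return [len(s) for s in strings]

original = ["happy", "time"]     # package name and class name before obfuscation
obfuscated = ["aaaaa", "cccc"]   # the same names after obfuscation

# Matching on names fails after obfuscation, but the length vectors still agree.
names_match = original == obfuscated
vectors_match = string_length_vector(original) == string_length_vector(obfuscated)
print(names_match, vectors_match)
```

Both vectors are [5, 4], which is why a comparison based on lengths is not disturbed by renaming that preserves string lengths.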
These and other aspects of the invention are apparent from and will be elucidated with reference to the embodiments described hereinafter.
Drawings
To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a Dex file-based sample maliciousness determination method according to embodiment 1 of the present invention;
FIG. 1A is a flowchart of determining the string length vector in the method shown in FIG. 1;
FIG. 1B is a detailed flowchart of determining the number of strings and the string index base address in the method shown in FIG. 1A;
FIG. 2 is a flowchart of a Dex file-based sample maliciousness determination method according to embodiment 2 of the present invention;
FIG. 2A is a diagram showing part of the content of the Dex file of the sample to be determined in the method shown in FIG. 2;
FIG. 2B is a diagram showing another part of the content of the Dex file;
FIG. 3 is a schematic diagram of a Dex file-based sample maliciousness determination apparatus according to embodiment 1 of the present invention;
FIG. 3A is a schematic diagram of the string length vector determining module of the apparatus shown in FIG. 3;
FIG. 3B is a schematic diagram of the string index address determining submodule in the string length vector determining module shown in FIG. 3A.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.
In some of the flows described in the present specification and claims and in the above figures, a number of operations are included that occur in a particular order, but it should be clearly understood that these operations may be performed out of order or in parallel as they occur herein, with the reference numbers such as 102, 104, etc. merely being used to distinguish between the various operations, and the reference numbers themselves do not represent any order of performance. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the examples of the present invention, are within the scope of the present invention.
[ METHOD EXAMPLE 1]
FIG. 1 is a flowchart of a Dex file-based sample maliciousness determination method according to embodiment 1 of the present invention. Referring to FIG. 1, in this embodiment the method includes:
step S11, determining a string length vector of the Dex file of a first sample, wherein the elements of the string length vector are, in order, the lengths of the strings in the Dex file, and the length of the string length vector is the number of strings contained in the Dex file, and wherein the first sample is an unpacked sample or a sample processed by a packing tool that is not well known;
specifically, the Dex file of the first sample is a class.
It should be noted that "first sample" refers generically to any sample; the candidate reference samples, the known samples, and the sample to be determined in the following steps may each be a "first sample".
The first sample may be an unpacked sample or a sample processed by a packing tool that is not well known. Well-known packing tools here are, for example, those provided by major vendors such as Baidu or Tencent. In this embodiment, the string length vector is used to determine maliciousness only for samples that are unpacked or processed by lesser-known packing tools. Samples processed by a well-known packing tool share the same, generic string length vector and are therefore excluded, which safeguards the accuracy of the determination.
Step S12, determining candidate reference samples, wherein a candidate reference sample is a sample in the known sample library whose string length vector has the same length as the string length vector of the sample to be determined;
specifically, for every known sample in the known sample library, it is determined whether the length of the string length vector of its Dex file equals the length of the string length vector of the Dex file of the sample to be determined; if so, the known sample is a candidate reference sample, and if not, it is not.
Step S13, determining the similarity between each candidate reference sample and the sample to be judged according to the character string length vector of each candidate reference sample and the character string length vector of the sample to be judged;
preferably, the similarity between each candidate reference sample and the sample to be determined may be determined using any one of Euclidean distance, Manhattan distance, Chebyshev distance, Minkowski distance, normalized Euclidean distance, Mahalanobis distance, cosine of the included angle, Hamming distance, and Jaccard distance.
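A few of the listed measures can be sketched for two string length vectors of equal length. The function names are illustrative, and the mapping from a distance to a similarity score in `similarity_from_distance` is an assumption of this sketch; the source does not specify how a distance is turned into a similarity.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """Manhattan distance: sum of absolute element-wise differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def chebyshev(u, v):
    """Chebyshev distance: largest absolute element-wise difference."""
    return max(abs(a - b) for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (both must be non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def similarity_from_distance(d):
    """Assumed mapping (not from the source): distance 0 gives similarity 1."""
    return 1.0 / (1.0 + d)

u = [6, 1, 4, 5]
v = [6, 2, 4, 1]
print(manhattan(u, v), chebyshev(u, v), euclidean(u, v))
```

Because candidate reference samples are pre-filtered to have the same vector length as the sample to be determined, element-wise measures like these are always well defined.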
Step S14, determining the maliciousness of the sample to be determined by a k-nearest neighbor algorithm according to the similarity between each candidate reference sample and the sample to be determined.
It should be noted that the k-nearest neighbor algorithm is a machine learning algorithm based on the following idea: if most of the k known samples most similar to a given sample in the feature space (that is, its nearest neighbors in the feature space) belong to a certain class, then the sample also belongs to that class. In other words, given a training data set and a new input instance, the K instances closest to the input instance are found in the training data set, and the input instance is assigned to the class to which the majority of those K instances belong.
Specifically, in this embodiment, "nearest" in the k-nearest neighbor algorithm means "most similar". According to the similarity between each candidate reference sample and the sample to be determined, the K candidate reference samples with the highest similarity values are selected, and the sample to be determined is assigned to the category to which most of those K candidate reference samples belong.
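The selection and majority vote described above can be sketched as follows. The function name `knn_label`, the category labels, and the similarity values are hypothetical and chosen only for illustration.

```python
from collections import Counter

def knn_label(candidates, k):
    """candidates: list of (similarity, label) pairs.

    Keeps the k pairs with the highest similarity and returns the label
    that occurs most often among them.
    """
    top_k = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    votes = Counter(label for _, label in top_k)
    return votes.most_common(1)[0][0]

# Hypothetical similarity scores for five candidate reference samples.
candidates = [
    (0.91, "malicious"),
    (0.40, "benign"),
    (0.88, "malicious"),
    (0.86, "benign"),
    (0.10, "benign"),
]
print(knn_label(candidates, k=3))
```

With two categories, an odd K avoids tied votes; the source does not say how ties are resolved, so this sketch simply returns whichever label `Counter` reports first.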
With the above technical solution, when determining sample maliciousness, the sample to be determined is first matched against a known sample library, and known samples whose string length vector differs in length from that of the sample to be determined are filtered out. A k-nearest neighbor algorithm then finds the K nearest samples among the known samples whose string length vectors have the same length, and the sample to be determined is assigned to whichever category most of those K samples belong to. Existing schemes determine maliciousness by matching class names and method names; this scheme instead exploits the fact that obfuscation changes package names and class names but leaves the string length vector unchanged, so the result is unaffected by obfuscation, obfuscation is better resisted, and detection accuracy is higher. In addition, the string length vector of the sample is matched for similarity against the string length vectors of known samples using the k-nearest neighbor algorithm, which further supports the accuracy of the determination and can effectively reduce the manual analysis workload.
Further, referring to fig. 1A, step S11 may include:
step S111, determining the index address of each string in the string list according to the number of strings contained in the Dex file of the first sample and the string index base address of the string list;
specifically, for a given sample, each index entry occupies 4 bytes. Starting from the position indicated by the string index base address of the string list in the sample's Dex file, the first 4 bytes give the index address of the first string in the string list, the next 4 bytes give the index address of the second string, and so on until the index address of the last string is obtained.
Step S112, determining the length of each character string according to the index address of each character string;
it should be noted that the first byte of each string represents the length of that string.
Thus, the index address of a string gives the position where the string is actually stored. The byte at that position is the first byte of the string, and converting its hexadecimal value to decimal gives the length of the string. The length of every string is obtained in the same way.
In step S113, a string length vector is formed according to the length of each string.
Specifically, the string length vector consists of the lengths of the strings, arranged in the same order as the strings appear in the string list.
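Steps S111 to S113 can be sketched as a single loop over the index entries. This sketch assumes little-endian 4-byte index entries and, following the description, reads only the first byte of each string as its length. In the published Dex format that length is ULEB128-encoded, so a single byte is only sufficient for lengths below 128. The function name and the synthetic buffer are hypothetical.

```python
import struct

def string_length_vector(dex, count, ids_off):
    """Build the string length vector from raw Dex bytes.

    dex     -- the file contents as bytes
    count   -- number of strings in the string list
    ids_off -- string index base address of the string list
    """
    lengths = []
    for i in range(count):
        # S111: each index entry is 4 bytes and holds the string's address.
        (data_off,) = struct.unpack_from("<I", dex, ids_off + 4 * i)
        # S112: the first byte at that address is taken as the string length.
        lengths.append(dex[data_off])
    # S113: lengths are kept in string-list order.
    return lengths

# Synthetic buffer: two index entries at offset 0, string data at 8 and 16.
index_table = struct.pack("<II", 8, 16)
string_data = bytes([6]) + b"abcdef" + b"\x00" + bytes([1]) + b"V" + b"\x00"
buf = index_table + string_data
print(string_length_vector(buf, count=2, ids_off=0))
```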
Further, referring to FIG. 1B, step S111 also includes:
step S111a, determining the number of strings contained in the Dex file of the first sample according to the string_ids_size field of the Dex header in the Dex file of the first sample;
step S111b, determining the string index base address of the string list in the Dex file of the first sample according to the string_ids_off field of the Dex header in the Dex file of the first sample.
The Dex file of each sample comprises 8 parts: Dex Header, String Table, Type Table, Proto Table, Field Table, Method Table, Class Def Table, and Data Section. The Dex Header contains the string_ids_size field and the string_ids_off field; see Table 1:
TABLE 1
As can be seen from Table 1, in the Dex header of the Dex file of every sample, the string_ids_size field is at offset 0x38 and is 4 bytes long, and the string_ids_off field is at offset 0x3C and is 4 bytes long.
Then, for a given sample, the start of string_ids_size is located at offset 0x38 and 4 bytes are read; the resulting value is the number of strings in the string list of the sample's Dex file. Next, the start of string_ids_off is located at offset 0x3C and 4 bytes are read; the resulting value is the string index base address of the string list of the sample's Dex file.
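Reading the two header fields can be sketched with Python's `struct` module. The sketch assumes the fields are stored as little-endian unsigned 32-bit integers, which the text does not state explicitly; the function name and the synthetic header values are hypothetical.

```python
import struct

STRING_IDS_SIZE_OFFSET = 0x38  # string_ids_size: number of strings
STRING_IDS_OFF_OFFSET = 0x3C   # string_ids_off: string index base address

def read_string_table_header(dex):
    """Return (number of strings, string index base address)."""
    (count,) = struct.unpack_from("<I", dex, STRING_IDS_SIZE_OFFSET)
    (ids_off,) = struct.unpack_from("<I", dex, STRING_IDS_OFF_OFFSET)
    return count, ids_off

# Synthetic header for illustration: 3 strings, index table at 0x70.
header = bytearray(0x70)
struct.pack_into("<I", header, STRING_IDS_SIZE_OFFSET, 3)
struct.pack_into("<I", header, STRING_IDS_OFF_OFFSET, 0x70)
print(read_string_table_header(bytes(header)))
```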
[ METHOD EXAMPLE 2]
FIG. 2 is a flowchart of a Dex file-based sample maliciousness determination method according to embodiment 2 of the present invention. Referring to FIG. 2, in this embodiment the method includes:
step S21, determining the number of strings in the Dex file of sample A to be determined and the string index base address of the string list according to the offset and length of the string_ids_size and string_ids_off fields of the Dex header in the Dex file of sample A to be determined;
specifically, in the Dex header of the Dex file of sample A to be determined, the string_ids_size field is at offset 0x38 and is 4 bytes long, and the string_ids_off field is at offset 0x3C and is 4 bytes long. Referring to FIG. 2A, the start of the string_ids_size field is located at offset 0x38 (arrow 1a in FIG. 2A) and 4 bytes are read, giving 0x1F (reference numeral 11 in FIG. 2A). The value 0x1F is the number of strings in the string list of the Dex file of sample A, which is 31 in decimal; that is, the string list contains 31 strings. Next, the start of the string_ids_off field is located at offset 0x3C (arrow 1b in FIG. 2A) and 4 bytes are read, giving 0x70 (reference numeral 12 in FIG. 2A), so the string index base address of the string list of the Dex file of sample A is 0x70.
Step S22, determining the index address of each string in the string list according to the string index base address of the string list and the number of strings in the Dex file of sample A to be determined;
in the Dex file of sample A to be determined, the string index base address of the string list is 0x70. Referring to FIG. 2A, the position indicated by the base address 0x70 is located first (arrow 1c in FIG. 2A). The first 4 bytes from that position give the index address of the first string in the string list, 0x234 (reference numeral 13 in FIG. 2A); the next 4 bytes give the index address of the second string, 0x23C (reference numeral 14 in FIG. 2A); and the 4 bytes after that give the index address of the third string, 0x23F (reference numeral 15 in FIG. 2A). Because the string list of the Dex file of sample A contains 31 strings, the index addresses of all 31 strings are obtained in turn in this way.
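Since each index entry occupies 4 bytes, the position of the i-th entry (counting from 0) is the base address plus 4 times i. A short sketch of that arithmetic for sample A, using the base address 0x70 and the count of 31 from the text; the variable names are illustrative.

```python
BASE = 0x70      # string index base address of sample A
COUNT = 31       # number of strings in sample A
ENTRY_SIZE = 4   # bytes per index entry

# Position of every index entry in the file.
entry_positions = [BASE + ENTRY_SIZE * i for i in range(COUNT)]

print([hex(p) for p in entry_positions[:3]])  # entries for strings 1 to 3
print(hex(entry_positions[-1]))               # entry for string 31
```

The entries at 0x70, 0x74, and 0x78 are the ones that hold the index addresses 0x234, 0x23C, and 0x23F in the example.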
Step S23, determining the length of each string according to the index address of each string in the string list of the Dex file of sample A to be determined, and forming the string length vector of sample A to be determined;
specifically, take the index address 0x234 of the first string. Referring to FIG. 2B, the position indicated by 0x234 is located (arrow 2a in FIG. 2B: row 0230, counting 4 bytes in), so the first string starts at the 5th byte of row 0230. That byte has the hexadecimal value 06 (reference numeral 21 in FIG. 2B), which is 6 in decimal, so the length of the first string is 6. Likewise, for the index address 0x23C of the second string, the position indicated by 0x23C is located (arrow 2b in FIG. 2B: row 0230, counting 12 bytes in), so the second string starts at the 13th byte of row 0230. That byte has the hexadecimal value 01 (reference numeral 22 in FIG. 2B), which is 1 in decimal, so the length of the second string is 1. The length of each string is obtained in turn to form the string length vector. In this embodiment, the string length vector of sample A to be determined is [6,1,4,5,2,3,6,25,33,31,25,17,26,50,18,18,18,7,12,1,2,1,19,6,7,6,11,8,5,6,12], and the length of the vector is 31, which equals the number of strings determined above.
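The numbers in this example can be cross-checked. The sketch below assumes, beyond what the text states, that strings are stored back to back in index order, that each character occupies one byte, and that each string is followed by a one-byte terminator, so a string of length n occupies n + 2 bytes. Under those assumptions the index addresses 0x234, 0x23C, and 0x23F agree with the lengths 6 and 1.

```python
def next_offset(offset, length):
    """Start of the following string: 1 length byte + data + 1 terminator."""
    return offset + 1 + length + 1

offsets = [0x234, 0x23C, 0x23F]   # index addresses from the example
vector = [6, 1, 4, 5, 2, 3, 6, 25, 33, 31, 25, 17, 26, 50, 18, 18, 18,
          7, 12, 1, 2, 1, 19, 6, 7, 6, 11, 8, 5, 6, 12]

# Each address should follow from the previous address and string length.
consistent = all(
    next_offset(off, n) == nxt
    for off, n, nxt in zip(offsets, vector, offsets[1:])
)
print(consistent, len(vector))
```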
Step S24, determining candidate reference samples, wherein a candidate reference sample is a known sample in the known sample library whose string length vector has the same length as the string length vector of the Dex file of the sample to be determined;
for example, assume that the known sample library includes known samples B, C, D, E, and F.
The string length vector of the Dex file of known sample B is [x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, x16, x17, x18, x19, x20, x21, x22, x23, x24, x25, x26, x27, x28, x29, x30, x31]; its length is 31, and the sample is malicious.
The string length vector of the Dex file of known sample C is [y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16, y17, y18, y19, y20, y21, y22, y23, y24, y25, y26, y27, y28, y29, y30, y31]; its length is 31, and the sample is malicious.
The string length vector of the Dex file of known sample D is [z1, z2, z3, z4, z5, z6, z7, z8, z9, z10, z11, z12, z13, z14, z15, z16, z17, z18, z19, z20, z21, z22, z23, z24, z25, z26, z27, z28, z29, z30, z31]; its length is 31, and the sample is malicious.
The string length vector of the Dex file of known sample E is [u1, u2, u3, u4, u5, u6, u7, u8, u9, u10, u11, u12, u13, u14, u15, u16, u17, u18, u19, u20, u21, u22, u23, u24, u25, u26, u27, u28, u29, u30, u31, u32, u33]; its length is 33, and the sample is non-malicious.
The string length vector of the Dex file of known sample F is [w1, w2, w3, w4, w5, w6, w7, w8, w9, w10, w11, w12, w13, w14, w15, w16, w17, w18, w19, w20, w21, w22, w23, w24, w25, w26, w27, w28, w29, w30, w31]; its length is 31, and the sample is malicious.
Since the string length vector of the Dex file of sample A to be determined has length 31, that of known sample E has length 33, and those of known samples B, C, D, and F have length 31, known samples B, C, D, and F are all candidate reference samples, and known sample E is filtered out.
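The filtering in this step depends only on vector lengths, so it can be sketched without the vector contents. The dictionary layout and the function name `select_candidates` are hypothetical; the lengths and labels are those given for known samples B to F.

```python
# Known sample library: name -> (length of string length vector, label).
library = {
    "B": (31, "malicious"),
    "C": (31, "malicious"),
    "D": (31, "malicious"),
    "E": (33, "non-malicious"),
    "F": (31, "malicious"),
}

def select_candidates(library, target_length):
    """Keep known samples whose vector length equals the target's."""
    return [name for name, (length, _label) in library.items()
            if length == target_length]

# Sample A to be determined has a string length vector of length 31.
candidates = select_candidates(library, 31)
print(candidates)
```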
Step S25, determining the similarity between each candidate reference sample and the sample to be determined according to the string length vector of each candidate reference sample and the string length vector of the sample to be determined;
and step S26, determining the maliciousness of the sample to be determined by a k-nearest neighbor algorithm according to the similarity between each candidate reference sample and the sample to be determined.
For example, assume K = 3 in the k-nearest neighbor algorithm. According to the similarity between each of the candidate reference samples B, C, D, and F and sample A to be determined, the 3 candidate reference samples with the highest similarity values are selected. Assuming these are candidate reference samples B, D, and F, sample A to be determined is malicious, because B, D, and F are all malicious.
[ DEVICE EXAMPLE 1]
Fig. 3 is a schematic diagram of a sample maliciousness determination apparatus based on a Dex file according to embodiment 1 of the present invention. Referring to fig. 3, in the present embodiment, the apparatus includes:
a character string length vector determining module 31, configured to determine a character string length vector of the Dex file of a first sample, where the elements of the character string length vector are, in order, the lengths of the character strings in the Dex file, and the length of the character string length vector is the number of character strings contained in the Dex file, and where the first sample is a sample without a shell or a sample processed by a shelling tool that is not well known;
a candidate reference sample determining module 32, configured to determine candidate reference samples, where a candidate reference sample is a sample in a known sample library whose string length vector has the same length as the string length vector of the sample to be determined;
a similarity determining module 33, configured to determine a similarity between each candidate reference sample and the sample to be determined according to the string length vector of each candidate reference sample and the string length vector of the sample to be determined;
and a maliciousness judging module 34, configured to judge the maliciousness of the sample to be determined through a k-nearest neighbor algorithm according to the similarity between each candidate reference sample and the sample to be determined.
According to this technical solution, when the maliciousness of a sample is determined, the sample to be determined is first matched against a known sample library, and known samples whose string length vectors differ in length from that of the sample to be determined are filtered out. The k nearest samples are then found among the known samples with the same vector length by the k-nearest neighbor algorithm, and the sample to be determined is assigned to the category to which most of these k samples belong. Existing sample maliciousness determination schemes rely on matching class names and method names. This solution instead relies on the fact that obfuscation changes package names and class names but leaves the string length vector unchanged, so the determination result is not affected by obfuscation, obfuscation is better resisted, and detection accuracy is higher. In addition, the string length vector of the sample is matched for similarity against the string length vectors of known samples according to the k-nearest neighbor algorithm, which further supports the accuracy of the maliciousness determination result and can effectively reduce the workload of manual analysis.
The first sample may be a sample without a shell, or a sample processed by a shelling tool that is not well known. Here, a well-known shelling tool is, for example, one provided by a well-known vendor such as Baidu or Tencent. Because this embodiment performs the maliciousness determination using the string length vector, samples processed by well-known shelling tools, whose string length vectors are identical and generic, are excluded, which ensures the accuracy of the maliciousness determination.
Further, referring to fig. 3A, the string length vector determination module 31 includes:
the character string index address determining submodule 311 is configured to determine an index address of each character string in the character string list according to the number of character strings included in the Dex file of the first sample and a character string index base address of the character string list;
a character string length determining submodule 312, configured to determine a length of each character string according to the index address of each character string;
the string length vector determining sub-module 313 is configured to form a string length vector according to the length of each string.
Further, referring to fig. 3B, the string index address determining submodule 311 includes:
a string number determining unit 311a, configured to determine, according to the string_ids_size field of the Dex header portion in the Dex file of the first sample, the number of strings included in the Dex file of the first sample;
a string index base address determining unit 311b, configured to determine a string index base address of the string list in the Dex file of the first sample according to the string_ids_off field of the Dex header portion in the Dex file of the first sample.
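The extraction performed by submodules 311 to 313 can be sketched as follows. The sketch relies on the published Dex layout: `string_ids_size` and `string_ids_off` are 32-bit little-endian fields at header offsets 0x38 and 0x3C, each string index entry is a 4-byte offset, and each string data item begins with a ULEB128-encoded length. It is an assumption of this sketch that the "length of each string" is that ULEB128 value (the length in UTF-16 code units); the embodiment does not spell this out. The function names and the `build_toy_dex` helper are hypothetical, and the toy buffer is not a valid Dex file (it has no magic number or checksum).

```python
import struct
from typing import List, Tuple

STRING_IDS_SIZE_OFF = 0x38  # header field: number of strings
STRING_IDS_OFF_OFF = 0x3C   # header field: base address of the string index list


def read_uleb128(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode an unsigned LEB128 value; return (value, next position)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def string_length_vector(dex: bytes) -> List[int]:
    """Build the string length vector from raw Dex bytes."""
    count, base = struct.unpack_from("<II", dex, STRING_IDS_SIZE_OFF)
    lengths = []
    for i in range(count):
        # Index address of string i = base address + 4 * i.
        (data_off,) = struct.unpack_from("<I", dex, base + 4 * i)
        # The string data starts with its length encoded as ULEB128.
        size, _ = read_uleb128(dex, data_off)
        lengths.append(size)
    return lengths


def build_toy_dex(strings: List[bytes]) -> bytes:
    """Hypothetical helper: lay out ASCII strings (each shorter than 128
    bytes) behind a zeroed 0x70-byte header so the parser can be exercised."""
    header = bytearray(0x70)
    ids_off = len(header)
    data_start = ids_off + 4 * len(strings)
    struct.pack_into("<II", header, STRING_IDS_SIZE_OFF, len(strings), ids_off)
    ids, data = bytearray(), bytearray()
    for s in strings:
        ids += struct.pack("<I", data_start + len(data))
        data += bytes([len(s)]) + s + b"\x00"  # length byte, bytes, terminator
    return bytes(header + ids + data)


toy = build_toy_dex([b"a", b"hello", b"<init>"])
print(string_length_vector(toy))  # [1, 5, 6]
```

A production parser would also validate the magic number, the file size, and every offset before dereferencing it, since malicious samples may carry malformed headers.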
Optionally, the similarity determining module is specifically configured to determine the similarity between each candidate reference sample and the sample to be determined by any one of the following calculation methods: Euclidean distance, Manhattan distance, Chebyshev distance, Minkowski distance, normalized Euclidean distance, Mahalanobis distance, cosine of the included angle, Hamming distance, and Jaccard distance.
An embodiment of the present invention further provides a computer device, including a processor and a memory, where the memory stores a computer program, and the processor is configured to execute the computer program stored in the memory to implement any one of the above-mentioned methods for determining sample maliciousness based on a Dex file, or to implement the processing performed by any one of the above-mentioned devices for determining sample maliciousness based on a Dex file.
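Several of the listed measures can be written in a few lines for equal-length vectors, as in the sketch below. The function names and the two short vectors are hypothetical. For the distance measures a smaller value means a higher similarity, whereas for the cosine of the included angle a larger value means a higher similarity, so the ranking direction has to follow the measure chosen.

```python
import math
from typing import Sequence


def manhattan(u: Sequence[float], v: Sequence[float]) -> float:
    """Sum of absolute element-wise differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def chebyshev(u: Sequence[float], v: Sequence[float]) -> float:
    """Largest absolute element-wise difference."""
    return max(abs(a - b) for a, b in zip(u, v))


def minkowski(u: Sequence[float], v: Sequence[float], p: float) -> float:
    """Minkowski distance; p = 1 gives Manhattan, p = 2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v (both must be non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


u = [1, 2, 3]
v = [4, 6, 3]
print(manhattan(u, v))      # 7
print(chebyshev(u, v))      # 4
print(minkowski(u, v, 2))   # 5.0
```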
Furthermore, an embodiment of the present invention further provides a computer storage medium, in which a computer program is stored, where the computer program, when executed by a processor, implements any one of the above-mentioned methods for determining sample maliciousness based on a Dex file, or implements processing performed by any one of the above-mentioned devices for determining sample maliciousness based on a Dex file.
Compared with the strategy of matching class names and method names in existing sample maliciousness determination schemes, the determination results of the storage medium and the computer device are not affected by obfuscation, so obfuscation is better resisted and detection accuracy is higher. In addition, the string length vector of the sample is matched for similarity against the string length vectors of known samples according to a k-nearest neighbor algorithm, which further supports the accuracy of the maliciousness determination result and can effectively reduce the workload of manual analysis.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts in the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus and device embodiments, since they are substantially similar to the method embodiments, they are described relatively simply, and reference may be made to some descriptions of the method embodiments for relevant points.
Those skilled in the art will clearly understand that the present invention may be implemented entirely in software, or by a combination of software and a hardware platform. With this understanding in mind, all or part of the technical solutions of the present invention that contribute over the background art can be embodied in the form of a software product, which can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes instructions for causing a computer device (which can be a personal computer, a server, a smart phone, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present invention.
As used herein, the term "software" or the like refers to any type of computer code or set of computer-executable instructions in a general sense that is executed to program a computer or other processor to perform various aspects of the present inventive concepts as discussed above. Furthermore, it should be noted that according to one aspect of the embodiment, one or more computer programs implementing the method of the present invention when executed do not need to be on one computer or processor, but may be distributed in modules in multiple computers or processors to execute various aspects of the present invention.
Computer-executable instructions may take many forms, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. In particular, the operations performed by the program modules may be combined or divided as desired in various embodiments.
Also, technical solutions of the present invention may be embodied as a method, and at least one example of the method has been provided. The actions may be performed in any suitable order and may be presented as part of the method. Thus, embodiments may be configured such that acts may be performed in an order different than illustrated, which may include performing some acts simultaneously (although in the illustrated embodiments, the acts are sequential).
In various embodiments of the invention, the described features, architectures or functions may be combined in any combination in one or more embodiments, where well-known processes of operation, program modules, elements and their interconnection, linking, communication or operation with each other are not shown or described in detail. It will be understood by those skilled in the art that the various embodiments described below are illustrative only and are not intended to limit the scope of the present invention. Those skilled in the art will also readily appreciate that the program modules, elements, or steps of the various embodiments described herein and illustrated in the accompanying drawings may be combined and arranged in a wide variety of different configurations.
Technical terms not specifically described in the present specification should be construed in the broadest sense in the art unless otherwise specifically indicated. The definitions given and used herein should be understood with reference to dictionaries, definitions in documents incorporated by reference, and/or their ordinary meanings. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
As used in the claims and in the specification above, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It is to be understood that, although the terms first, second, third, etc. may be used herein to describe various information and/or modules, these information and/or modules should not be limited by these terms. These terms are only used to distinguish one type of information and/or module from another. For example, a first information and/or module may also be referred to as a second information and/or module, and similarly, a second information and/or module may also be referred to as a first information and/or module without departing from the scope hereof. Additionally, the word "if" as used herein may, depending on context, be interpreted as "when", "upon", or "in response to a determination".
In the claims, as well as in the specification above, all transitional phrases such as "comprising," "having," "containing," "carrying," "involving," "consisting essentially of," and any other variations thereof, are to be understood to be open-ended, i.e., to mean including but not limited to, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The terms and expressions used in the specification of the present invention have been set forth for illustrative purposes only and are not meant to be limiting. It will be appreciated by those skilled in the art that changes could be made to the details of the above-described embodiments without departing from the underlying principles thereof. The scope of the invention is, therefore, indicated by the appended claims, in which all terms are intended to be interpreted in their broadest reasonable sense unless otherwise indicated.

Claims (10)

1. A sample maliciousness determination method based on a Dex file comprises the following steps:
determining a character string length vector of a Dex file of a first sample, wherein the elements of the character string length vector are, in order, the lengths of the character strings in the Dex file, and the length of the character string length vector is the number of character strings contained in the Dex file, and wherein the first sample is a sample without a shell or a sample processed by a shelling tool that is not well known;
determining candidate reference samples, wherein a candidate reference sample is a sample in a known sample library whose character string length vector has the same length as the character string length vector of a sample to be judged;
determining the similarity of each candidate reference sample and the sample to be judged according to the character string length vector of each candidate reference sample and the character string length vector of the sample to be judged;
and judging the maliciousness of the sample to be judged through a k-nearest neighbor algorithm according to the similarity of each candidate reference sample and the sample to be judged.
2. The method of claim 1, wherein the determining the character string length vector of the Dex file of the first sample comprises:
determining an index address of each character string in the character string list according to the number of the character strings contained in the Dex file of the first sample and the character string index base address of the character string list;
determining the length of each character string according to the index address of each character string;
and forming the character string length vector according to the length of each character string.
3. The Dex-file-based sample maliciousness determination method according to claim 2, further comprising:
determining the number of character strings contained in the Dex file of the first sample according to the string_ids_size field of the Dex header part in the Dex file of the first sample;
and determining a character string index base address of the character string list in the Dex file of the first sample according to the string_ids_off field of the Dex header part in the Dex file of the first sample.
4. A Dex-file-based sample maliciousness determination method according to any one of claims 1 to 3, wherein the similarity of each candidate reference sample to the sample to be judged is determined by any one of the following calculation methods: Euclidean distance, Manhattan distance, Chebyshev distance, Minkowski distance, normalized Euclidean distance, Mahalanobis distance, cosine of the included angle, Hamming distance, and Jaccard distance.
5. A sample maliciousness determination apparatus based on a Dex file, comprising:
a character string length vector determining module, configured to determine a character string length vector of a Dex file of a first sample, wherein the elements of the character string length vector are, in order, the lengths of the character strings in the Dex file, and the length of the character string length vector is the number of character strings contained in the Dex file, and wherein the first sample is a sample without a shell or a sample processed by a shelling tool that is not well known;
a candidate reference sample determining module, configured to determine candidate reference samples, wherein a candidate reference sample is a sample in a known sample library whose character string length vector has the same length as the character string length vector of a sample to be judged;
a similarity determining module, configured to determine the similarity between each candidate reference sample and the sample to be judged according to the character string length vector of each candidate reference sample and the character string length vector of the sample to be judged;
and a maliciousness judging module, configured to judge the maliciousness of the sample to be judged through a k-nearest neighbor algorithm according to the similarity of each candidate reference sample and the sample to be judged.
6. The Dex-file-based sample maliciousness determination apparatus according to claim 5, wherein the string-length-vector determination module includes:
the character string index address determining submodule is used for determining the index address of each character string in the character string list according to the number of the character strings contained in the Dex file of the first sample and the character string index base address of the character string list;
the character string length determining submodule is used for determining the length of each character string according to the index address of each character string;
and the character string length vector determining submodule is used for forming the character string length vector according to the length of each character string.
7. The Dex file-based sample maliciousness determination apparatus according to claim 6, wherein the character string index address determination submodule includes:
a string number determining unit, configured to determine, according to the string_ids_size field of the Dex header portion in the Dex file of the first sample, the number of strings included in the Dex file of the first sample;
and a character string index base address determining unit, configured to determine the character string index base address of the character string list in the Dex file of the first sample according to the string_ids_off field of the Dex header part in the Dex file of the first sample.
8. A Dex-file-based sample maliciousness determination apparatus as claimed in any one of claims 5 to 7, wherein the similarity determination module is specifically configured to determine the similarity of each candidate reference sample to the sample to be judged by any one of the following calculation methods: Euclidean distance, Manhattan distance, Chebyshev distance, Minkowski distance, normalized Euclidean distance, Mahalanobis distance, cosine of the included angle, Hamming distance, and Jaccard distance.
9. A computer device, comprising:
a processor; and
a memory for storing a computer program,
wherein the processor is configured to execute the computer program stored in the memory to implement the method for determining sample maliciousness based on a Dex file according to any one of claims 1 to 4.
10. A computer storage medium, wherein a computer program is stored in the computer storage medium, and when the computer program is executed by a processor, the computer program implements the method for determining sample maliciousness based on a Dex file according to any one of claims 1 to 4.
CN201810617936.7A 2018-06-15 2018-06-15 Dex file-based sample maliciousness determination method and related device Active CN110610084B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810617936.7A CN110610084B (en) 2018-06-15 2018-06-15 Dex file-based sample maliciousness determination method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810617936.7A CN110610084B (en) 2018-06-15 2018-06-15 Dex file-based sample maliciousness determination method and related device

Publications (2)

Publication Number Publication Date
CN110610084A true CN110610084A (en) 2019-12-24
CN110610084B CN110610084B (en) 2022-05-17

Family

ID=68887997

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810617936.7A Active CN110610084B (en) 2018-06-15 2018-06-15 Dex file-based sample maliciousness determination method and related device

Country Status (1)

Country Link
CN (1) CN110610084B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378165A (en) * 2021-06-25 2021-09-10 中国电子科技集团公司第十五研究所 Malicious sample similarity judgment method based on Jaccard coefficient
WO2023035362A1 (en) * 2021-09-07 2023-03-16 上海观安信息技术股份有限公司 Polluted sample data detecting method and apparatus for model training
CN116992448A (en) * 2023-09-27 2023-11-03 北京安天网络安全技术有限公司 Sample determination method, device, equipment and medium based on importance degree of data source

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060257027A1 (en) * 2005-03-04 2006-11-16 Alfred Hero Method of determining alignment of images in high dimensional feature space
CN102750482A (en) * 2012-06-20 2012-10-24 东南大学 Detection method for repackage application in android market
CN105975855A (en) * 2015-08-28 2016-09-28 武汉安天信息技术有限责任公司 Method and system for malicious code detection based on apk certificate similarity
CN106096405A (en) * 2016-04-26 2016-11-09 浙江工业大学 A kind of Android malicious code detecting method abstract based on Dalvik instruction
CN107273746A (en) * 2017-05-18 2017-10-20 广东工业大学 A kind of mutation malware detection method based on APK character string features

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王文冲等: "一种基于模糊哈希的Android变种恶意软件检测方法", 《计算机工程与应用》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378165A (en) * 2021-06-25 2021-09-10 中国电子科技集团公司第十五研究所 Malicious sample similarity judgment method based on Jaccard coefficient
CN113378165B (en) * 2021-06-25 2021-11-05 中国电子科技集团公司第十五研究所 Malicious sample similarity judgment method based on Jaccard coefficient
WO2023035362A1 (en) * 2021-09-07 2023-03-16 上海观安信息技术股份有限公司 Polluted sample data detecting method and apparatus for model training
CN116992448A (en) * 2023-09-27 2023-11-03 北京安天网络安全技术有限公司 Sample determination method, device, equipment and medium based on importance degree of data source
CN116992448B (en) * 2023-09-27 2023-12-15 北京安天网络安全技术有限公司 Sample determination method, device, equipment and medium based on importance degree of data source

Also Published As

Publication number Publication date
CN110610084B (en) 2022-05-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant