CN113656763A - Method and device for determining small program feature vector and electronic equipment - Google Patents

Info

Publication number
CN113656763A
CN113656763A CN202110926708.XA CN202110926708A CN113656763A CN 113656763 A CN113656763 A CN 113656763A CN 202110926708 A CN202110926708 A CN 202110926708A CN 113656763 A CN113656763 A CN 113656763A
Authority
CN
China
Prior art keywords
applet
character string
vector
characteristic
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110926708.XA
Other languages
Chinese (zh)
Other versions
CN113656763B (en
Inventor
郑黄成
欧阳瑜
李佳佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AlipayCom Co ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202110926708.XA priority Critical patent/CN113656763B/en
Publication of CN113656763A publication Critical patent/CN113656763A/en
Application granted granted Critical
Publication of CN113656763B publication Critical patent/CN113656763B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/10Protecting distributed programs or content, e.g. vending or licensing of copyrighted material ; Digital rights management [DRM]
    • G06F21/12Protecting executable software
    • G06F21/121Restricting unauthorised execution of programs
    • G06F21/125Restricting unauthorised execution of programs by manipulating the program code, e.g. source code, compiled code, interpreted code, machine code
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562Static detection
    • G06F21/563Static detection by source code analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Technology Law (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Stored Programmes (AREA)

Abstract

The embodiment of the application provides a method and a device for determining an applet feature vector and electronic equipment, which can generate a vector capable of being identified by a machine to accurately express features of an applet. The method for determining the small program feature vector comprises the following steps: extracting a plurality of characteristic character strings in sequence in program data of the small program, wherein the program data comprises at least one of the following types of program data: a package file structure of the applet, a static code file of the applet and dynamic operation data of the applet; generating a characteristic character string sequence of the small program according to the plurality of characteristic character strings; converting the characteristic character string sequence of the applet into a characteristic character string vector; and inputting the characteristic character string vector into a trained deep learning model to generate a characteristic vector of the small program.

Description

Method and device for determining small program feature vector and electronic equipment
[ technical field ]
The embodiments of the present application relate to the technical field of applets, and in particular to a method and a device for determining an applet feature vector, and to electronic equipment.
[ background of the invention ]
An applet is an application that can be used without being downloaded and installed, and it usually depends on an applet platform (another piece of application software). After a user downloads and installs an application that can serve as the applet platform, the user can enter the applet through an applet entry provided in that application (such as an applet icon or an applet search result option) and use the functions provided by the applet.
[ summary of the invention ]
The embodiment of the application provides a method and a device for determining an applet feature vector, and electronic equipment, so as to generate a vector which can be identified by a machine to accurately express features of an applet.
In a first aspect, an embodiment of the present application provides a method for determining an applet feature vector, where the method includes: extracting a plurality of characteristic character strings in sequence in program data of the small program, wherein the program data comprises at least one of the following types of program data: a package file structure of the applet, a static code file of the applet and dynamic operation data of the applet; generating a characteristic character string sequence of the small program according to the plurality of characteristic character strings; converting the characteristic character string sequence of the applet into a characteristic character string vector; and inputting the characteristic character string vector into a trained deep learning model to generate a characteristic vector of the small program.
In one possible implementation manner, converting a characteristic string sequence of an applet into a characteristic string vector includes: and replacing each characteristic character string in the characteristic character string sequence with a corresponding numerical index code according to the mapping relation between the character string and the numerical index code in the preset index mapping table to obtain a characteristic character string vector.
In one possible implementation manner, in a case where the program data includes a plurality of categories, generating a characteristic string sequence of the applet according to the plurality of characteristic strings includes: respectively extracting feature character strings not exceeding a preset number from the feature character strings corresponding to each type of program data; and combining the extracted characteristic character strings to obtain a characteristic character string sequence.
In one possible implementation manner, the program data includes a package file structure of the applet, and the extracting a plurality of feature character strings in sequence in the program data of the applet includes: and extracting the file name and the file type suffix of each file according to the structure sequence of the package file structure to obtain a file name characteristic character string of each file, wherein each file name characteristic character string comprises the file name and the file type suffix of the corresponding file.
In one possible implementation manner, generating a feature string sequence of the applet according to the plurality of feature strings includes: extracting a character string of a suffix of a target file type from a file name characteristic character string obtained according to a package file structure to obtain a characteristic character string corresponding to the package file structure; and generating a characteristic character string sequence according to the extracted character string.
In one possible implementation manner, the program data includes a static code file of the applet, and the extracting a plurality of feature character strings in sequence in the program data of the applet includes: selecting a plurality of target code files from static code files of the small program; matching a preset regular expression in each target code file, wherein the preset regular expression comprises one or more target character strings and a matching rule of each target character string; and splitting each hit code segment into a plurality of character strings to obtain a plurality of characteristic character strings.
In one possible implementation manner, the program data includes dynamic operation data of the applet, and the extracting a plurality of feature character strings in sequence in the program data of the applet includes: running the small program; matching a request generated in the running process of the small program with preset character strings in the request, wherein each preset character string is used for representing the name of one type of information carried in the request; and splitting the hit request to obtain a plurality of characteristic character strings.
In one possible implementation manner, before replacing each characteristic character string in the characteristic character string sequence with the numeric index code of the corresponding character string according to the mapping relationship between the character string and the numeric index code in the preset index mapping table, the method further includes: determining unrepeated character strings which appear in the plurality of characteristic character strings and do not appear in a preset index mapping table to obtain unknown character strings; distributing non-repeated numerical index codes for each unknown character string; and storing the mapping relation between the unknown character string and the corresponding digital index code in the preset index mapping table so as to update the preset index mapping table.
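The table update and the index replacement described above can be sketched as follows. This is a minimal illustration only: the function names `update_index_table` and `to_index_vector`, the dictionary representation of the preset index mapping table, and the sample strings are assumptions, not details given in the application.

```python
from typing import Dict, List


def update_index_table(table: Dict[str, int], strings: List[str]) -> None:
    """Assign a new, unused numeric index code to every unknown string."""
    next_code = max(table.values(), default=0) + 1
    for s in strings:
        if s not in table:  # unknown string that has no code yet
            table[s] = next_code
            next_code += 1


def to_index_vector(table: Dict[str, int], sequence: List[str]) -> List[int]:
    """Replace each feature string with its numeric index code."""
    return [table[s] for s in sequence]


# Hypothetical preset table and feature string sequence.
index_table = {"index.js": 1, "url": 2}
feature_sequence = ["index.js", "webview.js", "url", "webview.js"]
update_index_table(index_table, feature_sequence)
string_vector = to_index_vector(index_table, feature_sequence)
```

Updating the table before conversion guarantees that every string in the sequence has a code, so the lookup in `to_index_vector` cannot fail.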
In one possible implementation manner, generating a feature string sequence of the applet according to the plurality of feature strings includes: calculating a word frequency-inverse text frequency index TF-IDF aiming at each character string in the updated preset index mapping table to obtain a score of each character string in the preset index mapping table; and generating a feature character string sequence of the small program according to the feature character strings with the scores exceeding the preset scores in the plurality of feature character strings.
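The scoring and filtering step above could look roughly like the sketch below. The application does not spell out the TF-IDF variant, so the common form tf × log(N / df) is assumed here; the function names, the threshold value and the sample corpus are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_scores(document: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """Score each string of one applet against a corpus of applets.

    Assumes `document` is itself part of `corpus`, so every string
    appears in at least one corpus entry.
    """
    counts = Counter(document)
    scores: Dict[str, float] = {}
    for term, count in counts.items():
        tf = count / len(document)
        df = sum(1 for doc in corpus if term in doc)  # applets containing the string
        scores[term] = tf * math.log(len(corpus) / df)
    return scores


def filter_by_score(document: List[str], corpus: List[List[str]], threshold: float) -> List[str]:
    """Keep strings, in their original order, whose score exceeds the threshold."""
    scores = tf_idf_scores(document, corpus)
    return [s for s in document if scores[s] > threshold]


# Hypothetical feature strings of three applets.
corpus = [
    ["index.js", "url", "setData"],
    ["index.js", "pay.js", "url"],
    ["index.js", "steal.js", "redirect"],
]
kept = filter_by_score(corpus[2], corpus, threshold=0.1)
```

In this sample, `index.js` occurs in every applet and therefore scores zero, so only the strings specific to the third applet are kept.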
In one possible implementation manner, before inputting the feature string vector into the trained deep learning model to generate the feature vector of the applet, the method further includes: training a coding and decoding model by using a plurality of training vectors, wherein each training vector is a feature string vector of one applet, the coding and decoding model comprises a coding model and a decoding model, the coding model is used for coding the training vectors to obtain coded vectors, the decoding model is used for decoding the coded vectors output by the coding model to obtain output vectors, and the optimization goal of training the coding and decoding model is to reduce loss values calculated according to the output vectors and the training vectors; and determining that a training convergence condition is reached to obtain a trained coding model.
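To make the training objective above concrete, the sketch below trains a very small linear coding and decoding pair in pure Python by gradient descent on the squared reconstruction error. It only stands in for the deep learning model: the application does not specify the architecture, and the linear layers, the learning rate, the fixed epoch count used in place of a convergence condition, and the scaling of index codes to the range 0 to 1 are all assumptions made for illustration.

```python
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def train_autoencoder(data: List[Vector], hidden: int, epochs: int = 200,
                      lr: float = 0.05, seed: int = 0) -> Tuple[Matrix, List[float]]:
    """Return the trained encoder weights and the mean loss of each epoch."""
    rng = random.Random(seed)
    dim, n = len(data[0]), len(data)
    enc = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(hidden)]
    dec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(dim)]
    losses: List[float] = []
    for _ in range(epochs):
        g_enc = [[0.0] * dim for _ in range(hidden)]
        g_dec = [[0.0] * hidden for _ in range(dim)]
        loss = 0.0
        for x in data:
            h = matvec(enc, x)                       # coded vector
            y = matvec(dec, h)                       # decoded output vector
            err = [yi - xi for yi, xi in zip(y, x)]
            loss += sum(e * e for e in err)          # squared reconstruction error
            dh = [sum(2 * err[i] * dec[i][j] for i in range(dim)) for j in range(hidden)]
            for i in range(dim):
                for j in range(hidden):
                    g_dec[i][j] += 2 * err[i] * h[j]
            for j in range(hidden):
                for k in range(dim):
                    g_enc[j][k] += dh[j] * x[k]
        losses.append(loss / n)
        for i in range(dim):                         # gradient step on the decoder
            for j in range(hidden):
                dec[i][j] -= lr * g_dec[i][j] / n
        for j in range(hidden):                      # gradient step on the encoder
            for k in range(dim):
                enc[j][k] -= lr * g_enc[j][k] / n
    return enc, losses


# Hypothetical index-code vectors of three applets, scaled to [0, 1].
raw_vectors = [[1, 3, 2, 3], [1, 4, 2, 5], [1, 6, 7, 8]]
training_vectors = [[code / 8 for code in vec] for vec in raw_vectors]
encoder, losses = train_autoencoder(training_vectors, hidden=2)
applet_feature_vector = matvec(encoder, training_vectors[0])
```

Only the encoder is kept after training; applying it to a feature string vector yields the lower-dimensional applet feature vector.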
In one possible implementation manner, after inputting the feature string vector into the trained deep learning model to generate the feature vector of the applet, the method further includes: and determining the similarity of the small program and other small programs according to the feature vector of the small program and the feature vectors of other small programs.
In one possible implementation manner, after determining the similarity between the applet and the other applets, the method further includes: acquiring preset labels of other applets; and determining the label of the small program according to the preset labels of other small programs.
In one possible implementation manner, the preset tag is used to mark whether the corresponding applet is a malicious applet.
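The similarity and labelling steps above can be illustrated with the short sketch below. The application does not name a similarity measure or a labelling rule, so cosine similarity and "take the label of the most similar labelled applet" are assumptions chosen for illustration, as are the function names and the sample vectors.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def label_from_most_similar(vector: List[float],
                            labelled: Dict[str, Tuple[List[float], str]]) -> str:
    """Return the preset label of the labelled applet most similar to `vector`."""
    best = max(labelled, key=lambda name: cosine_similarity(vector, labelled[name][0]))
    return labelled[best][1]


# Hypothetical feature vectors and preset labels of other applets.
labelled_applets = {
    "applet_a": ([1.0, 0.0, 0.2], "malicious"),
    "applet_b": ([0.0, 1.0, 0.9], "benign"),
}
label = label_from_most_similar([0.9, 0.1, 0.1], labelled_applets)
```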
According to the method and the device, feature character strings are extracted in sequence from one or more kinds of applet program data, such as the package file structure of the applet, the static code file of the applet and the dynamic running data of the applet; a feature character string sequence of the applet is generated from the feature character strings; and the feature character string sequence is converted into a feature character string vector and then input into a trained coding model to generate the feature vector of the applet. A machine-recognizable vector that accurately expresses the features of the applet can thus be generated, which solves the technical problem that the features of an applet cannot be expressed.
In a second aspect, an embodiment of the present application provides an apparatus for determining an applet feature vector, including: the extraction unit is used for extracting a plurality of characteristic character strings in sequence in the program data of the small program, wherein the program data comprises at least one of the following types of program data: a package file structure of the applet, a static code file of the applet and dynamic operation data of the applet; a first generation unit configured to generate a characteristic character string sequence of the applet from the plurality of characteristic character strings; the conversion unit is used for converting the characteristic character string sequence of the applet into a characteristic character string vector; and the second generation unit is used for inputting the characteristic character string vector into the trained deep learning model so as to generate the characteristic vector of the small program.
In one possible implementation manner, the conversion unit is further configured to replace each feature character string in the feature character string sequence with a corresponding numeric index code according to a mapping relationship between a character string and a numeric index code in a preset index mapping table, so as to obtain a feature character string vector.
In one possible implementation manner, in a case where the program data includes a plurality of kinds, the first generating unit includes: the first extraction module is used for extracting the characteristic character strings of which the number is not more than the preset number in the characteristic character strings corresponding to the program data of each type; and the combination module is used for combining the extracted characteristic character strings to obtain a characteristic character string sequence.
In one possible implementation manner, the program data includes a package file structure of the applet, and the extraction unit includes: and the second extraction module is used for extracting the file name and the file type suffix of each file according to the structure sequence of the package file structure so as to obtain a file name characteristic character string of each file, wherein each file name characteristic character string comprises the file name and the file type suffix of the corresponding file.
In one possible implementation manner, the first generating unit includes: the third extraction module is used for extracting the character string of the suffix of the target file type from the file name characteristic character string obtained according to the package file structure so as to obtain a characteristic character string corresponding to the package file structure; and the first generation module is used for generating a characteristic character string sequence according to the extracted character string.
In one possible implementation manner, the program data includes a static code file of the applet, and the extraction unit includes: the selection module is used for selecting a plurality of target code files from the static code files of the small program; the matching module is used for matching a preset regular expression in each target code file, wherein the preset regular expression comprises one or more target character strings and a matching rule of each target character string; and the splitting module is used for splitting each hit code segment into a plurality of character strings to obtain a plurality of characteristic character strings.
In one possible implementation manner, the program data includes dynamic operation data of the applet, and the extraction unit includes: the running module is used for running the small program; the grabbing module is used for grabbing requests generated in the running process of the small programs; the matching module is used for matching preset character strings in the request, wherein each preset character string is used for representing the name of one type of information carried in the request; and the splitting module is used for splitting the hit request to obtain a plurality of characteristic character strings.
In one possible implementation manner, the apparatus further includes: the first determining unit is used for determining unrepeated character strings which appear in the plurality of characteristic character strings and do not appear in the preset index mapping table before the characteristic character string vectors are obtained by the converting unit, so as to obtain unknown character strings; the distribution unit is used for distributing non-repeated numerical index codes for each unknown character string; and the storage unit is used for storing the mapping relation between the unknown character string and the corresponding digital index code in the preset index mapping table so as to update the preset index mapping table.
In one possible implementation manner, the first generating unit includes: the calculation module is used for calculating a word frequency-inverse text frequency index TF-IDF aiming at each character string in the updated preset index mapping table so as to obtain a score of each character string in the preset index mapping table; and the second generation module is used for generating a feature character string sequence of the small program according to the feature character string with the score exceeding the preset score in the feature character strings.
In one possible implementation manner, the apparatus further includes: the training unit is used for training a coding and decoding model by using a plurality of training vectors before the second generation unit generates the feature vector of the applet, wherein each training vector is a feature character string vector of one applet, the coding and decoding model comprises a coding model and a decoding model, the coding model is used for coding the training vectors to obtain coded vectors, the decoding model is used for decoding the coded vectors output by the coding model to obtain output vectors, and the optimization goal of training the coding and decoding model is to reduce loss values calculated according to the output vectors and the training vectors; and the second determining unit is used for determining that the training convergence condition is reached so as to obtain the trained coding model.
In one possible implementation manner, the apparatus further includes: and the third determining unit is used for determining the similarity between the small program and other small programs according to the feature vector of the small program and the feature vectors of other small programs after the feature vector of the small program is generated by the second generating unit.
In one possible implementation manner, the apparatus further includes: the acquisition unit is used for acquiring the preset labels of other applets after the third determination unit determines the similarity between the applets and the other applets; and the fourth determining unit is used for determining the label of the small program according to the preset labels of other small programs.
In one possible implementation manner, the preset tag is used to mark whether the corresponding applet is a malicious applet.
In a third aspect, an embodiment of the present application provides an electronic device, including:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor calling the program instructions to be able to perform the method provided by the first aspect.
In a fourth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions that cause the computer to perform the method provided in the first aspect.
It should be understood that the second to fourth aspects of the embodiments of the present application are consistent with the technical solutions of the first aspect of the embodiments of the present application, and beneficial effects obtained by the aspects and the corresponding possible implementation manners are similar and will not be described again.
[ description of the drawings ]
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a flow diagram of one embodiment of a method for determining an applet feature vector according to an embodiment of the present application;
FIG. 2 is a flowchart of another embodiment of a method for determining an applet feature vector according to an embodiment of the present application;
FIG. 3 is a block diagram illustrating an embodiment of an apparatus for determining an applet feature vector according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an embodiment of an electronic device according to the present application.
[ detailed description ]
In order to better understand the technical solutions of the embodiments of the present application, the embodiments of the present application are described in detail below with reference to the accompanying drawings.
It should be understood that the embodiments described are only some embodiments of the present application, and not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without any creative effort belong to the protection scope of the embodiments in the present application.
The terminology used in the embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the present application. As used in the examples of this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
Before an applet is released to an applet platform (also called listing the applet), whether the applet is legal and compliant needs to be checked manually, and the applet is listed only after the check is passed, so the cost of the check is high. Moreover, after the applet is listed, the applet developer may change content stored on a server that the applet calls (such as a website), so that the applet becomes non-compliant after listing. Because applets depend on an applet platform, the templates used to develop them tend to be similar, and non-compliant malicious applets have similar features. How to generate a representation that expresses the features of an applet more accurately is therefore a problem to be solved.
The embodiment of the application provides a method for determining an applet feature vector, which can be applied to electronic equipment with computing and storage capabilities, such as a server, a workstation, a laptop or a mobile communication terminal. The method may in particular be offered in the form of a program, a client or a software platform that gives the user an entry to the software modules implementing the method. For example, when the entry is provided in the form of a client, the user may download the client and upload the applet data through it to a remote server (such as a cloud server), and the remote server executes the method for determining the applet feature vector provided in the embodiment of the present application. Other ways are not described in detail herein; based on the above exemplary description, those skilled in the art can provide the functions of the method to the user through other types of entry.
Fig. 1 is a flowchart of an embodiment of a method for determining an applet feature vector according to the present application, where as shown in fig. 1, the method for determining an applet feature vector may include:
step 101, a plurality of characteristic character strings are sequentially extracted from the program data of the applet.
An applet is a type of application that can be used without being downloaded and installed and that runs on a designated platform (application). An applet can be developed by an applet developer for one application or for several compatible applications. After development is completed, the applet developer submits the applet to the administrator of the application for review; if the review is passed, the applet can be uploaded to the application, so that a user of the application can enter the applet through various entries (such as an icon of the applet or an applet search result option) and use the functions provided by the applet.
The program data of the applet is data related to the program of the applet and may include the file contents of the applet and the data generated when the applet runs. The file contents of the applet may be the file names and file type suffixes in the package file structure of the applet, the content of the code, and the like. The data generated when the applet runs may be any data generated during running, such as the underlying methods of the terminal device system that are called and the requests that are issued.
The program data of the applet may comprise at least one of the following categories of program data: (1) the package file structure of the applet; (2) a static code file of the applet; and (3) dynamic running data of the applet. For example, the program data may include the package file structure of the applet; or a static code file of the applet and dynamic running data of the applet; or the package file structure, a static code file and dynamic running data of the applet. The remaining combinations are not enumerated here.
In the case that the applet is developed using the Java language, the applet is used in the form of a Jar package, and the Jar package file of the applet has a certain structure, referred to in the embodiment of the present application as the package file structure; specifically, it may be a tree structure. The file names and suffixes in the package file structure carry certain information, as does the package structure itself. Specifically, in the case where the program data includes the package file structure of the applet, when step 101 is executed, the file name and file type suffix of each file may be extracted in the structural order of the package file structure to obtain a file name feature character string for each file, where each file name feature character string includes the file name and file type suffix of the corresponding file. Extraction in structural order may be based on depth-first search (DFS) or breadth-first search (BFS). For example, the file name feature character strings may be index.js, webview.js, webview.axml, body.jpg, title.png, and so on, where the part before "." is the file name and the part after "." is the file type suffix.
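The extraction in structural order described above can be sketched as follows. This is a minimal illustration, assuming the package file structure is already available as a nested dictionary in which a directory maps to a sub-dictionary and a file maps to None; the function name `extract_filename_strings` and the sample tree are hypothetical. Depth-first traversal is used here, although breadth-first traversal would serve equally.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory model of a package file structure:
# a directory maps to a nested dict, a file maps to None.
Tree = Dict[str, Optional["Tree"]]


def extract_filename_strings(tree: Tree) -> List[str]:
    """Depth-first walk returning 'name.suffix' strings in structural order."""
    result: List[str] = []
    for name, child in tree.items():  # dicts preserve insertion order
        if child is None:
            result.append(name)  # a file: keep file name plus type suffix
        else:
            result.extend(extract_filename_strings(child))  # a directory: recurse
    return result


package = {
    "index.js": None,
    "pages": {"webview.js": None, "webview.axml": None},
    "images": {"body.jpg": None, "title.png": None},
}
filename_strings = extract_filename_strings(package)
```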
The program data may include a static code file of the applet, and accordingly, the step 101 of sequentially extracting a plurality of characteristic character strings in the program data of the applet may include the steps of:
Step 111, selecting a plurality of target code files from the static code files of the applet.
In one optional implementation of selecting the target code files, several keywords may be preset and the file names of all static code files screened against them; the files that pass are the target code files. The keywords may be chosen so as to pick out the core code files among the static code files, and the core code files are used as the target code files. This is because plain code text contains a great deal of noise, so features may be extracted only from the core code files; specifically, a core code file may be the main entry configuration file, the home page presentation code file, or the like. Optionally, the target code files may instead be selected according to file size, with the largest files, up to a preset number, used as the target code files.
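The two selection strategies above might be sketched as follows. The function names, the keywords and the file sizes are hypothetical, and substring matching on the file name is only one plausible reading of keyword screening.

```python
from typing import Dict, List


def select_by_keyword(files: Dict[str, int], keywords: List[str]) -> List[str]:
    """Keep files whose name contains any preset keyword."""
    return [name for name in files if any(k in name for k in keywords)]


def select_largest(files: Dict[str, int], top_n: int) -> List[str]:
    """Keep the top_n files with the largest size in bytes."""
    return sorted(files, key=lambda name: files[name], reverse=True)[:top_n]


# Hypothetical static code files mapped to their sizes in bytes.
static_files = {"app.js": 1200, "pages/index/index.js": 5400, "utils/log.js": 300}
by_keyword = select_by_keyword(static_files, ["app", "index"])
by_size = select_largest(static_files, 2)
```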
Step 112, matching a preset regular expression in each target code file.
After the target code files are selected, matching is performed in each target code file using a preset regular expression. The preset regular expression includes one or more target character strings and a matching rule for each target character string. A regular expression describes a string matching pattern that can be used to select, from all the character strings of a target code file, those that meet predefined conditions. For example, since an applet is closed in nature, its dynamic control logic usually uses an httprequest class; by presetting a corresponding regular expression, the character string matched in code that calls the httprequest class can be as follows:
"httprequest->url->success->if->setData->display0->else->setData->display1".
Step 113, splitting each hit code segment into a plurality of character strings to obtain a plurality of feature character strings.
After a hit code segment is determined, it can be split into a plurality of character strings at the whitespace characters, such as spaces and carriage returns, between adjacent words in the code, to obtain a plurality of code feature character strings.
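Steps 112 and 113 can be illustrated with the sketch below. The pattern, which takes "httpRequest" as the target character string and "match to the end of the line" as its matching rule, is a hypothetical example rather than the regular expression used in the application, as are the function name and the sample code text.

```python
import re
from typing import List

# Hypothetical preset regular expression: the target string "httpRequest"
# followed by everything up to the end of the line.
PRESET_PATTERN = re.compile(r"httpRequest[^\n]*")


def extract_code_strings(source: str) -> List[str]:
    """Match the preset pattern, then split each hit on whitespace."""
    strings: List[str] = []
    for hit in PRESET_PATTERN.findall(source):
        strings.extend(hit.split())  # split() handles spaces, tabs, carriage returns
    return strings


code = "var a = 1\nhttpRequest url success setData display0\nvar b = 2\n"
code_strings = extract_code_strings(code)
```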
The program data may include dynamic operation data of the applet, and accordingly, the step 101 of sequentially extracting a plurality of feature strings in the program data of the applet may include the steps of:
Step 121, running the applet.
In particular, the applet may be run in a simulated runtime environment.
Step 122, the request generated in the running process of the applet is captured.
Step 123, matching preset character strings in the request, wherein each preset character string is used for representing the name of one type of information carried in the request;
Step 124, splitting the hit request to obtain a plurality of characteristic character strings.
The generated request may carry many kinds of information. To avoid interference from excessive information, only some kinds of information may be extracted, for example Header and Response information. Specifically, preset character strings are matched in the request, where each preset character string is a character string that may appear in the required information, and a request in which a preset character string appears is used as a request characteristic character string. For example, the request characteristic character strings may be: header: www.zzryy.cn, response: status, etc.
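Steps 123 and 124 may be sketched as follows, assuming for illustration that each captured request has already been parsed into a mapping from information name to value. The request representation, the sample values, and the function name are hypothetical.

```python
from typing import Dict, Iterable, List, Sequence

# Names of the kinds of information to keep (the text mentions Header and Response).
PRESET_NAMES = ("header", "response")


def extract_request_feature_strings(requests: Iterable[Dict[str, str]],
                                    preset_names: Sequence[str] = PRESET_NAMES) -> List[str]:
    """Build "name:value" strings for every preset name found in a request."""
    features: List[str] = []
    for request in requests:
        for name in preset_names:
            if name in request:
                features.append(f"{name}:{request[name]}")
    return features


# Made-up captured requests; the "cookie" entry is ignored because it is not preset.
captured = [
    {"header": "www.example.com", "cookie": "abc"},
    {"response": "status"},
]
request_features = extract_request_feature_strings(captured)
```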
Step 102, generating a characteristic character string sequence of the applet according to the plurality of characteristic character strings.
After the plurality of characteristic character strings are extracted from the program data, they may be combined in the order of extraction to generate the characteristic character string sequence. For example, if the sequentially extracted characteristic character strings are index.js, webview.js, webview.axml, body.jpg, and title.png, the characteristic character string sequence is {index.js, webview.js, webview.axml, body.jpg, title.png}.
In an alternative embodiment, if the program data includes multiple categories of program data, the characteristic character strings of the categories may be concatenated in a predetermined order to generate the characteristic character string sequence. For example, if the sequentially extracted file name characteristic character strings are index.js, webview.js, webview.axml, body.jpg, and title.png, and the sequentially extracted code characteristic character strings are httprequest, url, success, if, setData, display0, else, setData, and display1, then, with the file name characteristic character strings placed before the code characteristic character strings, the generated characteristic character string sequence is: {index.js, webview.js, webview.axml, body.jpg, title.png, httprequest, url, success, if, setData, display0, else, setData, display1}.
In an alternative embodiment, in order to align the characteristic character string sequences of different applets, a preset number of characteristic character strings may be selected to form the sequence; if fewer than the preset number are available, the sequence is padded with a preset character string (e.g., 0 or non, which is not limited in this embodiment of the present application). If the program data includes a plurality of categories of program data, a corresponding preset number may be set for each category. For example, 100 character strings are selected in order from each of the file name characteristic character strings, the code characteristic character strings, and the request characteristic character strings, and any category with fewer than 100 character strings is padded with non.
In an optional implementation, each character string may also be evaluated in advance using an index such as the term frequency-inverse document frequency (TF-IDF) index, and the index value is used as the score of the character string. Character strings with higher scores are then selected from the plurality of characteristic character strings, and the selected character strings are combined to obtain the characteristic character string sequence. Taking program data that includes three types of program data (the package file structure of the applet, the static code file of the applet, and the dynamic operation data of the applet) as an example, TF-IDF is calculated as follows:
TF = (number of times the target character string is hit in a code file or a request) / (number of times the target character string appears in all character strings);
IDF = log[(total number of files + total number of requests + 1) / (number of files or requests including the target character string)] + 1;
TF-IDF = TF * IDF.
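The TF-IDF formulas above may be illustrated by the following sketch. It assumes that each code file or request is represented as a list of character strings, that "all character strings" means all strings across these files and requests, and that the logarithm is the natural logarithm; the base is not stated in the text, and the function name is hypothetical.

```python
import math
from typing import List


def tf_idf(target: str, document: List[str], corpus: List[List[str]]) -> float:
    """Score `target` for one file or request (`document`) within `corpus`.

    `corpus` holds every code file and request, each as a list of strings.
    """
    hits_in_document = document.count(target)
    hits_in_all = sum(doc.count(target) for doc in corpus)
    containing = sum(1 for doc in corpus if target in doc)
    if hits_in_all == 0:
        return 0.0  # the target string never occurs, so there is nothing to score
    tf = hits_in_document / hits_in_all
    # len(corpus) is the total number of files plus the total number of requests.
    idf = math.log((len(corpus) + 1) / containing) + 1
    return tf * idf


# Made-up corpus of two code files and one request.
corpus = [["url", "success", "url"], ["url", "if"], ["else"]]
score = tf_idf("url", corpus[0], corpus)
```

In this example, "url" is hit 2 times in the first file and 3 times overall, and appears in 2 of the 3 files and requests, so TF = 2/3 and IDF = log(4/2) + 1.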
In an optional implementation, the program data includes the package file structure of the applet, and the characteristic character strings extracted from the package file structure are file name characteristic character strings, each including a file name and a file type suffix. When step 102 is executed, only the characteristic character strings of a target file type of interest may be extracted. For example, in one application scenario, an applet auditor finds that some applets contain illegal pictures, so the auditor is currently interested in the features of picture files, and the file name characteristic character strings of the jpg and png file types need to be extracted from the file name characteristic character strings. Specifically, in this alternative embodiment, the following steps may be performed when step 102 is executed:
step 201, extracting a character string of a suffix of a target file type from a file name characteristic character string obtained according to a package file structure to obtain a characteristic character string corresponding to the package file structure;
step 202, generating a characteristic character string sequence according to the extracted character string.
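Steps 201 and 202 may be sketched as follows for the picture-file scenario described above. The function name and the default suffixes are assumptions made for illustration.

```python
import os
from typing import List, Sequence


def filter_by_suffix(file_name_strings: Sequence[str],
                     target_suffixes: Sequence[str] = (".jpg", ".png")) -> List[str]:
    """Keep, in original order, the file name strings with a target suffix."""
    return [
        name for name in file_name_strings
        if os.path.splitext(name)[1].lower() in target_suffixes
    ]


names = ["index.js", "webview.js", "webview.axml", "body.jpg", "title.png"]
picture_sequence = filter_by_suffix(names)
```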
Optionally, step 102 may be implemented according to any one of the optional embodiments described above, or according to a combination of several of them. For example, after the characteristic character strings are extracted in sequence from each of the three types of program data, the TF-IDF score of each characteristic character string is determined. Then, among the characteristic character strings corresponding to each type of program data, those whose scores rank in the top 20 are retained, those whose scores rank below the top 20 are deleted, and at most 100 characteristic character strings of each type are retained.
For example, the characteristic character strings extracted from the program data of type (I) are F, Y, D, I, N, C, I, A, T, ……, the characteristic character strings extracted from the program data of type (II) are W, D, P, Q, B, X, D, ……, and the characteristic character strings extracted from the program data of type (III) are R, U, S, F, A, T, D, Z, …….
Suppose the characteristic character strings whose scores rank in the top 20 are A to T. The characteristic character strings ranked below the top 20 are then deleted from the characteristic character strings corresponding to each type of program data. After deletion, fewer than 100 characteristic character strings of type (I) remain, so the remaining positions are padded with non; more than 100 characteristic character strings of types (II) and (III) remain, so the characteristic character strings after the 100th are deleted.
The retained characteristic character strings are combined in their original order to obtain the characteristic character string sequence {F, D, I, N, C, I, A, T, ……, non, non, ……, D, P, Q, B, D, ……, R, S, F, A, T, D, ……}.
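The combined procedure in this example (retain the top-ranked character strings, keep the original order, then pad or truncate each type to a fixed length) may be sketched as follows. Small values stand in for the top 20 and the limit of 100, and the scores and function name are made up for illustration.

```python
from typing import Dict, List, Sequence


def select_and_align(categories: Sequence[Sequence[str]],
                     scores: Dict[str, float],
                     top_k: int,
                     limit: int,
                     pad: str = "non") -> List[str]:
    """Keep strings ranked in the top_k by score, then pad/truncate per category."""
    ranked = sorted(scores, key=lambda s: (-scores[s], s))
    keep = set(ranked[:top_k])
    sequence: List[str] = []
    for strings in categories:
        kept = [s for s in strings if s in keep][:limit]  # original order preserved
        sequence.extend(kept + [pad] * (limit - len(kept)))
    return sequence


# Made-up scores: "Y" ranks below the top 3 and is therefore deleted.
scores = {"A": 0.9, "B": 0.8, "C": 0.7, "Y": 0.1}
categories = [["A", "Y", "B"], ["C", "C", "Y", "C", "A"]]
aligned = select_and_align(categories, scores, top_k=3, limit=3)
```

The first type retains only two character strings and is padded with non, while the second retains four and is truncated to three.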
Step 103, replacing each characteristic character string in the characteristic character string sequence with the numeric index code of the corresponding character string according to the mapping relationship between character strings and numeric index codes in the preset index mapping table, to obtain the characteristic character string vector of the applet.
It should be noted that the characteristic character string vector refers to a number vector for representing a characteristic character string sequence, and each characteristic character string in the characteristic character string sequence is identified by a corresponding number, so as to convert a character string which may include letters, symbols and the like into a character string of pure numbers, so that the deep learning model can recognize the character string.
An optional implementation manner is that a plurality of mapping relationships are stored through a preset index mapping table, each mapping relationship is a corresponding relationship between one character string and one numeric index code, the character strings in different mapping relationships are not repeated, and the numeric index codes in different mapping relationships are also not repeated. The numeric index code is a number. For example, the preset index mapping table may include the following mapping relationship:
{1:“index”,2:“webview”,3:“title”,……}
The number before the colon is the numeric index code, and the word after it is the corresponding characteristic character string.
And then, replacing each characteristic character string in the characteristic character string sequence with a corresponding numerical index code according to the mapping relation between the character string and the numerical index code in the preset index mapping table to obtain a characteristic character string vector.
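The replacement in step 103 may be sketched as follows. The mapping table mirrors the example format above (numeric index code to character string). Mapping the padding string non to 0 is an assumption made for this sketch, as is the function name.

```python
from typing import Dict, List, Sequence


def to_feature_string_vector(sequence: Sequence[str],
                             index_table: Dict[int, str],
                             pad: str = "non",
                             pad_code: int = 0) -> List[int]:
    """Replace every characteristic character string with its numeric index code."""
    # Invert the table so that strings can be looked up directly.
    code_of = {string: code for code, string in index_table.items()}
    code_of[pad] = pad_code  # assumption: the padding string maps to 0
    return [code_of[string] for string in sequence]


index_table = {1: "index", 2: "webview", 3: "title"}
vector = to_feature_string_vector(["webview", "index", "non", "title"], index_table)
```

A character string missing from the table raises a KeyError in this sketch, which is why the table is brought up to date before step 103 is performed.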
In an alternative embodiment, the predetermined index mapping table is obtained before step 103 is performed.
Specifically, the step of obtaining the preset index mapping table may include:
step 301, determining unrepeated character strings which appear in the plurality of characteristic character strings and do not appear in the preset index mapping table, and obtaining unknown character strings.
The plurality of characteristic character strings of the applet extracted in step 101 are deduplicated, and the character strings that already exist in the preset index mapping table are removed. One or more mapping relationships may already be stored in the preset index mapping table; these may be the mapping relationships stored when the feature vectors of other applets were generated using the method provided in the embodiments of the present application. After deduplication and removal, the character strings that do not exist in the preset index mapping table are obtained as the unknown character strings.
Step 302, assigning each unknown string a non-repeating numeric index code.
It should be noted that the non-repeated numeric index code means that the numeric index codes of different unknown character strings are different, and are also different from the existing numeric index codes in the preset index mapping table.
Step 303, storing the mapping relationship between the unknown character string and the corresponding numeric index code in the preset index mapping table to update the preset index mapping table.
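Steps 301 to 303 may be sketched as follows. Assigning the next unused integer to each unknown character string is one possible way to guarantee non-repeating numeric index codes; the present application does not prescribe a particular assignment scheme, and the function name is hypothetical.

```python
from typing import Dict, Iterable


def update_index_table(index_table: Dict[int, str],
                       feature_strings: Iterable[str]) -> Dict[int, str]:
    """Add every unknown string to the table under a fresh numeric index code."""
    known = set(index_table.values())
    next_code = max(index_table, default=0) + 1  # larger than every existing code
    for string in feature_strings:
        if string not in known:  # also skips repeats within feature_strings
            index_table[next_code] = string
            known.add(string)
            next_code += 1
    return index_table


table = {1: "index", 2: "webview"}
update_index_table(table, ["title", "index", "title", "body"])
```

Existing mapping relationships are left unchanged, so feature vectors generated earlier for other applets remain comparable.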
Step 104, inputting the characteristic character string vector into the trained deep learning model to generate a feature vector of the applet.
The feature vector can be viewed as the "fingerprint" of an applet, and different applets have different feature vectors. Optionally, the deep learning model may be the coding model of a prior-art coding and decoding model based on the Seq2Seq (sequence to sequence) framework or on the Seq2Seq + Attention (sequence to sequence + Attention) framework, or it may be a prior-art neural network model based on the Transformer framework. The model adopted is not specifically limited in this embodiment of the present application and may be set according to the specific situation.
Taking the coding model as an example, the coding (encoder) model is the model used for coding in a coding-decoding (encoder-decoder) model, which also includes a decoding (decoder) model. The coding model outputs a vector according to an input vector (the elements of the input vector are input one by one in order), and the output vector is used as the feature vector expressing the features of the applet. The feature vector output by the coding model is then input into the decoding model. The optimization goal of training the coding and decoding model is to reduce the loss value calculated from the output vector of the decoding model and the training vector input into the coding model.
The coding model is trained in advance: at least before step 104 is executed, the codec model is trained using a plurality of training vectors. Specifically, the codec model is trained with each training vector in turn, and after each training pass the parameters of the codec model are adjusted according to the loss value between the output result (i.e., the vector output by the codec model) and a target vector. The target vector may be the training vector input to the codec model, or a vector determined from that training vector, for example its reverse-order vector. When one of the training vectors is used to train the codec model, the specific steps of an alternative embodiment may include:
step 401, selecting a training vector from a plurality of training vectors, and inputting the training vector into a current coding model to obtain output feature vectors, wherein each training vector is a feature character string vector of a small program;
step 402, inputting the feature vector output by the coding model into a decoding model;
step 403, obtaining a vector output by the decoding model;
step 404, determine the reverse order vector of the training vector, for example, if the training vector is {12,31,56}, then the reverse order vector is {56,31,12 }.
Step 405, adjusting the weight parameters in the coding model and the decoding model according to the loss value between the vector output by the decoding model and the reverse-order vector of the training vector.
Step 406, determining whether a training convergence condition is reached; for example, the training convergence condition may be that a specified number of training iterations has been completed, or that the loss between the output result and the expected result is less than a preset threshold.
Specifically, the loss value sequence_loss(Sn, S′n) between the reverse-order vector Sn of the training vector and the output vector S′n of the decoding model may be calculated by the following formula:
(Formula image BDA0003209529090000161; the formula is not reproduced in this text.)
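The control flow of steps 401 to 406 may be sketched as follows. This is only a skeleton: the coding model, the decoding model, the loss calculation, and the parameter adjustment are abstracted into a hypothetical callable `train_step`, which would be supplied by whatever deep learning framework is used, and the function name `train` is likewise an assumption.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical signature: (training vector, target vector) -> loss value.
# A real train_step would run the coding and decoding models (steps 401 to 403)
# and adjust their weight parameters according to the loss (step 405).
TrainStep = Callable[[List[int], List[int]], float]


def train(training_vectors: Sequence[List[int]],
          train_step: TrainStep,
          max_iterations: int,
          loss_threshold: float) -> Tuple[int, float]:
    """Cycle through the training vectors until a convergence condition holds."""
    if not training_vectors:
        raise ValueError("at least one training vector is required")
    iteration = 0
    while True:
        for vector in training_vectors:
            target = vector[::-1]  # step 404: reverse-order vector as the target
            loss = train_step(vector, target)
            iteration += 1
            # Step 406: stop after enough iterations or once the loss is small.
            if iteration >= max_iterations or loss < loss_threshold:
                return iteration, loss


# Stand-in train_step that replays made-up loss values instead of training a model.
replayed = iter([0.9, 0.5, 0.05])
stopped_at = train([[12, 31, 56], [1, 2]], lambda v, t: next(replayed),
                   max_iterations=10, loss_threshold=0.1)
```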
The coding model and the decoding model may each include one or more neural network units. For example, a neural network unit may be a Recurrent Neural Network (RNN) unit or a Long Short-Term Memory (LSTM) unit, where LSTM is a type of recurrent neural network; a neural network unit may also be a bidirectional recurrent neural network, such as a Bidirectional Long Short-Term Memory (BiLSTM) network or a Bidirectional Gated Recurrent Unit (BiGRU).
Optionally, an Attention (Attention) mechanism may be further added to the above described codec model, that is, a codec model based on the Seq2Seq + Attention framework is adopted, and the training process is similar to the above steps 401 to 406, except that in step 403, when each element of the output vector of the codec model is output, the state vector corresponding to the element is introduced as one of the inputs. The specific structure of the codec model introducing the attention mechanism can refer to the related art, and is not described herein.
The feature vector of the applet obtained after step 104 is executed may be used to calculate the similarity with other applets. Specifically, the similarity between two applets may be the cosine of the angle between their feature vectors.
In order to identify applets similar to a known malicious applet, the cosine value between the feature vector of an unknown applet and the feature vector of the known malicious applet may be calculated. If the cosine value is close to 1 (the difference between the cosine value and 1 is smaller than a pre-specified threshold), the unknown applet is considered similar to the malicious applet and may itself be a malicious applet. Optionally, in another embodiment, given a library of applets each known to be malicious or not, the cosine value between the feature vector of each applet in the library and the feature vector of the unknown applet is calculated, the applets are sorted by cosine value, and the number of malicious applets among the n applets most similar to the unknown applet is determined; if that number exceeds a preset number, the unknown applet is determined to be a malicious applet.
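The cosine similarity and the library-based determination described above may be sketched as follows. The library contents, the two-dimensional vectors, and the function names are made up for illustration; real feature vectors would be output by the coding model.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_malicious(unknown: Sequence[float],
                 library: List[Tuple[Sequence[float], bool]],
                 n: int,
                 preset_number: int) -> bool:
    """Vote among the n library applets most similar to the unknown applet."""
    ranked = sorted(library, key=lambda item: cosine(unknown, item[0]), reverse=True)
    malicious_count = sum(1 for _, malicious in ranked[:n] if malicious)
    return malicious_count > preset_number


# Each entry: (feature vector, whether the applet is known to be malicious).
library = [([1.0, 0.0], True), ([0.9, 0.1], True),
           ([0.0, 1.0], False), ([0.1, 0.9], False)]
verdict = is_malicious([1.0, 0.05], library, n=2, preset_number=1)
```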
In the embodiments of the present application, characteristic character strings are extracted in sequence from one or more kinds of applet data, such as the applet package file structure, the applet static code file, and the applet dynamic operation data; a characteristic character string sequence of the applet is generated from the characteristic character strings; and the characteristic character string sequence is converted into a characteristic character string vector, which is input into the trained coding model to generate the feature vector of the applet. A machine-recognizable vector that accurately expresses the features of the applet can thus be generated, which solves the technical problem that the features of an applet cannot be expressed.
Further, an optional specific implementation of the method for determining an applet feature vector is also provided in the embodiments of the present application, as shown in fig. 2.
As shown in fig. 2, first, character strings are extracted for three types of program data (package file structure, static code file, dynamic run data) of the applet.
For the package file structure, a depth-first or breadth-first traversal is performed on the tree structure of the applet package file structure, and file name characteristic character strings are obtained in traversal order, each including a file name and a file type suffix. The file name characteristic character strings are combined to obtain the first-class sequence shown in fig. 2:
{′index.js′,′webview.js′,...,′index.axml′,′webview.axml′,...,′title.jpg′,′body.jpg′,...,′title.png′,′body.png′}
Optionally, the file name characteristic character strings may be stored by category according to their file type suffixes, for example divided into a directory sequence, a js file sequence, an axml file sequence, and a picture file sequence (picture files with a file type suffix of jpg, png, etc.). That is, within the first-class sequence, the file name characteristic character strings are divided into a plurality of subclass sequences according to file type.
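The traversal and the classification by file type suffix may be sketched as follows. The package is modeled as a nested mapping in which a directory maps to another mapping and a file maps to None; this representation, the sample tree, and the function names are assumptions for illustration, and directory names are not collected in this sketch.

```python
import os
from collections import defaultdict
from typing import Dict, List, Optional

Tree = Dict[str, Optional[dict]]


def walk_depth_first(tree: Tree) -> List[str]:
    """Collect file names in depth-first order."""
    names: List[str] = []
    for name, child in tree.items():
        if isinstance(child, dict):
            names.extend(walk_depth_first(child))  # descend into the directory
        else:
            names.append(name)
    return names


def group_by_suffix(names: List[str]) -> Dict[str, List[str]]:
    """Split the first-class sequence into subclass sequences by file type."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        groups[os.path.splitext(name)[1]].append(name)
    return dict(groups)


package = {
    "pages": {"index.js": None, "index.axml": None},
    "images": {"title.jpg": None},
    "app.js": None,
}
first_class_sequence = walk_depth_first(package)
subclass_sequences = group_by_suffix(first_class_sequence)
```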
For the static code files, the core code files, including the main entry configuration file and the home page display code file, are determined; code segments matching a preset regular expression (for example, one targeting httprequest) are extracted from the core code files; and the code segments are split into a plurality of characteristic character strings, which are combined to obtain the second-class sequence shown in fig. 2:
{′httprequest′,′url′,′success′,′if′,′setdata′,′display0′,′else′,′setdata′,′display1′,...}
The dynamic operation data includes the requests generated while the applet is running. The requests are matched according to target character strings (such as header, response, and the like) to obtain a plurality of request characteristic character strings, which are combined to obtain the third-class sequence shown in fig. 2:
{′header:www.zzgryy.cn′,′header:47.91.249.40′,...,′response:status′,′response:show′,′response:font′,...}
As shown in fig. 2, after the first-class sequence, the second-class sequence, and the third-class sequence of the applet are obtained, some of the character strings are selected from each class of sequence.
For each class of sequence, the TF-IDF score of each character string is determined, character strings with scores lower than a preset score are deleted, and character strings with scores higher than the preset score are retained. If fewer than 100 character strings are retained for a class of sequence, the missing positions are padded with non; if more than 100 are retained, only 100 are kept.
As shown in fig. 2, after each type of sequence selects a partial character string, the partial character strings are combined to obtain a characteristic character string sequence.
A preset numeric index code mapping table is acquired, which may be, for example:
{1:′index′,2:′webview′,3:′title′,4:′body′,5:′httprequest′,6:′url′,7:′success′,8:′if′,9:′setdata′,10:′display0′,...}
As shown in fig. 2, after the numeric index code corresponding to each character string in the characteristic character string sequence is determined according to the preset numeric index code mapping table, a characteristic character string vector may be obtained, for example:
{2,5,6,8,......0,0,0,s100,152,24,......0,0,0,s200,255,826,145,......0,0,0}
As shown in fig. 2, after the characteristic character string vector is obtained, it is input to the coding model, and the output is the feature vector of the applet.
The above steps shown in fig. 2 are the process of obtaining the feature vector of the applet. Optionally, when the coding model is being trained, the following steps shown in fig. 2 are further performed after the feature vector of an applet is obtained:
As shown in fig. 2, the feature vector is input to the decoding model, whose output is an output vector, and the loss value is calculated from the output vector and the characteristic character string vector (i.e., the training vector) input to the coding model.
After each round of training obtains the result, parameters in the coding model and the decoding model can be adjusted according to the loss value, and when the coding and decoding model is trained according to the characteristic character string vector of the next small program, the coding and decoding model after the parameters are adjusted is used.
The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Fig. 3 is a schematic structural diagram of an embodiment of an apparatus for determining an applet feature vector according to the embodiments of the present application. As shown in fig. 3, the apparatus for determining an applet feature vector may include: an extraction unit 10, a first generation unit 20, a conversion unit 30 and a second generation unit 40.
The extraction unit 10 is configured to sequentially extract a plurality of characteristic character strings from program data of an applet, where the program data includes at least one of the following types of program data: a package file structure of the applet, a static code file of the applet, and dynamic operation data of the applet; the first generation unit 20 is configured to generate a characteristic character string sequence of the applet from the plurality of characteristic character strings; the conversion unit 30 is configured to convert the characteristic character string sequence of the applet into a characteristic character string vector; and the second generation unit 40 is configured to input the characteristic character string vector into the trained coding model to generate the feature vector of the applet.
Optionally, the conversion unit is further configured to replace each characteristic character string in the characteristic character string sequence with a corresponding numeric index code according to a mapping relationship between a character string and a numeric index code in the preset index mapping table, so as to obtain a characteristic character string vector.
Alternatively, in the case where the program data includes a plurality of kinds, the first generation unit 20 includes: the first extraction module is used for extracting the characteristic character strings of which the number is not more than the preset number in the characteristic character strings corresponding to the program data of each type; and the combination module is used for combining the extracted characteristic character strings to obtain a characteristic character string sequence.
Optionally, the program data includes a package file structure of the applet, and the extracting unit 10 includes: and the second extraction module is used for extracting the file name and the file type suffix of each file according to the structure sequence of the package file structure so as to obtain a file name characteristic character string of each file, wherein each file name characteristic character string comprises the file name and the file type suffix of the corresponding file.
Optionally, the first generating unit 20 includes: the third extraction module is used for extracting the character string of the suffix of the target file type from the file name characteristic character string obtained according to the package file structure so as to obtain a characteristic character string corresponding to the package file structure; and the first generation module is used for generating a characteristic character string sequence according to the extracted character string.
Optionally, the program data includes a static code file of the applet, and the extraction unit 10 includes: a selection module, configured to select a plurality of target code files from the static code files of the applet; a matching module, configured to match a preset regular expression in each target code file, where the preset regular expression includes one or more target character strings and a matching rule for each target character string; and a splitting module, configured to split each hit code segment into a plurality of character strings to obtain a plurality of characteristic character strings.
Optionally, the program data includes dynamic operation data of the applet, and the extraction unit 10 includes: a running module, configured to run the applet; a grabbing module, configured to capture requests generated while the applet is running; a matching module, configured to match preset character strings in the request, where each preset character string is used to represent the name of one type of information carried in the request; and a splitting module, configured to split the hit request to obtain a plurality of characteristic character strings.
Optionally, the apparatus further comprises: a first determining unit, configured to determine, before the converting unit 30 obtains the feature string vector, a nonrepeating string that appears in the plurality of feature strings and does not appear in the preset index mapping table, so as to obtain an unknown string; the distribution unit is used for distributing non-repeated numerical index codes for each unknown character string; and the storage unit is used for storing the mapping relation between the unknown character string and the corresponding digital index code in the preset index mapping table so as to update the preset index mapping table.
Optionally, the first generating unit 20 includes: a calculation module, configured to calculate a term frequency-inverse document frequency (TF-IDF) index for each character string in the updated preset index mapping table, so as to obtain a score for each character string in the preset index mapping table; and a second generation module, configured to generate the characteristic character string sequence of the applet from the characteristic character strings whose scores exceed a preset score.
Optionally, the apparatus further comprises: a training unit, configured to train a coding and decoding model using a plurality of training vectors before the second generating unit 40 generates the feature vector of the applet, where each training vector is a characteristic character string vector of an applet, the coding and decoding model includes a coding model and a decoding model, the coding model is configured to code the training vectors to obtain feature vectors, the decoding model is configured to decode the feature vectors output by the coding model to obtain output vectors, and the optimization goal of training the coding and decoding model is to reduce the loss value calculated from the output vectors and the training vectors; and a second determining unit, configured to determine that the training convergence condition is reached, so as to obtain the trained coding model.
Optionally, the apparatus further comprises: a third determining unit, configured to determine, after the second generating unit 40 generates the feature vector of the applet, the similarity between the applet and other applets according to the feature vector of the applet and the feature vectors of the other applets.
Optionally, the apparatus further comprises: an acquisition unit, configured to acquire the preset labels of the other applets after the third determining unit determines the similarity between the applet and the other applets; and a fourth determining unit, configured to determine the label of the applet according to the preset labels of the other applets.
Optionally, the preset tag is used to mark whether the corresponding applet is a malicious applet.
The apparatus for determining an applet feature vector provided in the embodiment shown in fig. 3 may be used to implement the technical solution of the method embodiment shown in fig. 1 or fig. 2 of the present application; for its implementation principle and technical effect, reference may be made to the relevant description in the method embodiments.
Fig. 4 is a schematic structural diagram of an embodiment of an electronic device according to an embodiment of the present application, and as shown in fig. 4, the electronic device may include at least one processor; and at least one memory communicatively coupled to the processor, wherein: the memory stores program instructions executable by the processor, and the processor calls the program instructions to execute the method for determining the feature vector of the applet provided in the embodiments of fig. 1-2 of the present application.
FIG. 4 illustrates a block diagram of an exemplary electronic device suitable for use in implementing embodiments of the present application. It should be noted that the electronic device shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 4, the electronic device is in the form of a general purpose computing device. Components of the electronic device may include, but are not limited to: one or more processors 410, a memory 430, and a communication bus 440 that connects the various system components (including the memory 430 and the processors 410).
Communication bus 440 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. These architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus, to name a few.
Electronic devices typically include a variety of computer system readable media. Such media may be any available media that is accessible by the electronic device and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 430 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) and/or cache memory. The electronic device may further include other removable/non-removable, volatile/nonvolatile computer system storage media. Memory 430 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the application.
A program/utility having a set (at least one) of program modules may be stored in memory 430. Such program modules include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules generally perform the functions and/or methodologies of the embodiments described herein.
The processor 410 executes programs stored in the memory 430 to perform various functional applications and data processing, such as implementing the methods for determining the feature vectors of the applets provided in the embodiments of the present application illustrated in fig. 1-2.
Embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions, which cause a computer to execute a method for determining an applet feature vector provided in an embodiment shown in fig. 1 to fig. 2.
The non-transitory computer readable storage medium described above may employ any combination of one or more computer readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for embodiments of the present application may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
In the description of embodiments of the present application, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In the embodiments of the present application, the schematic representations of the terms described above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, those skilled in the art can combine the various embodiments or examples described in this application, and the features of those embodiments or examples, provided they do not conflict.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the embodiments of the present application, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present application.
The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.
It should be noted that the terminal according to the embodiments of the present application may include, but is not limited to, a personal computer (PC), a personal digital assistant (PDA), a wireless handheld device, a tablet computer, a mobile phone, an MP3 player, an MP4 player, and the like.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division of the units is only a logical division, and other divisions are possible in actual implementation. A plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in an electrical, mechanical, or other form.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the purpose of illustrating the preferred embodiments of the present application and is not intended to limit the present application, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present application shall be included in the scope of the present application.

Claims (16)

1. A method of determining an applet feature vector, wherein the method comprises:
extracting a plurality of characteristic character strings in sequence from program data of an applet, wherein the program data comprises a package file structure of the applet, a static code file of the applet and dynamic operation data of the applet;
generating a characteristic character string sequence of the applet from the plurality of characteristic character strings;
converting the characteristic character string sequence of the applet into a characteristic character string vector;
inputting the characteristic character string vector into a trained deep learning model to generate a characteristic vector of the applet, wherein the deep learning model is a coding model for coding in a coding and decoding model;
before inputting the feature string vector into a trained deep learning model to generate a feature vector of the applet, obtaining the codec model using the following training method:
training a coding and decoding model by using a plurality of training vectors, wherein each training vector is the characteristic character string vector of one applet, the coding and decoding model comprises a coding model and a decoding model, the coding model is used for coding the characteristic character string vector to obtain the characteristic vector, the decoding model is used for decoding the characteristic vector to obtain an output vector, and the optimization goal of training the coding and decoding model is to reduce the loss value calculated according to the output vector and the training vector;
and determining that a training convergence condition is reached to obtain the trained coding model.
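The training procedure recited in claim 1 can be illustrated with a minimal sketch. It assumes a linear encoder and decoder, a squared-error loss, full-batch gradient descent, and a fixed number of epochs standing in for the convergence condition; none of these choices is specified by the claim. The names `train_autoencoder`, `matvec`, `code_dim`, and the toy training vectors are hypothetical.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def train_autoencoder(vectors, code_dim, epochs=300, lr=0.05, seed=0):
    """Train a linear encoder/decoder pair to reconstruct the inputs.

    Returns the encoder weights and the mean loss recorded at each epoch.
    """
    rng = random.Random(seed)
    dim = len(vectors[0])
    enc = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(code_dim)]
    dec = [[rng.uniform(-0.5, 0.5) for _ in range(code_dim)] for _ in range(dim)]
    n = len(vectors)
    losses = []
    for _ in range(epochs):
        g_enc = [[0.0] * dim for _ in range(code_dim)]
        g_dec = [[0.0] * code_dim for _ in range(dim)]
        total = 0.0
        for x in vectors:
            z = matvec(enc, x)   # feature vector (encoder output)
            y = matvec(dec, z)   # output vector (decoder output)
            err = [yi - xi for yi, xi in zip(y, x)]
            total += sum(e * e for e in err)
            # Gradient of the squared error with respect to the decoder.
            for i in range(dim):
                for j in range(code_dim):
                    g_dec[i][j] += 2 * err[i] * z[j]
            # Back-propagate the error through the decoder to the encoder.
            back = [sum(dec[i][j] * err[i] for i in range(dim))
                    for j in range(code_dim)]
            for j in range(code_dim):
                for k in range(dim):
                    g_enc[j][k] += 2 * back[j] * x[k]
        losses.append(total / n)
        for i in range(dim):
            for j in range(code_dim):
                dec[i][j] -= lr * g_dec[i][j] / n
        for j in range(code_dim):
            for k in range(dim):
                enc[j][k] -= lr * g_enc[j][k] / n
    return enc, losses


# Hypothetical characteristic character string vectors, scaled to [0, 1].
training_vectors = [
    [0.1, 0.2, 0.3, 0.0],
    [0.1, 0.2, 0.4, 0.0],
    [0.5, 0.6, 0.0, 0.0],
    [0.5, 0.7, 0.0, 0.0],
]
encoder, losses = train_autoencoder(training_vectors, code_dim=2)
feature_vector = matvec(encoder, training_vectors[0])
```

After training, only the encoder is kept; applying it to a characteristic character string vector yields that applet's feature vector.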
2. The method of claim 1, wherein the converting the sequence of feature strings of the applet into a feature string vector comprises:
and replacing each characteristic character string in the characteristic character string sequence with a corresponding numerical index code according to the mapping relation between the character string and the numerical index code in a preset index mapping table to obtain the characteristic character string vector.
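As a minimal sketch of the replacement step in claim 2, the index mapping table can be modeled as a dictionary. The function name `to_index_vector`, the sample strings, and the fallback code `0` for strings missing from the table are assumptions made for illustration only.

```python
def to_index_vector(feature_strings, index_table, unknown_code=0):
    """Replace each characteristic character string with its numerical index code."""
    # unknown_code is an assumed fallback; the claim does not define one.
    return [index_table.get(s, unknown_code) for s in feature_strings]


# Hypothetical preset index mapping table and string sequence.
index_table = {"app.json": 1, "index.js": 2, "my.request": 3}
sequence = ["app.json", "my.request", "index.js", "my.request"]
vector = to_index_vector(sequence, index_table)
```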
3. The method of claim 1, wherein, in the case that the program data includes a plurality of categories, the generating a characteristic character string sequence of the applet from the plurality of characteristic character strings comprises:
respectively extracting feature character strings not exceeding a preset number from the feature character strings corresponding to each type of program data;
and combining the extracted characteristic character strings to obtain the characteristic character string sequence.
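The capping and combining steps of claim 3 might look like the following sketch. The category keys (`"package"`, `"static"`, `"dynamic"`), their order, and the choice to keep the first strings of each category are assumptions; the claim only requires that no more than a preset number be taken per category.

```python
def build_sequence(strings_by_category, max_per_category,
                   order=("package", "static", "dynamic")):
    """Take at most max_per_category strings from each category and concatenate."""
    sequence = []
    for category in order:
        sequence.extend(strings_by_category.get(category, [])[:max_per_category])
    return sequence


# Hypothetical extracted strings for one applet.
extracted = {
    "package": ["app.json", "index.js", "index.axml"],
    "static": ["my", "request", "url"],
    "dynamic": ["userId"],
}
combined = build_sequence(extracted, max_per_category=2)
```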
4. A method according to any of claims 1-3, wherein the program data comprises a package file structure of the applet, and said extracting a plurality of characteristic strings in order in the program data of the applet comprises:
and extracting the file name and the file type suffix of each file according to the structure sequence of the package file structure to obtain a file name characteristic character string of each file, wherein each file name characteristic character string comprises the file name and the file type suffix of the corresponding file.
5. The method of claim 4, wherein the generating a characteristic character string sequence of the applet from the plurality of characteristic character strings comprises:
extracting a character string of a suffix of a target file type from the file name characteristic character string obtained according to the package file structure to obtain a characteristic character string corresponding to the package file structure;
and generating the characteristic character string sequence according to the extracted character string.
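Claims 4 and 5 can be sketched together: walk the package file paths in structural order, record each file name with its type suffix, then keep only strings with a target suffix. The sketch assumes the package structure is already available as an ordered list of relative paths; the function names, sample paths, and target suffixes are hypothetical.

```python
import posixpath


def file_name_features(paths):
    """Return (file name, type suffix) pairs in the order the paths are given."""
    features = []
    for path in paths:
        stem, suffix = posixpath.splitext(posixpath.basename(path))
        features.append((stem, suffix.lstrip(".")))
    return features


def filter_by_suffix(features, target_suffixes):
    """Keep only the file name strings whose type suffix is a target suffix."""
    return [f"{stem}.{suffix}" for stem, suffix in features
            if suffix in target_suffixes]


# Hypothetical package file structure, listed in structural order.
package_paths = [
    "app.json",
    "pages/index/index.js",
    "pages/index/index.axml",
    "assets/logo.png",
]
name_features = file_name_features(package_paths)
package_strings = filter_by_suffix(name_features, {"js", "json"})
```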
6. A method according to any of claims 1-3, wherein the program data comprises a static code file of the applet, and said extracting a plurality of characteristic strings in sequence in the program data of the applet comprises:
selecting a plurality of target code files from the static code files of the applet;
matching a preset regular expression in each target code file, wherein the preset regular expression comprises one or more target character strings and a matching rule of each target character string;
and splitting each hit code segment into a plurality of character strings to obtain a plurality of characteristic character strings.
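A minimal sketch of the matching and splitting in claim 6, using Python's `re` module. The regular expression below (calls on a hypothetical `my.` API namespace) and the rule of splitting on non-word characters are assumptions; the claim leaves the target character strings and matching rules to a preset configuration.

```python
import re

# Assumed preset regular expression: a call such as my.request(...).
API_CALL = re.compile(r"\bmy\.\w+\([^)]*\)")


def static_code_features(source):
    """Split every code segment hit by the expression into character strings."""
    features = []
    for match in API_CALL.finditer(source):
        # Split the hit segment on runs of non-word characters, dropping empties.
        features.extend(t for t in re.split(r"\W+", match.group(0)) if t)
    return features


# Hypothetical content of one target code file.
code = "my.request({url: u}); var x = 1; my.getLocation();"
static_strings = static_code_features(code)
```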
7. A method according to any of claims 1-3, wherein the program data comprises dynamic run data of the applet, said extracting a plurality of characteristic strings in sequence in the program data of the applet comprising:
running the applet;
capturing a request generated in the running process of the applet;
matching preset character strings in the request, wherein each preset character string is used for representing the name of one type of information carried in the request;
and splitting the hit request to obtain the plurality of characteristic character strings.
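The dynamic-data steps of claim 7 could be sketched as follows, assuming the captured requests are available as URL strings and that the preset character strings are query parameter names. How requests are captured while the applet runs is outside this sketch; the function name, the sample URLs, and the splitting into host, path segments, and parameter names are all illustrative assumptions.

```python
from urllib.parse import parse_qsl, urlsplit


def dynamic_features(captured_requests, preset_names):
    """Split requests that carry at least one preset information name."""
    features = []
    for url in captured_requests:
        parts = urlsplit(url)
        params = parse_qsl(parts.query, keep_blank_values=True)
        if not any(name in preset_names for name, _ in params):
            continue  # the request is not a hit, so it is skipped
        features.append(parts.netloc)
        features.extend(seg for seg in parts.path.split("/") if seg)
        features.extend(name for name, _ in params)
    return features


# Hypothetical requests captured during one run of the applet.
requests_seen = [
    "https://api.example.com/v1/pay?userId=7&amount=3",
    "https://cdn.example.com/logo.png",
]
dynamic_strings = dynamic_features(requests_seen, {"userId"})
```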
8. The method according to claim 2 or 3, wherein before replacing each of the characteristic character strings in the characteristic character string sequence with the numeric index code of the corresponding character string according to the mapping relationship between the character string and the numeric index code in a preset index mapping table, the method further comprises:
determining unrepeated character strings which appear in the plurality of characteristic character strings and do not appear in the preset index mapping table to obtain unknown character strings;
distributing non-repeated numerical index codes for each unknown character string;
and storing the mapping relation between each unknown character string and its corresponding numerical index code in the preset index mapping table so as to update the preset index mapping table.
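The table update in claim 8 can be sketched with a dictionary. Assigning consecutive integers above the current maximum is one assumed way of guaranteeing non-repeated codes; the function name and sample data are hypothetical.

```python
def update_index_table(index_table, feature_strings):
    """Add unknown strings to the table, giving each a new unique code."""
    next_code = max(index_table.values(), default=0) + 1
    # dict.fromkeys removes repeated strings while keeping first-seen order.
    for s in dict.fromkeys(feature_strings):
        if s not in index_table:
            index_table[s] = next_code
            next_code += 1
    return index_table


table = {"app.json": 1}
update_index_table(table, ["app.json", "my.request", "my.request", "index.js"])
```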
9. The method of claim 8, wherein the generating a characteristic character string sequence of the applet from the plurality of characteristic character strings comprises:
calculating a term frequency-inverse document frequency (TF-IDF) index for each character string in the updated preset index mapping table to obtain a score of each character string in the preset index mapping table;
and generating a characteristic character string sequence of the applet according to the characteristic character string with the score exceeding a preset score in the plurality of characteristic character strings.
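One possible reading of the scoring in claim 9 is sketched below. It assumes that each applet's characteristic character strings form one document, that the corpus is the collection of such documents, and that the unsmoothed formula tf × log(N / df) is used; the claim does not fix any of these details. The function names and the sample corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(doc, corpus):
    """Score each distinct string in doc by term frequency times inverse document frequency."""
    counts = Counter(doc)
    n_docs = len(corpus)
    scores = {}
    for term, count in counts.items():
        df = sum(1 for other in corpus if term in other)
        # max(df, 1) guards against a doc that is not itself part of the corpus.
        scores[term] = (count / len(doc)) * math.log(n_docs / max(df, 1))
    return scores


def select_by_score(doc, corpus, min_score):
    """Keep, in order, the strings of doc whose score exceeds min_score."""
    scores = tfidf_scores(doc, corpus)
    return [s for s in doc if scores[s] > min_score]


# Hypothetical corpus: one list of characteristic character strings per applet.
corpus = [["a", "b", "a"], ["a", "c"], ["a", "d"]]
kept = select_by_score(corpus[0], corpus, min_score=0.1)
```

In this toy corpus the string `"a"` appears in every applet, so its inverse document frequency is zero and it is filtered out, while the rarer `"b"` is retained.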
10. The method of any of claims 1-3, wherein after inputting the feature string vector into a trained deep learning model to generate a feature vector for the applet, the method further comprises:
and determining the similarity between the applet and other applets according to the feature vector of the applet and the feature vectors of the other applets.
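Claim 10 does not name a similarity measure. As an illustration only, the sketch below assumes cosine similarity between feature vectors; the function name and sample vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_u * norm_v)


similar = cosine_similarity([1.0, 2.0], [2.0, 4.0])
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```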
11. The method of claim 10, wherein after determining the similarity of the applet to other applets, the method further comprises:
acquiring preset labels of other applets;
and determining the label of the applet according to the preset labels of the other applets.
12. The method of claim 11, wherein the preset tag is used to mark whether the corresponding applet is a malicious applet.
13. The method of any of claims 1-3, wherein after inputting the feature string vector into a trained deep learning model to generate a feature vector for the applet, the method further comprises:
determining a plurality of similarities between the applet and the other applets respectively according to the feature vector of the applet and the feature vectors of the other applets;
ranking the plurality of other applets based on the plurality of similarities;
and determining whether the applet is a malicious applet according to the number of malicious applets among a preset number of the top-ranked other applets.
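The ranking and counting in claim 13 resemble a nearest-neighbor vote and can be sketched as follows. The sketch assumes the similarities have already been computed, that the other applets are ranked from most to least similar, and that the applet is flagged when the malicious count among the top entries reaches a threshold; `top_k`, `min_malicious`, and the sample data are hypothetical parameters, not values from the claim.

```python
def is_malicious_by_neighbors(neighbors, top_k, min_malicious):
    """neighbors: list of (similarity, is_malicious) pairs for the other applets."""
    ranked = sorted(neighbors, key=lambda item: item[0], reverse=True)
    hits = sum(1 for _, malicious in ranked[:top_k] if malicious)
    return hits >= min_malicious


# Hypothetical similarities to four labeled applets.
others = [(0.9, True), (0.8, True), (0.2, False), (0.85, False)]
flagged = is_malicious_by_neighbors(others, top_k=3, min_malicious=2)
```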
14. An apparatus for determining an applet feature vector, wherein the apparatus comprises:
an extraction unit configured to extract a plurality of characteristic character strings in sequence from program data of the applet, wherein the program data comprises a package file structure of the applet, a static code file of the applet and dynamic operation data of the applet;
a first generation unit configured to generate a characteristic character string sequence of the applet from the plurality of characteristic character strings;
the conversion unit is used for converting the characteristic character string sequence of the applet into a characteristic character string vector;
a second generating unit, configured to input the feature string vector into a trained deep learning model to generate a feature vector of the applet, where the deep learning model is a coding model used for coding in a coding and decoding model;
a training unit configured to train a coding and decoding model by using a plurality of training vectors before the second generating unit generates the feature vector of the applet, wherein each training vector is the feature character string vector of one applet, the coding and decoding model comprises a coding model and a decoding model, the coding model is used for coding the training vectors to obtain feature vectors, the decoding model is used for decoding the feature vectors output by the coding model to obtain output vectors, and the optimization goal of training the coding and decoding model is to reduce loss values calculated according to the output vectors and the training vectors;
and a second determining unit configured to determine that the training convergence condition is reached so as to obtain the trained coding model.
15. An electronic device, comprising:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1 to 13.
16. A non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer instructions that cause the computer to perform the method of any of claims 1-13.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110926708.XA CN113656763B (en) 2020-04-24 2020-04-24 Method and device for determining feature vector of applet and electronic equipment

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110926708.XA CN113656763B (en) 2020-04-24 2020-04-24 Method and device for determining feature vector of applet and electronic equipment
CN202010334290.9A CN111241496B (en) 2020-04-24 2020-04-24 Method and device for determining small program feature vector and electronic equipment

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202010334290.9A Division CN111241496B (en) 2020-04-24 2020-04-24 Method and device for determining small program feature vector and electronic equipment

Publications (2)

Publication Number Publication Date
CN113656763A (en) 2021-11-16
CN113656763B (en) 2024-01-09

Family

ID=70867606

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202110926708.XA Active CN113656763B (en) 2020-04-24 2020-04-24 Method and device for determining feature vector of applet and electronic equipment
CN202010334290.9A Active CN111241496B (en) 2020-04-24 2020-04-24 Method and device for determining small program feature vector and electronic equipment

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202010334290.9A Active CN111241496B (en) 2020-04-24 2020-04-24 Method and device for determining small program feature vector and electronic equipment

Country Status (1)

Country Link
CN (2) CN113656763B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114860673A (en) * 2022-07-06 2022-08-05 南京聚铭网络科技有限公司 Log feature identification method and device based on dynamic and static combination

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783095A (en) * 2020-07-28 2020-10-16 支付宝(杭州)信息技术有限公司 Method and device for identifying malicious code of applet and electronic equipment
CN113064627B (en) * 2021-03-23 2023-04-07 支付宝(杭州)信息技术有限公司 Service access data processing method, platform, terminal, equipment and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180096144A1 (en) * 2015-11-17 2018-04-05 Wuhan Antiy Information Technology Co., Ltd. Method, system, and device for inferring malicious code rule based on deep learning method
CN109299273A (en) * 2018-11-02 2019-02-01 广州语义科技有限公司 Based on the multi-source multi-tag file classification method and its system for improving seq2seq model
CN110489555A (en) * 2019-08-21 2019-11-22 创新工场(广州)人工智能研究有限公司 A kind of language model pre-training method of combination class word information

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885995A (en) * 2017-10-09 2018-04-06 阿里巴巴集团控股有限公司 The security sweep method, apparatus and electronic equipment of small routine
CN108959924A (en) * 2018-06-12 2018-12-07 浙江工业大学 A kind of Android malicious code detecting method of word-based vector sum deep neural network
CN110858288A (en) * 2018-08-24 2020-03-03 ***通信集团浙江有限公司 Abnormal behavior identification method and device
CN110059468B (en) * 2019-04-02 2023-09-26 创新先进技术有限公司 Applet risk identification method and device
CN110119621B (en) * 2019-05-05 2020-08-21 网御安全技术(深圳)有限公司 Attack defense method, system and defense device for abnormal system call
CN110414238A (en) * 2019-06-18 2019-11-05 中国科学院信息工程研究所 The search method and device of homologous binary code
CN110348214B (en) * 2019-07-16 2021-06-08 电子科技大学 Method and system for detecting malicious codes

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张东等: "基于机器学习算法的主机恶意代码检测技术研究", 网络与信息安全学报, vol. 3, no. 7, pages 25 - 32 *

Also Published As

Publication number Publication date
CN111241496A (en) 2020-06-05
CN111241496B (en) 2021-06-29
CN113656763B (en) 2024-01-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20230105
Address after: 200120 Floor 15, No. 447, Nanquan North Road, China (Shanghai) Pilot Free Trade Zone, Pudong New Area, Shanghai
Applicant after: Alipay.com Co.,Ltd.
Address before: 310000 801-11 section B, 8th floor, 556 Xixi Road, Xihu District, Hangzhou City, Zhejiang Province
Applicant before: Alipay (Hangzhou) Information Technology Co.,Ltd.
GR01 Patent grant