CN108171073A

CN108171073A - A private data identification method driven by code-level semantic analysis

Info

Publication number: CN108171073A
Application number: CN201711277112.1A
Authority: CN
Inventors: 杨珉; 杨哲慜; 南雨宏; 张源; 朱东来
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2017-12-06
Filing date: 2017-12-06
Publication date: 2018-06-15
Anticipated expiration: 2037-12-06
Also published as: CN108171073B

Abstract

The invention belongs to the technical field of program information security detection and specifically concerns a private data identification method driven by code-level semantic analysis. The method comprises two parts. (1) Privacy-related semantic analysis and code snippet location based on natural language processing: string constants and identifiers are extracted from the code and preprocessed, the semantic information they carry is matched against a predefined privacy dictionary, and the part-of-speech tags and the dependency relations between the words of the phrase are used to judge whether the string denotes specific private data. (2) Privacy-related code snippet identification based on machine learning: a support vector machine model uses features of how code handles private data to judge whether a given piece of code contains the private data of interest. Private data identified in this way is marked as a sensitive data source, which reduces the risk of leaking users' private data.

Description

A private data identification method driven by code-level semantic analysis

Technical field

The invention belongs to the technical field of program information security monitoring, and specifically relates to a private data identification method.

Background art

Traditional automated privacy leak detection focuses only on private data controlled by the system. For geographic location information, for example, it can only designate a single API (such as getLastKnownLocation()) as the private data source and then use information flow analysis to determine whether that data reaches a specific sink (such as a network interface), and hence whether a privacy leak occurs. With the rapid development of mobile applications, traditional private data sources no longer cover the many new kinds of private data that applications contain. Besides system-controlled privacy, each application has private data tied to its own functionality, such as user account data, bank card data, and sensitive history records. These data have no direct relationship with the system permission model and are referred to in this invention as non-system-controlled private data.

Traditional information flow analysis tools have difficulty identifying such non-system-controlled private data directly. Unlike traditional privacy sources, non-system-controlled private data often originates somewhere other than the device itself, so it cannot be marked directly and uniformly from the code. For example, much private data comes from user input: during registration or login, the user passes private data into the program through EditText.getText(). If the traditional source identification approach marked the getText() API as a private data source, it would inevitably produce many false positives, because much of the data obtained from the interface does not contain user privacy (for example, an entered commodity quantity). In addition, application-related private data often comes from the application's own cloud server: after the user logs in with an account, the application fetches the user's private data from the server through HTTP requests, caches it in the application, and later uses it in different scenarios. For this case there is as yet no method that automatically marks which data coming from the server is user privacy.

Summary of the invention

The object of the invention is to provide a new private data identification method driven by code-level semantic analysis, suitable for automatically identifying, at large scale, the non-system-controlled private data contained in application code.

The private data identification method proposed by the invention comprises two parts: the first is privacy-related semantic analysis and code snippet location based on natural language processing, and the second is privacy-related code snippet identification based on machine learning. In the first part, the string constants and identifiers in the code are extracted. After a series of preprocessing steps, the semantic information in each string is matched against a predefined privacy dictionary, and the part-of-speech tags (POS tagging) and the dependency relations between the words of the phrase are used to judge whether the string denotes specific private data. In the second part, a support vector machine model uses features of how code handles private data to judge whether the given code contains the private data of interest. Semantic information and code structure features complement each other, and combining them yields the identification of private data. Private data identified in this way can be marked as a sensitive data source, which provides a basis for monitoring and protecting such data and reduces the risk of leaking users' private data.

The overall design of the invention is shown in Fig. 1. The two parts are described in detail below.

Part 1. Privacy-related semantic analysis and code snippet location based on natural language processing. The detailed process is as follows:

(1) Define privacy information: the invention first defines a set of privacy-related keywords and uses their occurrence in a text as a preliminary test of whether the text is privacy-related. The keyword set is compiled by manual screening. For example, it consists of the privacy-related keywords given in Google's privacy policy documents, near-synonyms of those keywords, and words that were extracted from 10,000 applications in the Google application market and are highly similar to those keywords.

The invention obtains 121 keywords in this way, divided into four types: User Attributes, User Identifiers, Location, and Account. Table 1 lists some representative privacy-related keywords. The vocabulary can also be configured dynamically: adding specific keywords allows the method to identify new kinds of private data in the future.

(2) Extract semantic information: developers often write rich semantic information into string constants, function and method names, and variable names in application code. This information is an effective clue, so semantic analysis can be used to find private data that may be present in the code. Based on this observation, the invention extracts string constants, function names, and variable names from the decompiled application code (names that are not obfuscated, such as global static variable names). The extracted semantic information is then preprocessed: non-letter characters (such as digits and underscore separators) are removed, and the text is split into multiple tokens at common separators and capital letters. For example, "user_addr" is split into "user" and "addr", and "GetUserPhoneNumber" is split into the individual strings "get", "user", "phone", and "number".
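The preprocessing in step (2) can be sketched as follows. This is a minimal illustration rather than the implementation used by the invention; the function name `split_identifier` and the exact regular expressions are assumptions chosen to reproduce the two examples given above.

```python
import re
from typing import List


def split_identifier(text: str) -> List[str]:
    """Split an identifier or string constant into lowercase word tokens."""
    # Replace every run of non-letter characters (digits, underscores, dots) with a space.
    letters_only = re.sub(r"[^A-Za-z]+", " ", text)
    # Insert a space at each lowercase-to-uppercase boundary: "getUser" -> "get User".
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", letters_only)
    # Separate an acronym from a following capitalised word: "HTTPRequest" -> "HTTP Request".
    spaced = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", " ", spaced)
    return [token.lower() for token in spaced.split()]


print(split_identifier("user_addr"))           # ['user', 'addr']
print(split_identifier("GetUserPhoneNumber"))  # ['get', 'user', 'phone', 'number']
```

The acronym rule is an addition not mentioned in the text; it keeps names such as "HTTPRequest" from collapsing into a single token.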

(3) Locate privacy-related semantic information: after the extracted semantic information has been preprocessed, the invention uses natural language processing to make a preliminary judgment of whether it is privacy-related. Keyword filtering, part-of-speech filtering, and dependency filtering are applied in turn, each step refining the result of the semantic analysis.

(3.1) Keyword filtering: to make a preliminary judgment of whether the extracted semantic information is privacy-related, the invention applies a keyword matching algorithm with the privacy-related keywords described above. The algorithm checks whether there is a keyword all of whose characters are present in the text being processed. If there is, the text is considered privacy-related and the keyword is returned. The pseudocode of the matching algorithm is given in the annex.

Keyword matching alone, however, cannot identify privacy-related data accurately, mainly because many string constants contain a privacy-related keyword without actually indicating that private data is present. For example, developers often record program state in log messages: the string "Mobihelp.setUserEmail() requires a valid email" contains "email" but does not involve the user's email data. Many other kinds of strings also interfere seriously with the judgment. String constants containing URLs or paths are one example: "mobile" is contained in "com/ironsource/mobilcore/MobileCoreReport". To reduce such false positives, semantic information identified as privacy-related in this step is analyzed further to determine whether it is really privacy-related.
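The keyword filter and its false positives can be illustrated with a short sketch. The annex pseudocode is not reproduced in this text, so the case-insensitive substring match below is an assumption, and `PRIVACY_KEYWORDS` is a hypothetical excerpt rather than the 121-word dictionary of the invention.

```python
from typing import Iterable, Optional

# Hypothetical excerpt of a privacy dictionary, for illustration only.
PRIVACY_KEYWORDS = ("email", "password", "phone", "mobile", "address", "username")


def match_keyword(text: str, keywords: Iterable[str] = PRIVACY_KEYWORDS) -> Optional[str]:
    """Return the first keyword found in the text, or None if there is none."""
    lowered = text.lower()
    for keyword in keywords:
        # Assumption: a keyword matches when it occurs as a substring of the text.
        if keyword in lowered:
            return keyword
    return None


print(match_keyword("user_email"))
# The next two strings match as well, although neither denotes private data.
print(match_keyword("Mobihelp.setUserEmail() requires a valid email"))
print(match_keyword("com/ironsource/mobilcore/MobileCoreReport"))
```

The last two calls return "email" and "mobile" respectively, which is exactly the kind of match that the later part-of-speech and dependency filters are meant to remove.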

(3.2) Part-of-speech filtering: part-of-speech tagging indicates which part of speech a given keyword has in the current sentence, such as noun or verb. In the analysis of the invention, a privacy-related word of interest must be a noun. For example, "Address" denoting a geographic or mail address should be tagged as a noun (NN), whereas the verb "Address" in "Address this issue" does not meet the filter condition. When the keyword in a sentence is tagged as a noun, the sentence proceeds to dependency analysis.

(3.3) Dependency filtering: dependency relations express the structural relations between the words that make up a sentence. For a phrase or sentence, examining the dependency relations between the sensitive word and the other words shows whether the privacy-related word of interest is the focus of the sentence. The invention uses the following dependency relations as filter conditions:

(3.3.1) Direct object relation (Dobj): if the keyword takes part in a direct object relation in the analyzed phrase and is a noun (NN), it meets the expectation, as in "get email". The descriptions numbered 1, 2, and 3 in Table 2 are further examples of the direct object relation.

(3.3.2) Nominal subject relation (Nsubj): if the keyword has no direct object relation with its context in the analyzed phrase but is depended upon by other words, it also meets the expectation. This relation tends to occur in phrase fragments that do not form a complete sentence, such as "business phone number selected".

(3.3.3) Negation modifier relation (Neg): if the keyword is modified by a negation word in the analyzed phrase, the keyword is considered unrelated to privacy information, as in "Do not input your password here".

(3.3.4) Other dependency relations: if the keyword appears only in other dependency relations, such as a compound relation, it merely plays a supporting, descriptive role and is not the subject of the phrase. In such cases the privacy-related keyword is usually not the main idea of the sentence. The descriptions numbered 1, 2, 3, and 6 in Table 2 contain examples of the compound relation.
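The rules of steps (3.2) and (3.3) can be sketched as a single predicate. The sketch assumes that a parser has already produced a part-of-speech tag for the keyword and a list of dependency triples; the `Dependency` type, the function name, and the hand-written triples in the usage lines are illustrative assumptions, not actual parser output. Treating a negation on the word that governs the keyword as a negation of the keyword is also an assumption, made so that the "Do not input your password here" example is rejected.

```python
from typing import List, NamedTuple


class Dependency(NamedTuple):
    relation: str   # e.g. "dobj", "nsubj", "neg", "compound"
    governor: str   # head word of the relation
    dependent: str  # word that depends on the head


def passes_dependency_filter(keyword: str, pos_tag: str, deps: List[Dependency]) -> bool:
    """Apply the noun, negation, direct-object and nominal-subject rules."""
    # Step 3.2: the keyword must be tagged as a noun (NN, NNS, NNP, ...).
    if not pos_tag.startswith("NN"):
        return False
    # Words that carry a negation modifier.
    negated = {d.governor for d in deps if d.relation == "neg"}
    # Words that directly govern the keyword.
    heads = {d.governor for d in deps if d.dependent == keyword}
    # Step 3.3.3: reject when the keyword, or the word governing it, is negated.
    if keyword in negated or heads & negated:
        return False
    # Steps 3.3.1 and 3.3.2: accept direct-object and nominal-subject relations.
    # Step 3.3.4: anything else (for example compound only) is rejected.
    return any(d.dependent == keyword and d.relation in ("dobj", "nsubj") for d in deps)


# "get email": the keyword is the direct object of "get".
print(passes_dependency_filter("email", "NN", [Dependency("dobj", "get", "email")]))
# "Do not input your password here": the governing verb is negated.
print(passes_dependency_filter("password", "NN", [
    Dependency("neg", "input", "not"),
    Dependency("dobj", "input", "password"),
]))
```

The first call returns True and the second returns False.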

The three steps above complete the semantics-based location of privacy-related code snippets. Next, the invention uses machine learning: it extracts features of how the code uses private data and judges from them whether the given code contains the private data of interest.

Part 2. Privacy-related data identification based on code features. The detailed process is as follows:

After the filtering by privacy-related semantic recognition based on natural language processing, the invention determines whether a privacy-related code snippet really contains private data by combining program analysis with machine learning: a support vector machine (SVM) performs the identification using code features extracted by program analysis. Specifically, information flow analysis first finds the function call statements into which the constant strings or variable names confirmed as privacy-related by semantic analysis flow. Machine learning then judges whether each of these function call statements contains privacy information. When a function execution statement (a line of code) is identified as privacy-related, the variables that store data in that line (arguments or return value) contain private data.

Feature selection: the invention chooses the following five categories of features as the model vector for identifying code related to private data:

Feature 1: Function name. For API functions whose names are not obfuscated, the full function name carries rich semantic information about what the function does. For example, functions that operate on data frequently include verbs such as set and get to indicate storing or reading data. The function name can therefore help identify users' private data. The five verbs commonly found in code, set/get/put/add/insert, together with the corresponding privacy item phrases, are chosen as feature dimensions.

Feature 2: Function parameter types. Parameter types tend to reflect how private data is used, because functions that handle privacy are often passed parameters of particular types. For example, many operations that save users' private data take a String argument, as in SaveUserAccount(String userAccount). Conversely, some parameter types indicate that the function probably does not involve the user's private data: starting a new thread takes a parameter of type Thread, and opening an activity takes an Intent. Different parameter types, and combinations of them, can therefore reflect whether a function involves users' private data.

Feature 3: Function return type. The return value also reflects how private data is used. APIs that obtain users' private data often return it as a String. APIs that store or send users' private data are likely to return a Boolean indicating whether the operation succeeded. If a function returns some other type unrelated to data, however, it probably does not involve privacy-related data.

Feature 4: Type of the variable the function is invoked on. The base class of a function call also reflects data usage. For an invoke statement, if the base class is one of certain data structures, the line is more likely to be using users' private data. For example, HashMap.get() indicates that an item of data is being fetched from a container such as a HashMap. By contrast, Exception.getException() obtains exception information and is unrelated to user data.

Feature 5: Function argument value types. In static code a function argument value is one of two types, a string constant or a string variable. Because privacy-related data usually carries a semantically related text label, that label often appears as a string constant among the argument values of the called function. For example, in HashMap.set("username", $r1) the argument values form a key-value pair consisting of a string constant (StringConstant) and a string variable (StringVariable). The ordering of argument value types also tends to reflect how private data is used. In most cases the string constant comes before the variable: saveInstance(useraccount, "username", $user) indicates that the user name is stored in useraccount. By contrast, HandleException($exception, "email") probably only reports an error in email-related logic, and the current code does not contain real email data.
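The five feature categories can be collected into one record per function call statement, roughly as sketched below. The `CallSite` structure, its field names, and the string encodings are hypothetical; the invention specifies which kinds of features are used, not how they are encoded.

```python
from dataclasses import dataclass
from typing import Dict, List

# The five verbs named under Feature 1.
VERBS = ("set", "get", "put", "add", "insert")


@dataclass
class CallSite:
    """Hypothetical summary of one function call statement."""
    function_name: str
    param_types: List[str]
    return_type: str
    base_type: str        # type of the object the function is invoked on
    arg_kinds: List[str]  # "const" for a string constant, "var" for a variable


def extract_features(call: CallSite) -> Dict[str, str]:
    """Map a call site to the five feature categories as categorical values."""
    name = call.function_name.lower()
    # Feature 1: leading verb of the function name, or "other".
    verb = next((v for v in VERBS if name.startswith(v)), "other")
    return {
        "name_verb": verb,
        "param_types": ",".join(call.param_types),  # Feature 2
        "return_type": call.return_type,            # Feature 3
        "base_type": call.base_type,                # Feature 4
        "arg_kinds": ",".join(call.arg_kinds),      # Feature 5, order preserved
    }


call = CallSite("put", ["Object", "Object"], "Object", "HashMap", ["const", "var"])
print(extract_features(call))
```

Dictionaries of this shape can be one-hot encoded before being passed to a classifier, for example with scikit-learn's `DictVectorizer`.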

Training set: because the invention uses a supervised machine learning classifier, a certain amount of training data must be provided to train it. Specifically, the training set is built from the code obtained after the "privacy-related semantic analysis and code snippet location" step. Security experts randomly select a number of these code units and manually label whether each privacy-related function statement really contains private data. For the training set to have sufficient coverage, the total number of samples should exceed one thousand. The training set should also keep the numbers of positive and negative samples (code that contains private data and code that does not) roughly balanced overall, so that the classifier reaches its best accuracy.
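One simple way to obtain a balanced training set is random undersampling of the larger class, sketched below. The text does not state how the balance is achieved, so this technique, the function name `balance_samples`, and the fixed random seed are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def balance_samples(positives: Sequence[str], negatives: Sequence[str],
                    seed: int = 0) -> List[Tuple[str, int]]:
    """Return equally many positive (label 1) and negative (label 0) samples."""
    rng = random.Random(seed)
    # Use as many samples per class as the smaller class provides.
    size = min(len(positives), len(negatives))
    chosen = [(sample, 1) for sample in rng.sample(list(positives), size)]
    chosen += [(sample, 0) for sample in rng.sample(list(negatives), size)]
    # Shuffle so that the two classes are interleaved.
    rng.shuffle(chosen)
    return chosen


training_set = balance_samples(["p1", "p2", "p3"], ["n1", "n2", "n3", "n4", "n5"])
print(len(training_set))  # 6
```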

Classifier selection: for a data set with well-designed feature vectors and properly labeled training samples, the performance of different classifiers does not differ greatly. The invention currently selects a support vector machine (SVM) as its classifier. The invention equally supports using any other classifier together with the program code features extracted above to classify and identify private data, so as to obtain the best classification result under different usage scenarios and classification algorithms.

Once the classifier has been trained, for any code snippet of a given program (a particular line of code) that has passed semantic analysis, the invention extracts the code features described above and uses the classifier to judge whether the snippet really contains private data.

The beneficial effects of the invention are as follows. The invention proposes a new angle and method of analysis for identifying users' private data in program code. Specifically, it uses natural language processing to recognize the semantic information in the code and locate privacy-related code snippets, and it combines code structure features with machine learning to judge whether a snippet really contains private data. Compared with the traditional approaches of directly marking fixed system APIs as private data sources and of analyzing interface information to determine the private data entered by users, the invention is more general and can identify more private data that earlier methods cannot cover, such as privacy that comes from a remote server and never appears in the interface.

Description of the drawings

Fig. 1: Overall system architecture.

Detailed description of the embodiments

The invention designs and implements the new private data identification method described above, which combines natural language processing with machine learning. This section describes the implementation of the method in detail.

Part 1. Privacy-related semantic analysis and code snippet location based on natural language processing

The invention analyzes applications on top of the FlowDroid tool. FlowDroid is a mature static analysis tool for Android applications built on the Soot framework. FlowDroid is used to decompile the application and obtain the intermediate representation of its code (files in Jimple format). The invention then extracts string constants, method names, and variable names from the decompiled Jimple code as the sources of semantic information to analyze. For string constants, the invention additionally uses intra-procedural information flow analysis to propagate these constant labels to the variables they may reach.
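The propagation of constant labels to variables can be illustrated with a small fixed-point computation. This is a flow-insensitive simplification over hypothetical (target, source) assignment pairs. It does not use the FlowDroid or Soot APIs, and the function name `propagate_labels` is an assumption.

```python
from typing import Dict, List, Set, Tuple


def propagate_labels(assignments: List[Tuple[str, str]],
                     seeds: Dict[str, str]) -> Dict[str, Set[str]]:
    """Propagate privacy labels along assignments until nothing changes."""
    # Each seed variable starts with the label of the constant assigned to it.
    labels: Dict[str, Set[str]] = {var: {label} for var, label in seeds.items()}
    changed = True
    while changed:
        changed = False
        for target, source in assignments:
            incoming = labels.get(source, set())
            current = labels.setdefault(target, set())
            # Add any label of the source that the target does not have yet.
            if not incoming <= current:
                current |= incoming
                changed = True
    return labels


# $r0 holds the constant "username"; $r1 = $r0; $r2 = $r1.
result = propagate_labels([("$r1", "$r0"), ("$r2", "$r1")], {"$r0": "username"})
print(result["$r2"])  # {'username'}
```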

For the extracted constant strings, the invention performs natural language processing with the Java-based Stanford Parser. Stanford Parser is a widely used syntactic parsing tool: for a given sentence it parses the structure, assigns a part-of-speech tag to each token, and provides several ways of expressing the dependency relations between the tokens of the sentence. It is therefore chosen to carry out the part-of-speech analysis and the dependency analysis.

Part 2. Privacy-related data identification based on code features

The invention performs static analysis on the intermediate representation of the code decompiled by FlowDroid to extract the five required categories of features, and trains the classifier with the Scikit-learn toolkit, which is implemented in Python. To train the classifier, security experts randomly selected and manually labeled function call statements judged privacy-related from 100 popular applications in the Google app store. To balance the numbers of positive and negative training samples so that the classifier reaches its best accuracy, 2163 positive samples containing private data and an equal number of negative samples not containing private data were chosen, 4326 training samples in total, as the training set of the method.

Table 1


Table 2


Annex: Privacy-related keyword matching algorithm

Claims

1. A private data identification method driven by code-level semantic analysis, characterized in that it is divided into two parts: the first is privacy-related semantic analysis and code snippet location based on natural language processing, and the second is privacy-related code snippet identification based on machine learning;

(1) Define privacy information: a set of privacy-related keywords is defined first, and whether these keywords occur in a text is used as a preliminary test of whether the text is privacy-related; the keyword set is compiled by manual screening;

(2) Extract semantic information: string constants, function names, and variable names are extracted from the decompiled application code; the extracted semantic information is then preprocessed, which includes removing non-letter characters and splitting the text into multiple tokens at common separators and capital letters;

(3) Locate privacy-related semantic information: natural language processing is used to make a preliminary judgment of whether the semantic information is privacy-related; keyword filtering, part-of-speech filtering, and dependency filtering are applied in turn, each step refining the result of the semantic analysis:

(3.1) Keyword filtering: a keyword matching algorithm is applied with the privacy-related keywords to make a preliminary judgment of whether the semantic information is privacy-related; the keyword matching algorithm checks whether there is a keyword all of whose characters are present in the text being processed, and if there is, the text is considered privacy-related and the keyword is returned;

(3.2) Part-of-speech filtering: part-of-speech tagging indicates which part of speech a given keyword has in the current sentence; in the analysis, the part of speech of a privacy-related word of interest is noun, and when the keyword in a sentence is tagged as a noun, the sentence proceeds to dependency analysis;

(3.3) Dependency filtering: dependency relations express the structural relations between the words that make up a sentence; for a phrase or sentence, the dependency relations between the sensitive word and the other words are examined to determine whether the privacy-related word of interest is the focus of the sentence; the following dependency relations are used as filter conditions:

(3.3.1) Direct object relation: if the keyword takes part in a direct object relation in the analyzed phrase and is a noun, it meets the expectation;

(3.3.2) Nominal subject relation: if the keyword has no direct object relation with its context in the analyzed phrase but is depended upon by other words, it also meets the expectation;

(3.3.3) Negation modifier relation: if the keyword is modified by a negation word in the analyzed phrase, the keyword is considered unrelated to privacy information;

(3.3.4) Other dependency relations: if the keyword appears only in other dependency relations, it merely plays a supporting, descriptive role in the sentence and is not the subject of the phrase;

first, information flow analysis finds the function call statements into which the constant strings or variable names confirmed as privacy-related by semantic analysis flow; then machine learning judges whether each function call statement contains privacy information; if a function execution statement is identified as privacy-related, the variables that store data in that code contain private data.

2. The private data identification method driven by code-level semantic analysis according to claim 1, characterized in that, in the second part, the following five categories of features are chosen as the model vector for identifying code related to private data: function name, function parameter types, function return type, type of the variable the function is invoked on, and function argument value types;

the training set for the machine learning is built from the code obtained after the "privacy-related semantic analysis and code snippet location" step; security experts randomly select a number of these code units and manually label whether each privacy-related function statement really contains private data; for the training set to have sufficient coverage, the total number of samples exceeds one thousand; the training set also keeps the numbers of positive and negative samples, that is, code that contains private data and code that does not, roughly balanced overall, so that the classifier reaches its best accuracy.

3. The private data identification method driven by code-level semantic analysis according to claim 2, characterized in that security experts randomly select and manually label function call statements judged privacy-related from 100 popular applications in the Google app store; 4326 training samples are chosen as the training set, consisting of 2163 positive samples containing private data and an equal number of negative samples not containing private data;

a support vector machine is selected as the classifier.