CN108647497A - An API key automatic recognition system based on feature extraction - Google Patents

An API key automatic recognition system based on feature extraction

Info

Publication number
CN108647497A
Authority
CN
China
Prior art keywords
feature
classification
extraction
api
code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810403303.6A
Other languages
Chinese (zh)
Inventor
黄诚
方勇
刘亮
薛敏
赵翠镕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN201810403303.6A
Publication of CN108647497A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/10 Protecting distributed programs or content, e.g. vending or licensing of copyrighted material; Digital rights management [DRM]
    • G06F21/12 Protecting executable software
    • G06F21/121 Restricting unauthorised execution of programs
    • G06F21/125 Restricting unauthorised execution of programs by manipulating the program code, e.g. source code, compiled code, interpreted code, machine code
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Security & Cryptography (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Technology Law (AREA)
  • Computer Hardware Design (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a feature-extraction-based system that automatically identifies API keys in source code. By constructing features common to API keys in source code written in different programming languages and applying a machine-learning classification algorithm, the system identifies API keys in source code quickly and accurately.

Description

An API key automatic recognition system based on feature extraction
Technical field
The present invention provides a feature-extraction-based system for automatically identifying API keys in source code. It constructs features common to API keys in source code written in different programming languages and applies a machine-learning classification algorithm to detect API keys in source code automatically and accurately.
Background art
With the growth of open-source communities and version control systems, more and more project code containing API keys is uploaded to hosting platforms without being sanitized. API keys are commonly used as authentication credentials, and their leakage may allow the related services to be maliciously exploited, causing economic losses that are hard to estimate. Developing information technology is a necessity, attention to network security is important, and safeguarding Internet security has risen to the national level. How to automatically and accurately identify API keys in source code written in different programming languages has therefore become an urgent problem in the field of information-leak prevention.
With the rapid development of the Internet, API key leakage incidents keep emerging. In 2013, some ten thousand Amazon Web Services (AWS) keys were found uploaded to the open-source code repository GitHub, and some of those API keys were maliciously invoked; leaked Uber keys that were discovered could be used to send Uber notifications. Leaked API keys may allow a malicious user to use the related services freely under the developer's identity, which in turn leads to abuse of those services.
Existing API key identification systems, such as the "Truffle Hog" tool, detect suspected API keys by searching the commit history and branches of GitHub repositories and exploiting the high entropy of keys. However, the decision mechanism of this approach is simplistic and its false-positive rate is high. Other detection approaches, such as simple pattern matching, heuristic filtering, and source-code program slicing, all have generally low detection efficiency. Although combining these approaches can reach 100% accuracy, detection covers only AWS and Facebook keys, and the sample set is small and poorly representative, so applicability is limited. Applying feature extraction and machine learning to identify API keys automatically and accurately in source code written in different programming languages is therefore of significant research value.
The main problems that feature extraction and automatic identification of API keys must solve are:
(1) How to analyze and screen suspicious API key strings in source code written in different programming languages.
(2) How to extract basic features and randomness features from the extracted suspicious API keys, and how to extract structure features of the source code in which the strings appear.
(3) How to select the optimal feature subset and generate decision trees to build an ensemble classifier that performs classification and identification.
This system focuses on solving the three problems above and implements an automatic API key identification system based on feature extraction.
Summary of the invention
The invention is an advanced system developed from several techniques: manual screening and class labeling; API key randomness feature extraction based on information entropy and log-likelihood; source-code structure feature extraction based on attribute counting combined with cosine distance to compute code attribute similarity; and a classification algorithm based on random forests. The system preprocesses sample data, extracts features from the suspicious API key strings in it, and uses an ensemble classifier to identify API keys in source code written in different programming languages.
The invention aims to achieve the following goals: (1) The system trains on and transforms the code files in projects written in different programming languages, and obtains suspicious strings by screening according to certain rules. The system accurately classifies and votes on the data processing results to identify the API keys in the code currently input to the system.
(2) The system can preprocess the collected sample code and process code in different programming languages to obtain suspicious strings. Its processing capabilities include generating a transition matrix from training samples, sorting code files, analyzing code, and selecting suspicious strings.
(3) The system extracts the randomness features of strings using computations based on information entropy and log-likelihood, and extracts source-code structure features using attribute counting combined with cosine distance to compute code attribute similarity.
(4) The system has preprocessing capability: it can generate a first-order Markov transition matrix from training samples, label the project sample code by class, and select the suspicious strings in the code according to certain rules.
(5) The system has feature extraction capability: it extracts randomness features by computing the information entropy and log-likelihood of suspicious strings and, for the source file containing each string, extracts source-code structure features by attribute counting combined with cosine distance to compute code attribute similarity.
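The information-entropy computation named in goals (3) and (5) can be illustrated with a minimal Python sketch. The patent does not give the formula it uses; the sketch assumes the standard Shannon entropy over the character frequencies of a string, and the function name `shannon_entropy` is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Shannon entropy of the character distribution of s, in bits per character.

    Assumption: the patent's "information entropy" is the usual Shannon entropy
    computed over single-character frequencies.
    """
    if not s:
        return 0.0
    n = len(s)
    counts = Counter(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A string made of one repeated character scores 0 bits, while a string whose characters are all distinct scores log2 of its length, so randomly generated keys tend to score higher than ordinary identifiers of the same length.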
To achieve the above goals, the invention adopts the following technical solution: the feature-extraction-based API key identification system mainly comprises three parts: data preprocessing, feature extraction, and classification and identification.
Data preprocessing mainly comprises the training sample set, class labeling, and suspicious string selection. This part of the system first initializes and labels the project sample code collected from the network, then screens suspicious strings according to certain rules and feeds them into the machine-learning-based feature extractor.
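The patent does not state the screening rules, so the following Python sketch only illustrates what a suspicious string selection step could look like. The rule used here (quoted literals of at least 16 characters drawn from a key-like alphabet) and the names `STRING_LITERAL` and `candidate_strings` are assumptions made for illustration, not the patent's rules.

```python
import re

# Hypothetical rule: a single- or double-quoted literal of 16 or more characters
# drawn from letters, digits and a few symbols that commonly appear in keys.
STRING_LITERAL = re.compile(r"""(["'])([A-Za-z0-9+/=_\-]{16,})\1""")


def candidate_strings(source: str) -> list:
    """Return the quoted literals in source that match the hypothetical rule."""
    return [match.group(2) for match in STRING_LITERAL.finditer(source)]
```

For example, `candidate_strings('key = "x9Kq2LmZ8vTr4WcY1pNd"')` returns the 20-character literal, while a short literal such as `"bob"` is ignored. Candidates selected this way would then go to the feature extractor rather than being reported directly.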
The feature extraction model operates on the suspicious strings obtained from data preprocessing. It extracts the basic features constructed for API keys, extracts randomness features based on information entropy and log-likelihood, and extracts source-code structure features from the source code containing each string by attribute counting combined with cosine distance to compute code attribute similarity.
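A minimal sketch of attribute counting combined with cosine similarity is shown below. The patent does not list which code attributes are counted or what the counts are compared against; the vocabulary `ATTRIBUTES` and both function names are hypothetical, chosen only to show the shape of the computation.

```python
import math

# Hypothetical attribute vocabulary; the patent does not enumerate its attributes.
ATTRIBUTES = ("key", "secret", "token", "auth", "password")


def attribute_counts(source: str) -> list:
    """Count case-insensitive occurrences of each hypothetical attribute in source."""
    lowered = source.lower()
    return [lowered.count(attribute) for attribute in ATTRIBUTES]


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two count vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Under these assumptions, the structure feature of a file would be the cosine similarity between its attribute count vector and a reference vector built from files known to contain API keys. Cosine distance is one minus this similarity.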
The classification and identification module trains on the extracted features, selects the optimal feature subset according to the Gini coefficient, and constructs multiple decision subtrees to build an ensemble classifier. Every subtree is trained by random sampling with replacement to prevent overfitting, and the classification results of the subtrees are put to a vote, with the voting result taken as the model's classification result.
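The patent does not define the Gini coefficient it uses. Assuming it refers to the Gini impurity used in CART-style decision trees, the criterion for a sample set S with K classes, and for a candidate split of S into S_L and S_R, can be written as:

```latex
G(S) = 1 - \sum_{k=1}^{K} p_k^{2},
\qquad
G_{\mathrm{split}}(S; S_L, S_R) = \frac{|S_L|}{|S|}\, G(S_L) + \frac{|S_R|}{|S|}\, G(S_R)
```

Here p_k is the fraction of samples in S that belong to class k. With two classes (API key or not), G(S) ranges from 0 for a pure set to 0.5 for an even mix, and the feature and threshold that minimize G_split are preferred at each node.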
Description of the drawings
Fig. 1 is a diagram of the API key automatic identification model of the present invention.
Detailed description:
The feature-extraction-based API key automatic recognition system comprises three modules: a data preprocessing module, a feature extraction and processing module, and a classification and identification module.
Fig. 1 shows the API key automatic identification model of the system and describes in detail the design and deployment architecture of the API key identification system. The project files in the samples are sorted, and suspicious strings are analyzed and screened out according to certain rules, so that the system's raw samples pass through data preprocessing. The feature extraction module then extracts basic features and randomness features from the suspicious strings and structure features from the source code containing them, completing the extraction of 7 features in three classes.
The working process of the present invention is:
Features of API keys are extracted using a series of mathematical methods: randomness features are extracted using computations based on information entropy and log-likelihood, and source-code structure features are extracted by attribute counting combined with cosine distance to compute code attribute similarity. The selected 7 features in three classes are then trained with a random-forest-based algorithm: the optimal feature subset is selected according to the Gini coefficient, and multiple decision subtrees are constructed to build an ensemble classifier. Every subtree is trained by random sampling with replacement to prevent overfitting. The classification results of the subtrees are put to a vote, and the voting result is taken as the final classification result.
The improved, feature-extraction-based API key identification model works as follows:
The improved API key feature extraction does not rely on simple pattern matching, heuristic filtering, or source-code program slicing alone. Instead, building on those approaches, it performs basic feature statistics and static source-code structure analysis on the samples. String length, the proportion of special characters, the proportion of digits, and the proportion of vowel characters are chosen as the basic features of an API key. The randomness features of API keys are extracted by two computations, information entropy and log-likelihood, and static source-code structure features are extracted by attribute counting combined with cosine distance to compute code attribute similarity. Because the value ranges of the three classes of feature values are not necessarily the same, the feature values are also normalized to reduce the effect of range differences on the classification results.
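The four basic features and the normalization step can be sketched in Python as follows. The patent does not say how special characters are defined or which normalization is applied; the sketch assumes that any non-alphanumeric character is special and uses min-max scaling, and the names `basic_features` and `min_max_normalize` are hypothetical.

```python
VOWELS = set("aeiouAEIOU")


def basic_features(s: str) -> dict:
    """Length plus special-character, digit and vowel proportions of s.

    Assumption: a "special character" is any character that is not alphanumeric.
    """
    n = len(s)
    if n == 0:
        return {"length": 0, "special_ratio": 0.0, "digit_ratio": 0.0, "vowel_ratio": 0.0}
    return {
        "length": n,
        "special_ratio": sum(not c.isalnum() for c in s) / n,
        "digit_ratio": sum(c.isdigit() for c in s) / n,
        "vowel_ratio": sum(c in VOWELS for c in s) / n,
    }


def min_max_normalize(column):
    """Scale one feature column to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]
```

Normalizing each feature column separately keeps a wide-ranging feature such as string length from dominating the ratio features, which already lie between 0 and 1.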
The transition probabilities between successive characters of an API key string are very low, so the computed probability P(X) becomes extremely small, too small for the computer to represent with the available bits, and is effectively zero. To mitigate this numerical underflow problem, log-likelihood estimation is used. The likelihood estimate is numerically equal to the corresponding probability, and taking the logarithm does not change the monotonicity of the original likelihood estimate.
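The log-likelihood feature can be sketched with a first-order Markov model over characters. The patent mentions a first-order Markov transition matrix generated from a corpus but gives no training details, so the add-one smoothing, the fixed alphabet, the averaging over string length, and both function names below are assumptions.

```python
import math
from collections import defaultdict


def train_transitions(corpus, alphabet, smoothing=1.0):
    """Estimate P(next char | current char) from a corpus, with additive smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for word in corpus:
        for a, b in zip(word, word[1:]):
            counts[a][b] += 1
    probs = {}
    for a in alphabet:
        total = sum(counts[a].values()) + smoothing * len(alphabet)
        probs[a] = {b: (counts[a][b] + smoothing) / total for b in alphabet}
    return probs


def avg_log_likelihood(s, probs):
    """Mean log transition probability of s; pairs outside the alphabet are skipped.

    Summing logarithms avoids the underflow that multiplying many small
    probabilities would cause.
    """
    pairs = [(a, b) for a, b in zip(s, s[1:]) if a in probs and b in probs[a]]
    if not pairs:
        return 0.0
    return sum(math.log(probs[a][b]) for a, b in pairs) / len(pairs)
```

A string whose character transitions are common in the training corpus receives a higher (less negative) score than a random-looking string, so a low score is a signal of key-like randomness.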
The random forest algorithm used in the model trains on the selected seven features in three classes, selects the optimal feature subset according to the Gini coefficient, and constructs multiple decision subtrees to build an ensemble classifier. Every subtree is trained by sampling with replacement, which effectively prevents overfitting. When a new sample to be tested is input, each decision subtree votes on the classification result, and the voting result is taken as the model's classification result.
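The training and voting scheme can be illustrated with a deliberately simplified Python sketch. It is not a full random forest: each "tree" is a one-level decision stump, chosen by Gini impurity on a bootstrap sample and a random feature subset, and all names and default parameters are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels, default=0):
    """Most common label, or default for an empty list."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def fit_stump(X, y, feature_ids):
    """Pick the (feature, threshold) split with the lowest weighted Gini impurity."""
    best = None
    fallback = majority(y)
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            left = [label for row, label in zip(X, y) if row[f] <= t]
            right = [label for row, label in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t, majority(left, fallback), majority(right, fallback))
    return best[1:]


def fit_forest(X, y, n_trees=25, n_features=2, seed=0):
    """Train stumps on bootstrap samples, each restricted to a random feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # sampling with replacement
        feature_ids = rng.sample(range(len(X[0])), n_features)
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature_ids))
    return forest


def predict(forest, row):
    """Majority vote over the stumps' individual predictions."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return majority(votes)
```

In practice each row of `X` would hold the seven normalized feature values of one suspicious string and `y` its manual label. A production system would more likely use full-depth trees from an established library than this sketch.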

Claims (4)

1. An API key automatic identification model based on feature extraction, characterized by comprising the following steps:
(1) Step 1: preprocess the data by sorting the code files in the project and then screening out suspicious strings according to certain rules;
(2) Step 2: using certain mathematical methods, extract basic features and randomness features from the suspicious strings obtained by preprocessing, and extract structure features of the source code in which the suspicious strings appear;
(3) Step 3: normalize the extracted three classes of feature values to reduce the effect of value-range differences on the classification results;
(4) Step 4: train on the extracted 7 features in three classes using the random-forest-based classification and identification method, build an ensemble classifier, and vote on the classification results to identify API keys automatically;
(5) Step 5: establish multi-level threat identities, extract the identity information of data that poses a threat to the system and store it in an identity feature library, the identity feature library at the same time guiding behavior judgment;
(6) Step 6: select the optimal feature subset and construct multiple decision trees; preprocess the source code containing API keys that is input to the system, then extract the basic features, randomness features, and source-code structure features and store them in a feature library, the features at the same time guiding the vote on the classification results.
2. The multi-level decision tree identifier built on machine-learning-based feature extraction and ensemble classification according to claim 1, characterized by: a sample processing approach based on manual screening and class labeling; feature extraction based on suspicious strings and source files, comprising basic feature extraction and information-entropy- and log-likelihood-based randomness feature extraction on suspicious strings, and source-code structure feature extraction on source files based on attribute counting combined with cosine distance to compute code attribute similarity; a random-forest-based classification and identification approach that builds an ensemble classifier by constructing multiple decision trees; and taking the result of voting on the classification results as the classification and identification result.
3. The API key automatic identification according to claim 1, characterized in that: input project code in different programming languages is preprocessed to obtain API key features, comprising basic features, randomness features, and source-code features, 7 features in three classes in total; and the optimal feature subset is selected according to the Gini coefficient and multiple decision trees are constructed to build an ensemble classification identifier.
4. The preprocessing model based on manual screening and class labeling according to claim 1, characterized in that: a first-order Markov transition matrix is generated from a corpus, the code files in the test sample set are sorted by file suffix, and the code files are analyzed according to certain rules to further screen out suspicious strings.
CN201810403303.6A 2018-04-28 2018-04-28 An API key automatic recognition system based on feature extraction Pending CN108647497A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810403303.6A CN108647497A (en) 2018-04-28 2018-04-28 An API key automatic recognition system based on feature extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810403303.6A CN108647497A (en) 2018-04-28 2018-04-28 An API key automatic recognition system based on feature extraction

Publications (1)

Publication Number Publication Date
CN108647497A true CN108647497A (en) 2018-10-12

Family

ID=63748240

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810403303.6A Pending CN108647497A (en) 2018-04-28 2018-04-28 An API key automatic recognition system based on feature extraction

Country Status (1)

Country Link
CN (1) CN108647497A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN204759452U (en) * 2015-07-09 2015-11-11 华南理工大学 Traffic conflagration detecting system based on many characteristics of video smog fuse
CN105389585A (en) * 2015-10-20 2016-03-09 深圳大学 Random forest optimization method and system based on tensor decomposition
CN106105100A (en) * 2014-03-18 2016-11-09 Twc专利信托公司 Low delay, high capacity, high power capacity API gateway
US20160344543A1 (en) * 2015-05-19 2016-11-24 Coinbase, Inc. Security system forming part of a bitcoin host computer
CN107302474A (en) * 2017-07-04 2017-10-27 四川无声信息技术有限公司 The feature extracting method and device of network data application

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
薛敏 (Xue Min) et al.: "Research on Automatic Identification Technology for API Keys in Source Code" (源代码中API密钥自动识别技术研究), 《HTTP://KNS.CNKI.NET/KCMS/DETAIL/31.1289.TP.20170925.1717.008.HTML》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111159697A (en) * 2019-12-27 2020-05-15 支付宝(杭州)信息技术有限公司 Key detection method and device and electronic equipment
CN111159697B (en) * 2019-12-27 2022-06-03 支付宝(杭州)信息技术有限公司 Key detection method and device and electronic equipment
CN112702157A (en) * 2020-12-04 2021-04-23 河南大学 Block cipher system identification method based on improved random forest algorithm
CN114417422A (en) * 2022-01-26 2022-04-29 湖南快乐阳光互动娱乐传媒有限公司 Automatic protection method and device for sensitive information in code warehouse

Similar Documents

Publication Publication Date Title
CN111639497B (en) Abnormal behavior discovery method based on big data machine learning
CN111738011A (en) Illegal text recognition method and device, storage medium and electronic device
CN106570513A (en) Fault diagnosis method and apparatus for big data network system
CN112905421A (en) Container abnormal behavior detection method of LSTM network based on attention mechanism
CN107360152A (en) A kind of Web based on semantic analysis threatens sensory perceptual system
CN111259219B (en) Malicious webpage identification model establishment method, malicious webpage identification method and malicious webpage identification system
CN109872162A (en) A kind of air control classifying identification method and system handling customer complaint information
CN111143838B (en) Database user abnormal behavior detection method
CN108647497A (en) A kind of API key automatic recognition systems of feature based extraction
CN108229170A (en) Utilize big data and the software analysis method and device of neural network
CN109067800A (en) A kind of cross-platform association detection method of firmware loophole
CN116361815B (en) Code sensitive information and hard coding detection method and device based on machine learning
CN112148997A (en) Multi-modal confrontation model training method and device for disaster event detection
CN108984514A (en) Acquisition methods and device, storage medium, the processor of word
CN115277180A (en) Block chain log anomaly detection and tracing system
CN116318830A (en) Log intrusion detection system based on generation of countermeasure network
Harbola et al. Improved intrusion detection in DDoS applying feature selection using rank & score of attributes in KDD-99 data set
CN110889451A (en) Event auditing method and device, terminal equipment and storage medium
CN109753798A (en) A kind of Webshell detection model based on random forest and FastText
CN112257076A (en) Vulnerability detection method based on random detection algorithm and information aggregation
CN111783063A (en) Operation verification method and device
CN116226769A (en) Short video abnormal behavior recognition method based on user behavior sequence
CN113259369B (en) Data set authentication method and system based on machine learning member inference attack
CN115842645A (en) UMAP-RF-based network attack traffic detection method and device and readable storage medium
CN113722230A (en) Integrated assessment method and device for vulnerability mining capability of fuzzy test tool

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20181012
