CN108647497A - An automatic API key recognition system based on feature extraction
- Publication number
- CN108647497A (application CN201810403303.6A)
- Authority
- CN
- China
- Prior art keywords
- feature
- classification
- extraction
- api
- code
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/10—Protecting distributed programs or content, e.g. vending or licensing of copyrighted material ; Digital rights management [DRM]
- G06F21/12—Protecting executable software
- G06F21/121—Restricting unauthorised execution of programs
- G06F21/125—Restricting unauthorised execution of programs by manipulating the program code, e.g. source code, compiled code, interpreted code, machine code
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Computer Security & Cryptography (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Multimedia (AREA)
- Technology Law (AREA)
- Computer Hardware Design (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a feature-extraction-based system that automatically identifies API keys in source code. By constructing features common to API keys in source code written in different programming languages, it uses a machine-learning classification algorithm to identify those keys quickly and accurately.
Description
Technical field
The present invention relates to a feature-extraction-based system that automatically identifies API keys in source code. By constructing features common to API keys in source code written in different programming languages, it uses a machine-learning classification algorithm to detect API keys in source code automatically and accurately.
Background art
With the growth of open-source code communities and version control systems, more and more project code containing API keys is uploaded to hosting platforms without being desensitized. API keys are commonly used as authentication credentials, and their leakage may allow the related services to be maliciously exploited, causing economic losses that are hard to estimate. Developing information technology is a necessity and attention to network security is important; safeguarding Internet security has been raised to the national level. How to identify API keys automatically and accurately in source code written in different programming languages has therefore become an urgent problem in the field of information leakage prevention.
With the rapid development of the Internet, API key leakage incidents keep emerging. In 2013, on the order of ten thousand Amazon Web Services (AWS) keys were found to have been uploaded to the open-source code repository GitHub, and some of the API keys were maliciously invoked; Uber keys that were discovered could be used to send Uber notifications. Leaks of such API keys may let malicious users freely use the related services under the developer's identity, which in turn leads to abuse of those services.
Existing API key identification systems, such as the "Truffle Hog" tool, search the commit history and branches of GitHub repositories and use the high-entropy property of keys to detect suspected API keys. However, this approach relies on a single decision mechanism, and its false positive rate is high. Other detection methods, such as simple pattern matching, heuristic filtering, and source-code program slicing, generally have low detection efficiency. Although combining these methods can reach 100% accuracy, the detection targets only AWS and Facebook keys, and the sample set is small and poorly representative, so its applicability is limited. It follows that using feature extraction and machine learning to identify API keys automatically and accurately in source code written in different programming languages has important research significance.
Feature extraction and automatic identification of API keys must mainly solve the following problems:
(1) How to analyze and screen out suspicious API key strings in source code written in different programming languages.
(2) How to extract basic features and randomness features from the extracted suspicious API keys, and how to extract structure features from the source code in which each string is located.
(3) How to select an optimal feature subset, generate decision trees, and build an ensemble classifier that performs classification and identification.
This system focuses on solving the three problems above and implements a feature-extraction-based automatic API key identification system.
Summary of the invention
The invention is a system developed from several techniques: manual screening and class labeling; API key randomness feature extraction based on information entropy and log-likelihood; source-code structure feature extraction based on attribute counting combined with cosine-distance calculation of code attribute similarity; and a classification algorithm based on random forest. It preprocesses the sample data, extracts features from the suspicious API key strings found in it, and identifies API keys in source code written in different programming languages with an ensemble classifier.
The invention aims to achieve the following goals:
(1) The system trains on and converts the code files in projects written in different programming languages and obtains suspicious strings by screening them according to certain rules. It then classifies and votes on the processed data to identify the API keys in the code supplied to the system.
(2) The system can preprocess the collected sample code and process code in different programming languages to obtain suspicious strings. Its processing capabilities include generating a transition matrix from training samples, sorting code files, analyzing code, and selecting suspicious strings.
(3) The system extracts the randomness features of strings by calculations based on information entropy and log-likelihood, and extracts source-code structure features by attribute counting combined with cosine-distance calculation of code attribute similarity.
(4) The system has a preprocessing capability: it can generate a first-order Markov transition matrix from the training samples, label the project sample code by class, and select the suspicious strings in the code according to certain rules.
(5) The system has a feature extraction capability: it extracts randomness features by computing the information entropy and log-likelihood of suspicious strings and, for the source file in which a string is located, extracts source-code structure features by attribute counting combined with cosine-distance calculation of code attribute similarity.
To achieve the goals above, the invention uses the following technical solution. The feature-extraction-based API key identification system mainly comprises three parts: data preprocessing, feature extraction, and classification and identification.
Data preprocessing mainly comprises the training sample set, class labeling, and suspicious string selection. This part of the system first initializes and labels the project sample code collected from the network. It then screens suspicious strings according to certain rules and feeds them into the machine-learning-based feature extractor.
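As an illustration of the screening step, the sketch below pulls quoted string literals out of source text and keeps those that look like key candidates. The patent only refers to "certain rules", so the regular expression, the length bounds, and the function name `extract_candidates` are assumptions made for this example and are not the rules of the invention.

```python
import re

# Hypothetical screening rule (not from the patent): a quoted literal made
# only of key-like characters, 16 to 128 characters long, no whitespace.
_LITERAL = re.compile(r"""(["'])([A-Za-z0-9+/=_\-.]{16,128})\1""")


def extract_candidates(source_text):
    """Return the quoted literals in source_text that pass the assumed rule."""
    return [match.group(2) for match in _LITERAL.finditer(source_text)]
```

Short literals such as user names are dropped by the length bound, while longer opaque tokens pass through to feature extraction.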
The feature extraction model works on the suspicious strings obtained by data preprocessing. It constructs and extracts the basic features of API keys, extracts randomness features using information entropy and log-likelihood, and extracts source-code structure features from the source code in which a string is located by attribute counting combined with cosine-distance calculation of code attribute similarity.
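The information-entropy feature can be sketched as the Shannon entropy of the character distribution of a candidate string. The patent does not give the formula it uses; the character-level definition and the function name `shannon_entropy` below are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy of text in bits per character (0.0 for empty input)."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    # H = -sum(p * log2(p)) over the observed character frequencies p.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A string of one repeated character scores 0 bits, and a string whose characters are all distinct scores the maximum for its length, which is why randomly generated keys tend to score high.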
The classification and identification module trains on the extracted features, selects the optimal feature subset according to the Gini coefficient, and constructs multiple decision subtrees from which an ensemble classifier is built. Every subtree is trained on a sample drawn by random sampling with replacement to prevent overfitting. The classification results of the subtrees are put to a vote, and the voting result is taken as the classification result of the model.
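Two of the building blocks named above, random sampling with replacement and voting, can be sketched in a few lines. The function names and the label strings are hypothetical, and the sketch leaves out the construction of the decision subtrees themselves.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows uniformly at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(labels):
    """Return the label that the largest number of subtrees voted for."""
    return Counter(labels).most_common(1)[0][0]


# Example: three subtrees vote on one candidate string.
rng = random.Random(0)
training_rows = list(range(10))
sample_for_one_subtree = bootstrap_sample(training_rows, rng)
decision = majority_vote(["key", "not_key", "key"])
```

Because each subtree sees a different resampled training set, the individual trees make partly independent errors, and the vote averages them out.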
Description of the drawings
Fig. 1 is a diagram of the automatic API key identification model of the present invention.
Detailed description:
The feature-extraction-based automatic API key recognition system comprises three modules: a data preprocessing module, a feature extraction and processing module, and a classification and identification module.
Fig. 1 shows the automatic API key identification model of the system and describes in detail the design and deployment architecture of the API key identification system. The project files in the samples are sorted, then analyzed according to certain rules, and suspicious strings are screened out; this is the data preprocessing applied to the system's original samples. The feature extraction module extracts basic features and randomness features from the suspicious strings and structure features from the source code in which they are located, completing the extraction of the system's 7 features in three classes.
The working process of the present invention is as follows:
Features of API keys are extracted with a series of mathematical methods: randomness features are extracted by calculations based on information entropy and log-likelihood, and source-code structure features are extracted by attribute counting combined with cosine-distance calculation of code attribute similarity. The selected 7 features in three classes are then trained with the random forest algorithm: the optimal feature subset is chosen according to the Gini coefficient, and multiple decision subtrees are constructed to build an ensemble classifier. Each subtree is trained on a sample drawn by random sampling with replacement to prevent overfitting. Each decision subtree votes on the classification result, and the voting result is taken as the final classification result.
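The cosine-distance step can be sketched as a cosine similarity between two attribute-count vectors. The patent does not list which code attributes are counted, so the dictionary keys in the usage note are purely hypothetical, as is the function name `cosine_similarity`.

```python
import math


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two attribute-count dictionaries."""
    keys = set(counts_a) | set(counts_b)
    dot = sum(counts_a.get(k, 0) * counts_b.get(k, 0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_a * norm_b)
```

For example, `cosine_similarity({"assignments": 3, "calls": 1}, {"assignments": 6, "calls": 2})` is 1 up to floating-point rounding, because the two count vectors point in the same direction even though their magnitudes differ.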
The improved feature-extraction-based API identification model works as follows:
The improved feature extraction for API keys does not use simple pattern matching, heuristic filtering, or source-code program slicing. Instead, building on these approaches, it performs basic feature statistics and static source-code structure analysis on the samples. String length, proportion of special characters, proportion of digits, and proportion of vowel characters are chosen as the basic features of API keys. Randomness features are extracted by two calculations, information entropy and log-likelihood, and static source-code structure features are extracted by attribute counting combined with cosine-distance calculation of code attribute similarity. Because the value ranges of the three classes of features are not necessarily the same, the feature values are also normalized to reduce the effect of range differences on classification.
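The four basic features and the normalization step can be sketched as follows. The patent does not define "special character" or name the normalization method; this example assumes that a special character is any non-alphanumeric character and uses min-max scaling, and all names are hypothetical.

```python
VOWELS = set("aeiouAEIOU")


def basic_features(candidate):
    """Length plus special-character, digit, and vowel ratios of a string."""
    n = len(candidate)
    if n == 0:
        return {"length": 0, "special_ratio": 0.0,
                "digit_ratio": 0.0, "vowel_ratio": 0.0}
    return {
        "length": n,
        # Assumption: "special" means not a letter and not a digit.
        "special_ratio": sum(not c.isalnum() for c in candidate) / n,
        "digit_ratio": sum(c.isdigit() for c in candidate) / n,
        "vowel_ratio": sum(c in VOWELS for c in candidate) / n,
    }


def min_max_normalize(values):
    """Scale one feature column to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Normalizing each feature column separately keeps a wide-range feature such as string length from dominating ratio features that already lie between 0 and 1.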
The probability of one character following another within an API key string is very low, so the computed P(X) is extremely small; the computer cannot represent it with the available bits, and it becomes essentially zero. To address this numerical underflow, the log-likelihood is used instead. The likelihood is numerically equal to the corresponding probability, and taking the logarithm does not affect the monotonicity of the original likelihood.
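The underflow argument can be made concrete with a first-order Markov model over characters. The sketch below assumes add-one smoothing, a fixed alphabet, and that character pairs outside the alphabet are skipped; none of these choices is stated in the patent, and the function names are hypothetical.

```python
import math


def train_transition_matrix(corpus, alphabet, smoothing=1.0):
    """Estimate P(next | prev) for characters in alphabet from corpus strings."""
    counts = {a: {b: smoothing for b in alphabet} for a in alphabet}
    for word in corpus:
        for prev, nxt in zip(word, word[1:]):
            if prev in counts and nxt in counts[prev]:
                counts[prev][nxt] += 1
    matrix = {}
    for prev, row in counts.items():
        total = sum(row.values())
        matrix[prev] = {nxt: c / total for nxt, c in row.items()}
    return matrix


def log_likelihood(text, matrix):
    """Sum of log transition probabilities; unknown pairs are skipped."""
    total = 0.0
    for prev, nxt in zip(text, text[1:]):
        p = matrix.get(prev, {}).get(nxt)
        if p is not None:
            total += math.log(p)
    return total


# Multiplying many small probabilities underflows to zero,
# while the sum of their logarithms stays finite.
product = 1.0
for _ in range(400):
    product *= 0.01
log_sum = 400 * math.log(0.01)
```

Because the logarithm is monotonic, ranking strings by log-likelihood gives the same order as ranking them by the likelihood itself, without the loss of precision.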
The random forest algorithm used in the model trains on the selected seven features in three classes, chooses the optimal feature subset according to the Gini coefficient, and constructs multiple decision subtrees to build the ensemble classifier. Each subtree is trained on a sample drawn with replacement, which effectively prevents overfitting. When a new sample to be tested is input, each decision subtree votes on the classification result, and the voting result is taken as the classification result of the model.
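A minimal sketch of how a Gini-based criterion scores a candidate split is given below. It assumes that the "Gini coefficient" in the text is the Gini impurity used by CART-style decision trees, which the patent does not state explicitly; the function names and labels are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k ** 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_gini(values, labels, threshold):
    """Weighted Gini impurity after splitting on value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return (len(left) * gini_impurity(left)
            + len(right) * gini_impurity(right)) / n
```

A lower weighted impurity means a cleaner separation of the classes, so a tree builder would prefer the feature and threshold that minimize `split_gini`.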
Claims (4)
1. An automatic API key identification model based on feature extraction, characterized by comprising the following steps:
(1) Step 1: Preprocess the data: sort the code files in the project, then screen out suspicious strings according to certain rules;
(2) Step 2: Apply certain mathematical methods to the suspicious strings obtained by preprocessing to extract basic features and randomness features, and extract the structure features of the source code in which the suspicious strings are located;
(3) Step 3: Normalize the three classes of extracted feature values to reduce the effect of range differences on classification;
(4) Step 4: Train on the extracted 7 features in three classes using the random-forest-based classification method, build an ensemble classifier, vote on the classification results, and automatically identify API keys;
(5) Step 5: Establish multi-level threat identities: extract identity information from data that poses a threat to the system and store it in an identity feature library, which at the same time guides behavior judgment;
(6) Step 6: Select the optimal feature subset and construct multiple decision subtrees; preprocess the source code containing API keys that is input to the system, then extract the basic features, randomness features, and source-code structure features and store them in a feature library; the features at the same time guide the voting on classification results.
2. The multi-level decision tree identifier based on machine-learning feature extraction and ensemble classification according to claim 1, characterized by: a sample processing method based on manual screening and class labeling; feature extraction based on suspicious strings and source files, including basic feature extraction and information-entropy- and log-likelihood-based randomness feature extraction on suspicious strings, and source-code structure feature extraction on source files based on attribute counting combined with cosine-distance calculation of code attribute similarity; a random-forest-based classification method that builds an ensemble classifier by constructing multiple decision subtrees; and voting on the classification results to obtain the classification and identification result.
3. The automatic API key identification according to claim 1, characterized by: preprocessing the input project code written in different programming languages to obtain API key features, comprising 7 features in three classes: basic features, randomness features, and source-code features; and selecting the optimal feature subset according to the Gini coefficient, constructing multiple decision subtrees, and building an ensemble classification identifier.
4. The preprocessing model based on manual screening and class labeling according to claim 1, characterized by: generating a first-order Markov transition matrix from a corpus, sorting the code files in the test sample set by file suffix, analyzing the code files according to certain rules, and further screening out suspicious strings.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810403303.6A CN108647497A (en) | 2018-04-28 | 2018-04-28 | An automatic API key recognition system based on feature extraction |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810403303.6A CN108647497A (en) | 2018-04-28 | 2018-04-28 | An automatic API key recognition system based on feature extraction |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108647497A (en) | 2018-10-12 |
Family
ID=63748240
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810403303.6A Pending CN108647497A (en) | An automatic API key recognition system based on feature extraction | 2018-04-28 | 2018-04-28 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108647497A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN204759452U (en) * | 2015-07-09 | 2015-11-11 | 华南理工大学 | Traffic conflagration detecting system based on many characteristics of video smog fuse |
CN105389585A (en) * | 2015-10-20 | 2016-03-09 | 深圳大学 | Random forest optimization method and system based on tensor decomposition |
CN106105100A (en) * | 2014-03-18 | 2016-11-09 | Twc专利信托公司 | Low delay, high capacity, high power capacity API gateway |
US20160344543A1 (en) * | 2015-05-19 | 2016-11-24 | Coinbase, Inc. | Security system forming part of a bitcoin host computer |
CN107302474A (en) * | 2017-07-04 | 2017-10-27 | 四川无声信息技术有限公司 | The feature extracting method and device of network data application |
Non-Patent Citations (1)
Title |
---|
Xue Min et al.: "Research on Automatic Identification Technology for API Keys in Source Code", 《HTTP://KNS.CNKI.NET/KCMS/DETAIL/31.1289.TP.20170925.1717.008.HTML》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111159697A (en) * | 2019-12-27 | 2020-05-15 | 支付宝(杭州)信息技术有限公司 | Key detection method and device and electronic equipment |
CN111159697B (en) * | 2019-12-27 | 2022-06-03 | 支付宝(杭州)信息技术有限公司 | Key detection method and device and electronic equipment |
CN112702157A (en) * | 2020-12-04 | 2021-04-23 | 河南大学 | Block cipher system identification method based on improved random forest algorithm |
CN114417422A (en) * | 2022-01-26 | 2022-04-29 | 湖南快乐阳光互动娱乐传媒有限公司 | Automatic protection method and device for sensitive information in code warehouse |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111639497B (en) | Abnormal behavior discovery method based on big data machine learning | |
CN111738011A (en) | Illegal text recognition method and device, storage medium and electronic device | |
CN106570513A (en) | Fault diagnosis method and apparatus for big data network system | |
CN112905421A (en) | Container abnormal behavior detection method of LSTM network based on attention mechanism | |
CN107360152A (en) | A kind of Web based on semantic analysis threatens sensory perceptual system | |
CN111259219B (en) | Malicious webpage identification model establishment method, malicious webpage identification method and malicious webpage identification system | |
CN109872162A (en) | A kind of air control classifying identification method and system handling customer complaint information | |
CN111143838B (en) | Database user abnormal behavior detection method | |
CN108647497A (en) | A kind of API key automatic recognition systems of feature based extraction | |
CN108229170A (en) | Utilize big data and the software analysis method and device of neural network | |
CN109067800A (en) | A kind of cross-platform association detection method of firmware loophole | |
CN116361815B (en) | Code sensitive information and hard coding detection method and device based on machine learning | |
CN112148997A (en) | Multi-modal confrontation model training method and device for disaster event detection | |
CN108984514A (en) | Acquisition methods and device, storage medium, the processor of word | |
CN115277180A (en) | Block chain log anomaly detection and tracing system | |
CN116318830A (en) | Log intrusion detection system based on generation of countermeasure network | |
Harbola et al. | Improved intrusion detection in DDoS applying feature selection using rank & score of attributes in KDD-99 data set | |
CN110889451A (en) | Event auditing method and device, terminal equipment and storage medium | |
CN109753798A (en) | A kind of Webshell detection model based on random forest and FastText | |
CN112257076A (en) | Vulnerability detection method based on random detection algorithm and information aggregation | |
CN111783063A (en) | Operation verification method and device | |
CN116226769A (en) | Short video abnormal behavior recognition method based on user behavior sequence | |
CN113259369B (en) | Data set authentication method and system based on machine learning member inference attack | |
CN115842645A (en) | UMAP-RF-based network attack traffic detection method and device and readable storage medium | |
CN113722230A (en) | Integrated assessment method and device for vulnerability mining capability of fuzzy test tool |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20181012 |