CN112988217A - Code library design method and detection method for rapid full-network code traceability detection

Info

Publication number
CN112988217A
Authority
CN
China
Prior art keywords
git
code
database
commit
warehouse
Legal status
Granted
Application number
CN202110278117.6A
Other languages
Chinese (zh)
Other versions
CN112988217B (en)
Inventor
周明辉 (Zhou Minghui)
高恺 (Gao Kai)
何昊 (He Hao)
Current Assignee
Peking University
Original Assignee
Peking University
Application filed by Peking University
Priority to CN202110278117.6A
Publication of CN112988217A
Application granted
Publication of CN112988217B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/71 Version control; Configuration management


Abstract

The invention discloses a code library design method for rapid whole-network code traceability detection. Through the processes of project discovery, data extraction, data storage, code information mapping construction and data updating, the Git objects of all open-source projects on the network that use Git are stored efficiently to obtain a code library, and the code library can be updated efficiently. The method comprises: adopting a storage mode in which Git objects are stored in blocks by type; establishing relation mappings from code files to code file information, so that the whole-network information of a code file can be retrieved quickly; and adopting an efficient update mode for the constructed ultra-large code library, in which a customized Git fetch protocol is implemented on the Libgit2 function library and, with the constructed code library as the back end, the newly added Git object data of remote repositories are obtained efficiently. The code library generated by the method can be updated regularly and efficiently, and it supports rapid whole-network traceability detection of code at file granularity with high detection efficiency.

Description

Code library design method and detection method for rapid full-network code traceability detection
Technical Field
The invention provides a code library design method for rapid whole-network code traceability detection and a rapid whole-network code traceability detection method based on that code library; it belongs to the technical field of software engineering.
Background
With the rapid development of open-source software, a large number of excellent open-source software resources have accumulated on the network, and open-source code is increasingly used in software development. Using open-source code improves software development efficiency but also introduces risks. For example, if the source of a piece of open-source code is unknown, subsequent bug fixes to that code cannot be synchronized; meanwhile, the open-source code exposes its user to legal risks such as license-compliance and intellectual-property risks, and brings varying degrees of security threat and economic or reputation loss. A well-known open-source risk case is the Heartbleed vulnerability. It is a security hole in the encryption library OpenSSL, which is widely used to implement the Transport Layer Security protocol of the Internet. It was introduced into OpenSSL in 2012 and first disclosed to the public in April 2014. As long as a defective OpenSSL instance is used, either the server or the client may be attacked. Therefore, traceability detection of the code in a software product is crucial.
In order to implement code traceability detection for a software product, a code library for code matching and search must be constructed, and the number of codes the library contains and the way it is built directly affect the accuracy and efficiency of detection. Because building a large code library is difficult, most existing code traceability detection techniques propose efficient detection algorithms on the premise that a massive code library already exists; for example, research has studied how to select, from a massive open-source software corpus, the open-source software most likely to have been reused, to participate in traceability comparison. An efficient technique for building the code library itself is lacking. The prior art generally downloads several open-source software projects to the local machine to form a code library. However, such code libraries cover too few projects to support whole-network traceability detection of code, and poor code library architecture design makes code traceability detection inefficient.
Disclosure of Invention
Aiming at the technical problems in the prior art, the invention provides a code library design method for rapid whole-network code traceability detection and realizes rapid whole-network code traceability detection based on the code library. The code library generated by the method supports rapid whole-network traceability detection of code at file granularity with high detection efficiency. At the same time, the code library can be updated regularly and efficiently.
In the invention, "whole-network code" refers to the collected code data of most open-source code hosting platforms. A repository hosted on a code hosting platform is referred to as a remote repository; the copy obtained by cloning a remote repository to the local machine is referred to as a local repository. The code library designed for rapid whole-network code traceability detection is a database formed by cloning remote repositories locally and extracting data from the local repositories.
The code library design for rapid whole-network code traceability detection provided by the invention exploits the internal principles of Git and its hash values, specifically:
1) A remote repository can be downloaded locally via the git clone command, and updates to the remote repository can be transferred back locally via the git fetch command. git fetch computes which objects the local repository is missing relative to the remote repository by comparing the heads of the two repositories, and the remote repository then transmits these missing objects back.
2) Git uses four types of data objects for version control, each referenced by the SHA1 value computed from the object's content. A commit object represents a change to a project and contains the SHA1s of the parent commit(s) (if any) and of the top-level folder (tree object), the author ID and timestamp, the committer ID and timestamp, and the commit message. A tree object represents a folder within a project: it is a list containing the SHA1s of the files (blobs) and subfolders (other tree objects) in the folder, along with their associated mode, type and name. A blob object is a compressed version of one version of a file's content (source code). A tag object is a string used to associate a readable name with a particular version of the repository. One commit represents one code change and typically involves modifications to several files (blobs).
3) A hash value is a fixed-length value computed from the content of a file; different file contents generate different hash values, so a file can be uniquely indexed by its hash value.
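As an illustration of 2) and 3), the SHA1 that Git assigns to an object can be reproduced directly: the hash covers a short header of the form "<type> <length>\0" followed by the raw content, so identical file contents always map to the same blob SHA1. A minimal sketch in Python:

```python
import hashlib

def git_object_sha1(obj_type, content):
    """SHA1 of a Git object: hash of '<type> <length>\0' plus the raw content."""
    header = ("%s %d" % (obj_type, len(content))).encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

# The blob ID of the file content "hello world\n" is the same in every
# repository that contains it:
print(git_object_sha1("blob", b"hello world\n"))
# -> 3b18e512dba79e4c8300dd08aeb37f8e728b8dad (matches `git hash-object`)
```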
The invention adopts a code library design method oriented to whole-network code traceability detection: the Git objects of all open-source projects on the network that use Git are stored efficiently to obtain a code library usable for code traceability detection and analysis, and an efficient update scheme for the code library is provided at the same time. Specifically, the design and construction of the code library for whole-network code traceability detection comprises the following steps: project discovery, data extraction, data storage, code information mapping construction and data updating. The invention designs a storage mode, different from Git's own storage mode, in which Git objects are stored in blocks by type; this storage mode greatly reduces the storage space of the code library and improves the efficiency of whole-network retrieval, and is first proposed by the invention. The invention constructs relation mappings from a code file to its information (including the projects and commits containing it, the author and time that created it, and its file names), so that the whole-network information of a code file can be retrieved quickly. The invention provides an efficient update mode for the constructed ultra-large code library: a customized Git fetch protocol is implemented on the Libgit2 function library and, with the constructed code library as the back end, the customized protocol can correctly obtain the newly added Git objects of a remote repository at extremely low time and space cost. Finally, the invention also provides a rapid whole-network traceability detection scheme for code at file granularity.
The technical scheme of the invention is as follows:
A code library design method for rapid whole-network code traceability detection is characterized in that a code library is obtained by efficiently storing the Git objects of the open-source projects on the whole network that use Git, and the code library is updated efficiently. A storage mode in which Git objects are stored in blocks by type is proposed, to reduce the storage space of the code library and improve the efficiency of whole-network retrieval; relation mappings from code files to code file information are established, so that the whole-network information of a code file can be retrieved quickly; an efficient update mode is adopted for the constructed ultra-large code library, in which a customized Git fetch protocol implemented on the Libgit2 function library, with the constructed code library as the back end, efficiently obtains the newly added Git objects of remote repositories. The code library design for rapid whole-network code traceability detection comprises the following steps: project discovery, data extraction, data storage, code information mapping construction and data updating. The method specifically comprises:
A. Acquiring the whole-network open-source software project list through several project discovery methods;
Open-source software projects are mostly hosted on popular development collaboration platforms such as GitHub, Bitbucket, GitLab and SourceForge. The invention discovers projects by several methods, including using the APIs provided by the development collaboration platforms and parsing the platforms' web pages, and then takes the union of the discovered project sets as the final open-source project list.
In a specific implementation, this step can be completed on a common server (such as one with an Intel E5-2670 CPU) and has low hardware requirements. The invention packs the scripts of the project discovery process into a docker image.
B. Data extraction: downloading the projects in the open-source project list acquired in step A to the local machine and extracting the Git objects in them;
In a specific implementation, a copy of each remote repository is created locally via the git clone command. After the open-source projects are cloned in batches, all Git objects in the cloned projects are extracted in batches through Git.
Data extraction can be done in parallel on (cloud) servers. The invention uses Libgit2, the C language interface of Git, to list all Git objects in a project, then classifies the objects by type, and finally extracts the content of each object. The invention specifically uses a cluster of 36 nodes, each with a 16-core Intel E5-2670 CPU and 256 GB of memory, and each node starts 16 threads to complete the Git object extraction. One node can process about 50,000 projects in 2 hours. After the Git data of a cloned project has been extracted, the clone is deleted and a new clone-extraction process is started.
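For illustration (the invention itself uses the Libgit2 C interface), the same batch enumeration can be expressed with the Git command line: `git cat-file --batch-check --batch-all-objects` prints one "<sha1> <type> <size>" line per object in a clone. A minimal sketch:

```python
import subprocess
from collections import defaultdict

def list_git_objects(repo_path):
    """Group every object in a local clone by type."""
    out = subprocess.run(
        ["git", "-C", repo_path, "cat-file", "--batch-check", "--batch-all-objects"],
        capture_output=True, text=True, check=True,
    ).stdout
    objects = defaultdict(list)
    for line in out.splitlines():
        sha1, obj_type, _size = line.split()
        objects[obj_type].append(sha1)
    return objects  # keys: "commit", "tree", "blob", "tag"
```

The contents of the listed objects can then be read in bulk with `git cat-file --batch`.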
C. Git object data storage: the Git object data are stored in blocks, classified by Git object type, which reduces the data storage space and improves parallel processing efficiency; specifically:
a. binary files (such as PDFs and pictures) included in open-source projects are not saved;
b. the Git object data are stored classified by Git object type, i.e. the databases comprise a commit database, a tree database, a blob database (not containing binary blobs) and a tag database. This storage reduces the data storage space to the hundred-TB level, while also enabling quick retrieval of whether a datum is already stored in the code library.
c. The database for each type of Git object comprises cache data and content data, stored respectively in a cache database and a content database to accelerate retrieval; each type of database (i.e. the commit, tree, blob and tag databases) comprises a cache database and a content database, which can be divided into multiple parts (e.g. 128 parts) for parallelism. The cache database is used to quickly determine whether a given Git object is already stored in the database, which is necessary for data extraction (if the object already exists it is not extracted again, saving time). In addition, the cache database also helps determine whether a repository needs to be cloned: if a repository's heads (the commit objects pointed to by the branches in .git/refs/heads) are already in the cache database, no cloning is required.
d. The cache database is a key-value database; the content database is stored by splicing (appending), which makes it easy to update.
The cache database is a key-value database where the key is the SHA1 value of the Git object (20 bytes) and the value is the offset location and size of the Git object in the content database after compression with Perl's Compress library. The content database contains the compressed contents of Git objects spliced together continuously. Because the content database is stored by splicing, an update can be completed quickly: new content only needs to be appended to the end of the corresponding file. For the commit and tree objects, a random-lookup key-value database is additionally created for each, where the key is the SHA1 of the Git object and the value is the compressed content of the corresponding Git object. The random-query performance of the key-value database is high: each thread can query more than 170K Git objects per second.
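A minimal sketch of this cache/content pair (a Python dict stands in for the TokyoCabinet cache database, and zlib for the compression layer; both are assumptions of the sketch):

```python
import zlib

def store_object(sha1, raw, content_path, cache):
    """Append the compressed object to the content database file and record
    (offset, size) under its SHA1 in the cache database."""
    compressed = zlib.compress(raw)
    with open(content_path, "ab") as f:
        offset = f.tell()          # append position = current end of file
        f.write(compressed)
    cache[sha1] = (offset, len(compressed))

def load_object(sha1, content_path, cache):
    """Random lookup: seek to the recorded offset and decompress."""
    offset, size = cache[sha1]
    with open(content_path, "rb") as f:
        f.seek(offset)
        return zlib.decompress(f.read(size))
```

An update never rewrites existing data; new objects are simply appended at the end of the corresponding content file.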
e. Parallelization is achieved using the SHA1 values.
The invention uses the last 7 bits of the first byte of the SHA1 value of a Git object to partition each type of database into 128 parts. Thus there are 128 cache databases and 128 content databases for each of the four types of Git objects; in addition, the commit and tree objects each have 128 random-lookup key-value databases, giving 128 × (4 + 4 + 2) databases in total, which can be placed on one server to accelerate parallel operation. In a specific implementation, the size of a single content database ranges from 20 MB (tag objects) to 0.8 TB (blob objects), and the largest single cache database, for tree objects, is about 2 GB.
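The partitioning itself is a single bit operation; a sketch:

```python
def shard_of(sha1_hex, shards=128):
    """Low 7 bits of the first SHA1 byte -> one of 128 sub-databases."""
    return int(sha1_hex[:2], 16) & (shards - 1)  # shards must be a power of two
```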
f. The invention uses TokyoCabinet, a database written in C (similar to Berkeley DB).
TokyoCabinet, using hashes as its index, provides read-query performance roughly ten times faster than common key-value databases such as MongoDB or Cassandra. Its faster read-query speed and strong portability exactly match the construction requirements of a code library oriented to whole-network code traceability detection, so the invention adopts TokyoCabinet rather than a NoSQL database with more complete functions.
D. Code information mapping construction:
The code library of the invention aims to perform whole-network traceability detection on code quickly and to support analysis of the security and compliance of software projects; obtaining the whole-network information of a code file (the projects and commits containing it, its author, creation time and file names) is useful for a comprehensive assessment of the security and compliance of a software project.
The invention constructs relation mappings centered on commits, specifically:
building mutual mappings between commit and project, relation mappings from commit to author and time, a relation mapping from author to commit, mutual mappings between commit and code file (blob), and mutual mappings between commit and file name.
The list of projects containing a code file (blob) can be determined by composing the blob-to-commit and commit-to-project relations; the creation time of the blob can be determined by composing the blob-to-commit and commit-to-time relations, and the author of the blob by composing the blob-to-commit and commit-to-author relations.
A mutual mapping between code files and file names is also constructed to support the tracing of specific code fragments.
These relation mappings are saved in TokyoCabinet databases for quick retrieval. The invention again uses block storage to improve retrieval efficiency, specifically partitioning each type of relation mapping into 32 sub-databases. For commits and (code file) blobs, the last 5 bits of the first byte of the SHA1 are used for partitioning; for authors, projects and file names, the invention uses the last 5 bits of the first byte of their FNV-1 hash.
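A minimal sketch of this partitioning scheme (the 32-bit FNV-1 constants are standard; treating the high-order byte of the 32-bit hash as its "first byte" is an assumption of the sketch):

```python
def fnv1_32(data):
    """32-bit FNV-1 hash: multiply by the FNV prime, then XOR each byte."""
    h = 0x811C9DC5                          # FNV offset basis
    for b in data:
        h = (h * 0x01000193) & 0xFFFFFFFF   # FNV prime
        h ^= b
    return h

def map_shard(key, kind):
    """Pick one of 32 sub-databases for a relation-mapping entry."""
    if kind in ("commit", "blob"):
        first = int(key[:2], 16)             # first byte of the SHA1
    else:                                    # author, project or filename
        first = fnv1_32(key.encode()) >> 24  # first byte of the FNV-1 hash
    return first & 0x1F                      # low 5 bits -> 32 shards
```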
E. Data update
Git objects are immutable (existing Git objects remain unchanged; only new Git objects appear), so only the new Git objects need to be acquired. The invention specifically uses two methods to update the code library:
a. New Git projects are identified and cloned, and their Git objects are then extracted.
b. Updated projects are identified by obtaining the latest commits of the branches of the remote repositories of the already-collected repositories; the Git fetch protocol is then modified so that, with the built code library as the back end and without a local Git repository (the cloned Git repositories were deleted after data extraction in step B to save space), the protocol can obtain the newly added Git objects of a remote repository and extract them into the code library. The invention reconstructs the flow of git fetch from the source code implementing the git fetch function in Libgit2, specifically:
b1) The remote repository is added to the local repository. A remote repository is represented in Libgit2 by a git_remote structure; when it is created, all branch references in the .git/refs/heads folder of the local repository are populated into a member variable (refs) of the structure;
b2) the local repository establishes a connection to the remote repository;
b3) after the connection is established, the remote repository replies, sending all of its branch references (the contents of its .git/refs/heads folder) to the local side;
b4) after receiving the references sent back by the remote repository, the local repository checks one by one whether the objects they point to are present locally; if so, it marks them, indicating that the branch has not been updated and the remote repository need not be asked to send updates. These references are then inserted into the member variable mentioned in step b1);
b5) after the local repository has checked all these references, the member variable (including the marked references) is sent back to the remote repository to "negotiate" with it. Here the local side waits for an ACK signal from the remote repository. The way Libgit2 waits is to sort the commit objects in the local repository in chronological order and then traverse from the most recent commit: for each commit object it tells the remote repository that this object exists locally, and it also sends the references that have been checked. This is repeated up to 256 times until an ACK signal is received from the remote repository.
b6) After negotiating with the remote repository (i.e. telling it which commits are present locally and what is wanted), the remote repository can calculate which Git objects to send back. It packs these objects into a packfile-format file and sends it back to the local side.
b7) After receiving the returned data, the local repository parses it according to the packfile format and constructs a corresponding index file to facilitate retrieval. When the index file is constructed, it must be restored with reference to the Git objects in the local Git repository.
As can be seen from these git fetch steps, apart from steps b5) and b7), no step involves Git objects other than those pointed to by branch references, and git fetch determines whether the remote repository has been updated by comparing the remote repository's branch references with the local ones. We propose to modify git fetch as follows:
1) Modify original git fetch step b3): save the branch references sent back by the remote repository locally and check whether they are already stored in the local code library. If so, the remote repository has not been updated; if not, the remote repository has been updated, and the next step is entered.
2) Modify original git fetch step b5): the original git fetch protocol sorts and sends commits to the remote repository only in order to wait for the remote repository's ACK signal, with no other role, so the invention changes the waiting method: the latest commit object of the primary branch is sent each time, repeated at most 256 times until the remote repository's ACK signal is received.
3) Modify original git fetch step b6): save the packfile-format file sent back by the remote repository locally and parse it against the Git objects in the code library; step b7) is not performed.
With git fetch so modified, updates can be performed with the constructed code library as the back end: a complete repository need not be cloned for every update, and network bandwidth overhead and time overhead are reduced.
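A minimal sketch of the update decision in modified step b3), assuming remote_refs is the branch-to-SHA1 mapping sent back in the remote repository's reply and commit_cache supports membership tests like the commit cache database:

```python
def remote_has_new_objects(remote_refs, commit_cache):
    """Modified step b3): the remote repository carries new Git objects only
    if some branch head it sent back is absent from the commit cache database."""
    return any(sha1 not in commit_cache for sha1 in remote_refs.values())
```

Only when this check succeeds does the customized protocol proceed to the negotiation of step b5) and request a packfile.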
In a specific implementation, the invention also provides a code-library-based method for rapid whole-network traceability detection of code at file granularity, comprising the following steps:
1) for a code file, compute its SHA1 value;
2) according to the code information mappings constructed in step D, with the SHA1 of the code file as the key, query the whole-network information of the code file, including the list of projects and commits containing it and the corresponding file names and authors, and feed this information back to the user.
Compared with the prior art, the invention has the beneficial effects that:
the code base design provided by the invention can support efficient whole-network traceability detection on the code. Through the technical scheme and the embodiment provided by the invention, the construction of the local code library for the open source Git warehouse on a plurality of code hosting platforms including the GitHub in the whole network can be completed without a large number of servers; incremental updates to the code base can be accomplished without requiring particularly much bandwidth.
The technical scheme and the embodiment provide detailed guidance for constructing the code base for the whole-network code traceability detection, and make up for the vacancy of a massive code base construction technology in the field of code traceability detection.
Drawings
FIG. 1 is a flowchart of the code library design method for rapid whole-network code traceability detection in an embodiment of the present invention.
FIG. 2 is a flowchart of the code library update policy in an embodiment of the present invention.
FIG. 3 is a block flow diagram of a customized git fetch process in an embodiment of the present invention.
FIG. 4 is a block diagram illustrating the process of obtaining remote repository updates based on the customized git fetch protocol in an embodiment of the present invention.
FIG. 5 is a flowchart of the rapid whole-network code traceability detection method based on the constructed code library in an embodiment of the present invention.
Detailed Description
The invention will be further described below by way of examples with reference to the accompanying drawings, without in any way limiting the scope of the invention.
The invention provides a code library design method for rapid whole-network code traceability detection, which specifically comprises the following steps:
A. and acquiring a full-network open source software project list by a plurality of project discovery methods. The realization method comprises the following steps:
at present, most open source software projects are hosted in some popular development and cooperation platforms such as GitHub, Bitbucket, GitLab and SourceForge. Still some open source items are hosted on a website of an individual or a particular item. Therefore, to support full traceback detection of code, it is desirable to obtain as complete a list of open source items as possible. To address this challenge, the present invention incorporates methods such as utilizing platform-provided APIs, parsing the web pages of the platform, etc. to discover items. And finally, taking the union of the item sets discovered by the methods as a final open source item list.
B. Data extraction: download the projects in the open-source project list of step A to the local machine and extract the Git objects in them;
This step is responsible for downloading the projects found in step A and extracting the Git objects from them. A copy of each remote repository is created locally via the git clone command; after the projects are cloned in batches, all Git objects in the cloned projects are extracted in batches through Git. This step can be done in parallel on (cloud) servers.
C. Data storage: store by Git object type, in blocks, to reduce the data storage space and improve parallel processing efficiency;
Because of code reuse, the pull-request development pattern, and so on, there may be many duplicate Git objects across open-source projects. Meanwhile, open-source projects may also contain many binary files, such as PDFs and pictures. Without removing such redundancy and binary files, the required data storage space is estimated to exceed 1.5 PB, and such a huge amount of data would make the code traceability task almost impossible. To avoid redundancy of Git objects across repositories, and because the code library is designed for whole-network code traceability detection, the invention does not store binary files and stores Git objects by type, i.e. in a commit database, a tree database, a blob database (not containing binary blobs) and a tag database. This storage mode reduces the data storage space to the hundred-TB level and allows quick retrieval of whether a datum is already stored in the code library.
D. Code information mapping construction:
The code library aims to perform whole-network traceability detection on code quickly and to support analysis of the security and compliance of software projects. For this purpose the invention constructs relation mappings from a code file (blob) to the projects containing it, to the commits containing it, to its authors, to its file names and to its creation times, and saves them in database form, so that the whole-network information of a code file, such as the projects and commits containing it, the authors who created it and its creation times, can be obtained quickly; this realizes the construction of the code information mapping. Obtaining this information for a code file is useful for a comprehensive assessment of the security and compliance of a software project.
E. Data update
Keeping the code library up to date is vital for the code traceability detection task. With the growth of existing repositories and the appearance of new ones, the process of cloning all repositories takes longer and longer. At present, to clone all Git repositories (over 130 million, including forks), the total time is estimated to require six hundred single-threaded servers running for one week, and the result would occupy more than 1.5 PB of disk space. Fortunately, Git objects are immutable (existing Git objects remain unchanged; only new Git objects appear), so only the new Git objects need to be fetched. Specifically, the invention proposes updating the code library with two strategies:
1. New Git projects are identified and cloned, and their Git objects are then extracted.
2. Updated projects are identified by obtaining the latest commits of all branches of the remote repositories of the already-collected repositories; the Git fetch protocol is then modified so that, with the built code library as the back end and without a local Git repository (the cloned Git repositories were deleted after data extraction in step B to save space), the protocol can obtain the updates of a remote repository and extract the newly added Git objects into the code library. The invention reconstructs the flow of git fetch from the source code implementing the git fetch function in Libgit2; as shown in FIG. 2, it specifically comprises the following 7 steps:
1) The remote repository is added to the local repository. A remote repository is represented in Libgit2 by a git_remote structure; when it is created, all branch references in the .git/refs/heads folder of the local repository are populated into a member variable (refs) of the structure;
2) the local repository establishes a connection to the remote repository;
3) after the connection is established, the remote repository replies, sending all of its branch references (the contents of its .git/refs/heads folder) to the local side;
4) after receiving the references sent back by the remote repository, the local repository checks one by one whether the objects they point to are present locally; if so, it marks them, indicating that the branch has not been updated and the remote repository need not be asked to send updates. These references are then inserted into the member variable mentioned in step 1);
5) after the local repository has checked all these references, the member variable (including the marked references) is sent back to the remote repository to "negotiate" with it. Here the local side waits for an ACK signal from the remote repository. The way Libgit2 waits is to sort the commit objects in the local repository in chronological order and then traverse from the most recent commit: for each commit object it tells the remote repository that this object exists locally, and it also sends the references that have been checked. This is repeated up to 256 times until an ACK signal is received from the remote repository.
6) After negotiating with the remote repository (i.e. telling it what the latest commits of the local branches are and what is wanted), the remote repository can calculate which Git objects to send back. It packs these objects into a packfile-format file and sends it back to the local side.
7) After receiving the returned data, the local repository parses it according to the packfile format and constructs a corresponding index file to facilitate retrieval. When the index file is constructed, it must be restored with reference to the Git objects in the local Git repository.
As can be seen from these git fetch steps, apart from steps 5) and 7), no step involves Git objects other than those pointed to by branch references, and git fetch determines whether the remote repository has been updated by comparing the remote repository's branch references with the local repository's. The invention proposes to modify git fetch as follows:
1) Modify original git fetch step 3): save the branch references sent back by the remote repository locally and check whether they are already stored in the local code library. If so, the remote repository has no newly added Git object data; if not, enter the next step.
2) Modify original git fetch step 5): the original git fetch protocol sorts and sends commits to the remote repository only in order to wait for the remote repository's ACK signal, with no other role, so the invention changes the waiting method: the latest commit object of the primary branch is sent each time, repeated at most 256 times until the remote repository's ACK signal is received.
3) Modify original git fetch step 6): save the packfile-format file sent back by the remote repository locally and parse it against the Git objects in the code library; step 7) is not performed.
With git fetch so modified, updates can be performed with the constructed code library as the back end: a complete repository need not be cloned for every update, and network bandwidth and time overhead are reduced.
Finally, the invention provides a rapid whole-network traceability detection scheme for code at file granularity, comprising the following two steps:
1. for a code file, compute its SHA1 value;
2. according to the code information mapping databases constructed in step D, with the SHA1 of the code file as the key, query the whole-network information of the code file, including the list of projects and commits containing it and the corresponding file names and authors, and feed this information back to the user.
Preferably, step B uses Libgit2, the C language interface of Git (because C is more efficient and faster), to complete the extraction task.
Preferably, the TokyoCabinet database is used in steps C and D.
Preferably, step E uses Libgit2, the C language interface of Git, to implement the customized Git fetch protocol.
FIG. 1 is a flowchart of the code library design method for rapid whole-network code traceability detection in an embodiment of the present invention; the specific implementation steps are as follows:
A. Project discovery:
In order to obtain as complete an open-source project list as possible, the invention combines a variety of heuristics, including using the APIs of the development collaboration platforms and parsing the platforms' web pages, to discover projects, and finally takes the union of the project sets discovered by these methods as the final open-source project list. The invention packs the scripts of the project discovery process into a docker image. Specifically, the project discovery methods adopted by the invention are as follows:
1. Using the APIs of the development collaboration platforms. Some code hosting platforms such as GitHub provide APIs that can be used to discover the complete set of open-source projects on the platform. These APIs are platform-specific and used differently, so different API queries must be designed for different platforms. These APIs generally impose access-rate limits per user or IP address, which can be overcome by building a pool of user IDs. For the GitHub platform, the list of updated GitHub repositories is obtained with GitHub's GraphQL API: the time period for which repositories are needed is divided equally by the number of user IDs in the pool, each user ID is responsible for obtaining the repositories updated in one time period, and the search condition is of the form {is:public archived:false pushed:start_time..end_time}, where start_time and end_time step through the period in 10-minute intervals, obtaining the repositories updated in each 10-minute interval. For the Bitbucket platform, the API query used is https://api.bitbucket.org/2.0/repositories?after=date; replacing date with a specific time such as 2017-11-18 returns the Bitbucket repositories created after 2017-11-18. For the SourceForge platform, the platform provides a project list in XML format at https://sourceforge.net/sitemap.xml, and downloading and parsing the XML yields all project lists on SourceForge. For the GitLab platform, the API query used is of the form https://gitlab.com/api/v4/projects?archived=false&membership=false&order_by=created_at&owned=false&page={}&per_page=99&simple=false&sort=desc&starred=false&with_issues_enabled=false&with_merge_requests_enabled=false, where the page parameter starts at 1 and is then incremented to obtain all projects on GitLab.
2. Parsing platform web pages. For the Bioconductor platform, all projects can be obtained by parsing the page at http://git.bioconductor.org; for the repo.or.cz platform, by parsing https://repo.or.cz/?a=project_list; for the Android platform, by parsing the Android Googlesource page (https://android.googlesource.com); for the ZX2C4 platform, by parsing https://git.zx2c4.com; for the Eclipse platform, by parsing http://git.eclipse.org; for the PostgreSQL platform, by parsing http://git.postgresql.org; for the kernel.org platform, by parsing http://git.kernel.org; and for the Savannah platform, by parsing http://git.savannah.gnu.org/cgit.
This step can be done on a common server (e.g., one with an Intel E5-2670 CPU) with low hardware requirements. By September 2020, we had retrieved over 130 million distinct repositories (excluding GitHub repositories labeled as forks and repositories without content).
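As an illustrative sketch of the API-based discovery above, the Bitbucket listing can be walked page by page (the after parameter and the next field belong to Bitbucket's public 2.0 API; the user-ID pool and error handling are omitted):

```python
import requests

def bitbucket_repos_after(date):
    """Yield full names of Bitbucket repositories created after `date`,
    following the `next` link of each JSON page."""
    url = "https://api.bitbucket.org/2.0/repositories?after=" + date
    while url:
        page = requests.get(url).json()
        for repo in page.get("values", []):
            yield repo["full_name"]
        url = page.get("next")

# Example: list repositories created after 2017-11-18.
# for name in bitbucket_repos_after("2017-11-18"): print(name)
```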
B. Data extraction:
This step can be done in parallel on a very large number of servers, but requires a large amount of network bandwidth and storage space. The remote repositories are cloned locally in batches through the git clone command; measurements show that, without network bandwidth limitations, a single-threaded shell process on an Intel E5-2670 CPU server can clone 20,000 to 50,000 randomly selected projects in 24 hours (the time varies greatly with repository size and platform). Cloning all projects (over 130 million) within a week would require about 400 servers, which is costly. Thus, the invention optimizes retrieval by running multiple threads on each server and re-retrieving only the small fraction of repositories that have changed since the last retrieval. The invention currently completes the cloning task using 5 data-transfer nodes on a computing cluster platform with 300 nodes and bandwidth up to 56 Gb/s. Alternatively, this step can be completed with cloud servers instead of a computing cluster: customized cloud resources meeting the requirements can be purchased at cloning time and released after batch cloning finishes. Cloud servers can reach higher bandwidth, making cloning faster.
After a project has been cloned locally, all Git objects in it must be extracted. The Git client can only display the contents of Git objects one by one, which is not conducive to automated batch processing. The invention uses Libgit2, the C language interface of Git, to list all Git objects in a project, then classifies the objects by type and finally extracts the content of each object. At present, the Git object extraction is completed on a cluster of 36 nodes, each with a 16-core Intel E5-2670 CPU and 256 GB of memory, with each node running 16 threads. One node can process about 50,000 projects in 2 hours. After the Git data of a cloned project has been extracted, the clone is deleted and a new clone-extraction process is started.
C. Data storage: store by Git object type, in blocks, and do not store binary files, so as to reduce the data storage space and accelerate parallel processing.
The method stores objects separately by Git object type to avoid redundancy and reduce storage cost. Oriented to code traceability detection, binary files are not stored. The database of each Git object type comprises cache data and content data, stored respectively in a cache database and a content database to accelerate retrieval. To allow parallelism, the cache database and content database of each Git object type may be divided into multiple parts (e.g., 128 parts); the content database is stored by splicing so that it is easy to update.
Specifically, the invention stores objects separately by Git object type to avoid redundancy, so there are 4 types of databases in all: a commit database, a blob database, a tree database and a tag database. Each database contains cache data and content data, stored respectively in the cache database and the content database. The cache database is used to quickly determine whether a particular object is already stored in our database and is necessary for the data extraction described above (if the object already exists, it is not extracted again, saving time). In addition, the cache database also helps determine whether a repository needs to be cloned: if the heads of a repository (the commit objects pointed to by the branches in .git/refs/heads) are already in our cache database, the repository has not been updated and there is no need to clone it.
The cache database is a key-value database where the key is the SHA1 value of the Git object (20 bytes) and the value is the offset location and size of the Git object in the content database after compression with Perl's Compress library. The content database contains the compressed contents of Git objects spliced together continuously. Because the content database is stored by splicing, an update can be completed quickly: new content only needs to be appended to the end of the corresponding file. Although this storage method can scan the entire database quickly, it is not the best choice for the random lookups required. For example, when computing the modification made by a commit, we need to access the commit database twice to obtain the tree object pointed to by the commit object and the tree object pointed to by its parent commit, then access the tree database many times to obtain the contents of the two trees and find the differing files, and finally access the blob database to compute the modification; each access incurs repeated extra time overhead. Thus, for commits and trees respectively, the invention additionally creates a random-lookup key-value database, where the key is the SHA1 of the Git object and the value is the compressed content of the corresponding Git object. The random-query performance of the key-value database is fast; tests show that a single thread on a server with an Intel E5-2623 CPU can randomly query 1 million Git objects within 6 seconds, i.e., each thread queries more than 170K Git objects per second.
At present, the invention has retrieved more than 20 billion Git objects (including more than 2.3 billion commit objects, more than 9.1 billion blob objects, more than 9.4 billion tree objects and more than 18 million tag objects), with a data storage space of about 150 TB. Processing such a large volume of data becomes particularly inefficient if it is not processed in parallel. The invention achieves parallelization using the SHA1 values: the last 7 bits of the first byte of the SHA1 value of a Git object partition each type of database into 128 parts. Thus there are 128 cache databases and 128 content databases for each of the four types of Git objects; in addition, the commit and tree objects each have 128 random-lookup key-value databases, giving 128 × (4 + 4 + 2) databases in total, which can be placed on one server to accelerate parallel operation. Currently, the size of a single content database ranges from 20 MB (tag objects) to 0.8 TB (blob objects), and the largest single cache database, for tree objects, is about 2 GB.
However, the size of the database limits the choice of database system. For example, a graph database like neo4j is very useful for storing and querying relationships, including transitive relationships, but it cannot (at least on a common server) handle billions of relationships. Besides neo4j, the invention has tried many conventional databases, evaluating the common relational databases MySQL and PostgreSQL and the key-value (NoSQL) databases MongoDB, Redis and Cassandra. Like all centralized databases, SQL databases have limitations in dealing with PB-level data. The invention therefore focuses on NoSQL databases, which are designed for large-scale data storage and massively parallel data processing on large numbers of commodity servers.
After testing, the invention uses TokyoCabinet, a database written in C (similar to Berkeley DB). TokyoCabinet, using hashes as its index, provides read-query performance roughly ten times faster than common key-value databases such as MongoDB or Cassandra. Its faster read-query speed and strong portability exactly match the construction requirements of a code library oriented to whole-network code traceability detection, so the invention adopts it instead of a NoSQL database with more complete functions.
D. Code information mapping construction, comprising:
Design and generate relation mappings that can quickly map a code file (blob) to its information, where the information comprises the projects and commits containing the blob, the author and time that created it, and its file names; the mappings are saved in database form so that the whole-network information of a blob can be retrieved quickly.
The code library aims to perform whole-network traceability detection on code quickly and to support analysis of the security and compliance of software projects. Therefore, the invention generates relation mappings from a code file (blob) to its information (including the projects and commits containing it, the author and time that created it, and its file names) and saves them in database form, so that the whole-network information of a code file can be retrieved. The whole-network information of a code file is useful for a comprehensive assessment of the security and compliance of a software project and is an important part of whole-network code traceability detection.
The information of a code file includes the projects and commits containing it, its file names, and the author and time that created it. The author and creation time are contained in the commit that created it, and the commit-to-project and project-to-commit relation mappings can be completed in step B. Therefore, the invention constructs relation mappings centered on commits, specifically: mutual mappings between commit and project, relation mappings from commit to author and time, a relation mapping from author to commit, mutual mappings between commit and code file (blob), and mutual mappings between commit and file name. The list of projects containing a code file (blob) can then be determined by composing the blob-to-commit and commit-to-project relations; similarly, the creation time of a blob can be determined by composing the blob-to-commit and commit-to-time relations, and the author of a blob by composing the blob-to-commit and commit-to-author relations.
The mappings from commit to author, time and project are not difficult to implement, because author and time are part of the commit object, and the mapping between commit and project is available from the data extraction of step B. But the code files (blobs) introduced or deleted by a commit have no direct relation to the commit and must be computed by recursively traversing the tree objects of the commit and of its parent commit. A commit contains a snapshot of the repository, containing all trees (folders) and blobs (code files). To compute the difference between a commit and its parent commit, i.e., the new code files (blobs), we start from the tree objects pointed to by the two commit objects, traverse each subtree, and extract all code files (blobs). By comparing all the code files (blobs) of each commit, the new blobs introduced by the commit are obtained. On average, obtaining the file names and blobs changed by ten thousand commits takes a single thread approximately 1 minute. It is estimated that a single thread would need 104 days for all the collected commits, which can be done within a week by running 16 threads on a server with a 16-core Intel E5-2623 CPU. Moreover, these relations are incremental and need to be generated only once; afterwards, the above operations are performed on the commits of each update and the results are inserted into the existing databases. The correspondence between a code file (blob) and its file name cannot be determined by composing the blob-to-commit and commit-to-filename relations, because one commit may modify several files; the invention therefore also constructs mutual relation mappings between code files and file names to support the tracing of specific code fragments. For example, if a piece of Python code is to be checked for traceability, all Python files must be checked; the filename-to-code-file mapping can fetch all Python files ending in .py, and code tracing checks are then performed on these files.
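A simplified sketch of this traversal using the Git command line (the invention uses Libgit2; the sketch assumes a single-parent commit and ignores submodule entries):

```python
import subprocess

def run_git(repo, *args):
    return subprocess.run(["git", "-C", repo, *args],
                          capture_output=True, text=True, check=True).stdout

def blobs_of(repo, tree, prefix=""):
    """Recursively collect {path: blob SHA1} for a tree; each line of
    `git cat-file -p <tree>` reads "<mode> <type> <sha1>\t<name>"."""
    result = {}
    for line in run_git(repo, "cat-file", "-p", tree).splitlines():
        meta, name = line.split("\t", 1)
        _mode, otype, sha1 = meta.split()
        if otype == "tree":
            result.update(blobs_of(repo, sha1, prefix + name + "/"))
        elif otype == "blob":
            result[prefix + name] = sha1
    return result

def new_blobs(repo, commit, parent):
    """Blobs introduced by `commit` relative to its parent commit."""
    old = set(blobs_of(repo, parent + "^{tree}").values())
    new = blobs_of(repo, commit + "^{tree}")
    return {path: sha1 for path, sha1 in new.items() if sha1 not in old}
```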
Similar to the data storage of step C, the invention uses TokyoCabinet databases to store the relation mappings for quick retrieval, and uses block storage to improve retrieval efficiency, specifically dividing each type of relation mapping into 32 sub-databases. For commits and (code file) blobs, the invention uses the last 5 bits of the first byte of the SHA1 for partitioning; for authors, projects and file names, it uses the last 5 bits of the first byte of their FNV-1 hash.
E. Data update
Keeping the code library up to date is vital for the code traceability detection task. To obtain an acceptable update time, the invention updates the data as follows:
1. Identify new Git projects, clone them, and then extract their Git objects. Compare the new open-source project list discovered in step A with the previous list, determine the newly added projects, clone them locally, and extract their Git objects.
2. Identify updated projects, then fetch only their newly added Git objects and extract them. The invention modifies the git fetch protocol based on Libgit2 as follows:
1) Modify original git fetch step 3): save the branch references sent back by the remote repository locally. After the filter_wants function in Libgit2's src/fetch.c file calls the git_remote_ls function, the SHA1 values of the heads sent back by the remote repository and received by git_remote_ls are saved to a file.
2) Modify original git fetch step 5): modify the git_smart__negotiate_fetch function in Libgit2's src/transports/smart_protocol.c file: comment out the call to git_revwalk_next and add a git_reference_name_to_id call, so that the latest commit object of the primary branch is sent each time, repeated up to 256 times until an ACK signal is received from the remote repository.
3) Modify original git fetch step 6): modify the pack-download function in Libgit2's src/transports/smart_protocol.c file (git_smart__download_pack) so that the packfile data sent back by the remote repository (the git_pkt_data / git_pkt_progress packets) are saved to a local file and the function returns directly, without performing step 7).
After these modifications, the Libgit2 library is recompiled, and the modified Git fetch protocol is then used to obtain the newly added Git object data of a remote repository. The specific steps are as follows:
1. initializing an empty Git warehouse
2. Extract the SHA1 values and contents of all branch references of a repository from the built code library and fill them into the empty Git repository. The filling procedure is: construct the header of the object, in the format: object type + space + object content length + one null byte, e.g. "blob 12\0"; then splice the header and the original data together and compress the spliced content with zlib's compress function; finally, create a subdirectory named after the first 2 characters of the SHA1 value in the .git/objects folder of the empty Git repository, create a file named after the last 38 characters of the SHA1 in that subdirectory, and write the compressed content into this file.
3. Create a file named after the branch (e.g., master) in the .git/refs/heads folder of this empty Git repository, and write the SHA1 value of the commit referenced by the branch into this file. A minimal sketch of steps 2 and 3 follows.
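The sketch below assumes Python 3 and the standard loose-object layout of Git; the helper names are illustrative:

    import hashlib
    import os
    import zlib

    def write_loose_object(repo_dir, obj_type, raw):
        """Store one Git object: header + data, zlib-compressed, under .git/objects."""
        header = b"%s %d\x00" % (obj_type, len(raw))  # e.g. b"blob 12\x00"
        store = header + raw
        sha1 = hashlib.sha1(store).hexdigest()
        subdir = os.path.join(repo_dir, ".git", "objects", sha1[:2])
        os.makedirs(subdir, exist_ok=True)
        with open(os.path.join(subdir, sha1[2:]), "wb") as f:  # last 38 hex chars
            f.write(zlib.compress(store))
        return sha1

    def write_branch_ref(repo_dir, branch, commit_sha1):
        """Step 3: point .git/refs/heads/<branch> at the given commit."""
        ref_dir = os.path.join(repo_dir, ".git", "refs", "heads")
        os.makedirs(ref_dir, exist_ok=True)
        with open(os.path.join(ref_dir, branch), "w") as f:
            f.write(commit_sha1 + "\n")

For example, write_loose_object(".", b"commit", raw_commit_bytes) followed by write_branch_ref(".", "master", sha1) reproduces the filling described above.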
The invention splices the data of each newly added Git object directly into the corresponding content database according to its type and SHA1 value, records its SHA1 value in the cache database together with its offset and size in the content file, and updates the corresponding relational mapping database accordingly; a sketch of this append operation is given below.
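In the sketch, a plain dict stands in for the TokyoCabinet cache database and the zlib compression call is illustrative:

    import zlib

    def append_git_object(content_file, cache_db, sha1_hex, raw_content):
        """Splice one new Git object onto the content database of its type."""
        compressed = zlib.compress(raw_content)
        offset = content_file.seek(0, 2)       # move to end of file: append-only
        content_file.write(compressed)
        cache_db[sha1_hex] = (offset, len(compressed))  # key: SHA1; value: offset, size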
After the code library is constructed, rapid full-network traceability detection can be performed on code at file granularity. The steps are as follows:
1. Calculate the SHA1 value of the code file. Here the calculation is performed with the sha1 function of Python 2's hashlib library. For example, the file https://github.com/fchollet/deep-learning-models/blob/master/resnet50.py contains an implementation of the deep learning model ResNet50; its SHA1 value is calculated as e8cf3d7c248fbf6608c4947dc53cf368449c8c5f (a hashing sketch follows).
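The sketch below is in Python 3 (the patent uses Python 2's hashlib) and assumes the code file is hashed as a Git blob, i.e. with the same "blob <size>\0" header used for loose objects above:

    import hashlib

    def code_file_sha1(path):
        """Git-blob-style SHA1 of a code file, usable as the code-library key."""
        with open(path, "rb") as f:
            data = f.read()
        header = b"blob %d\x00" % len(data)  # same header format as loose objects
        return hashlib.sha1(header + data).hexdigest()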
2. Using the code information mapping databases constructed in step D, with the SHA1 of the code file as the key, query the full-network information of the code file, including the list of projects containing it, the list of commits containing it, and the corresponding filenames and authors, and feed this information back to the user. In the example above, 192 commits containing the blob are obtained through the blob-to-commit mapping, and 377 projects containing the blob are obtained through the commit-to-project mapping. The whole process takes only 0.831 s (a lookup sketch follows).
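In the sketch, open_shard is a hypothetical helper returning a dict-like view of one TokyoCabinet sub-database, shard_for_sha1 is the partitioning function sketched earlier, and the semicolon-separated value encoding is an assumption for illustration:

    def trace_blob(blob_sha1):
        """Full-network lookup: blob -> containing commits -> containing projects."""
        b2c = open_shard("blob_to_commit", shard_for_sha1(blob_sha1))
        commits = b2c[blob_sha1].split(";")               # e.g. 192 commits
        projects = set()
        for commit_sha1 in commits:
            c2p = open_shard("commit_to_project", shard_for_sha1(commit_sha1))
            projects.update(c2p[commit_sha1].split(";"))  # e.g. 377 projects
        return commits, sorted(projects)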
It is noted that the disclosed embodiments are intended to aid further understanding of the invention, but those skilled in the art will appreciate that various alternatives and modifications are possible without departing from the spirit of the invention and the scope of the appended claims. Therefore, the invention should not be limited to the disclosed embodiments; the scope of protection is defined by the appended claims.

Claims (10)

1. A code library design method for rapid full-network code traceability detection, characterized in that a code library is obtained by efficiently storing the Git objects of the full network's open-source Git projects through the processes of project discovery, data extraction, data storage, code information mapping construction and data updating, and efficient updating of the code library is realized;
the method comprises: adopting chunked storage according to the type of the Git object; establishing relational mappings from code files to code file information, enabling quick retrieval of the full-network information of a code file; and adopting an efficient update scheme for the built ultra-large-scale code library, in which a customized git fetch protocol is proposed based on the Libgit2 function library and the newly added Git object data of remote repositories is obtained efficiently with the built ultra-large-scale code library as the backend;
the method specifically comprises the following steps:
A. acquire a full-network open-source software project list using one server through multiple project discovery methods, and package the scripts of the project discovery process into a Docker image;
B. data extraction: download the projects in the open-source project list acquired in step A to the local machine and extract the Git objects therein; the extraction is completed in a multithreaded, parallel manner on a server cluster;
C. Git object data storage: store the Git objects by type and in chunks, reducing the data storage space and improving parallel processing efficiency; specifically comprising:
a. binary files contained in the open-source projects are not saved;
b. store the Git object data classified by Git object type, i.e. the database types comprise a commit database, a tree database, a blob database and a tag database, reducing the data storage space to the hundred-TB level while enabling a quick check of whether given data is already stored in the code library;
c. the database for each type of Git object comprises cache data and content data, stored respectively in a cache database and a content database to speed up retrieval; the cache database and the content database of each type are divided into several parts for parallel use; the cache database is used to quickly determine whether a given Git object is already stored in the database, which is necessary for data extraction: if a Git object already exists in the database, it is not extracted; the cache database is also used to determine whether a repository needs to be cloned: if the commit object pointed to by the head of a repository is already in the cache database, no cloning is required;
d. the cache database is a key-value database; the content database is stored in a spliced, append-only manner for easy updating;
the key in the cache database is the SHA1 value of the Git object, and the value is the offset and size of the Git object in the content database after compression by the Perl compression library;
the content database contains the compressed contents of Git objects spliced together continuously; because the content database is stored by splicing, updating only requires appending new content to the end of the corresponding file;
a random-lookup key-value database is additionally created for the commit objects and the tree objects respectively, where the key is the SHA1 of the Git object and the value is the compressed content of the corresponding Git object;
e. each type of database is divided into several parts by the SHA1 value to achieve parallel acceleration;
f. the hash-indexed database TokyoCabinet is used;
D. construct the code information relational mappings with commit at the center, comprising: relational mappings from a code file to the projects containing it, from a code file to the commits containing it, from a code file to its authors, from a code file to its filenames, and from a code file to its creation time; the relational mappings are stored in chunks using the TokyoCabinet database for quick retrieval;
E. acquire new Git objects and update the data of the code library, by two methods:
a. identify new Git projects, clone them, and extract the Git objects therein;
b. identify updated projects by obtaining the latest commits on the branches of the remote repositories of the already-collected repositories; then, through the modified git fetch steps, obtain the newly added Git objects of the remote repository with the built code library as the backend even when no local Git repository exists, and extract the newly added Git objects into the code library; specifically comprising:
b1) add the remote repository to the local repository; the remote repository is represented in Libgit2 by a git_remote structure; when the structure is created, all branch references in the .git/refs/heads folder of the local repository are filled into a refs member variable of the structure;
b2) establish a connection from the local repository to the remote repository;
b3) once the connection is established, the remote repository replies by sending all of its branch references, i.e. the contents of its .git/refs/heads folder, to the local machine;
the branch references sent back by the remote repository are saved locally and checked against the local code library: if they are already stored, the remote repository has not been updated; if not, the remote repository has been updated, and the process proceeds to the next step;
b4) after receiving the references sent back by the remote repository, the local repository checks one by one whether the objects they point to exist locally; if an object is already in the local repository, it is marked to indicate that the remote repository need not be asked to send its update, and the references are then inserted into the member variable;
b5) the local repository sorts the commits and sends the member variable, including the marked references, back to the remote repository for negotiation, then waits for an ACK signal from the remote repository; specifically, the latest commit object of the main branch is sent each time, and this is repeated several times until the ACK signal from the remote repository is received;
b6) after the negotiation, the remote repository calculates the Git objects that need to be sent back, packs them into a file in packfile format, and sends the file back to the local machine;
the packfile sent back by the remote repository is saved locally and parsed against the Git objects already in the code library;
through these modified git fetch steps, git fetch performs updates with the built code library as the backend, so that a complete repository need not be cloned for every update, reducing both network bandwidth overhead and time overhead.
2. The code library design method for rapid full-network code traceability detection as claimed in claim 1, wherein a full-network code traceability detection method based on the code library at file granularity comprises the following steps:
1) calculating the SHA1 value of a code file;
2) according to the code information mapping constructed in step D, with the SHA1 of the code file as the key, querying the full-network information of the code file, including the list of projects and commits containing it and the corresponding filename and author information, and feeding the information back to the user.
3. The method as claimed in claim 1, wherein in step A the server comprises an Intel E5-2670 CPU server; the development collaboration platforms hosting the open-source software projects comprise GitHub, Bitbucket, GitLab and SourceForge; and acquiring the full-network open-source software project list through multiple project discovery methods comprises: obtaining open-source project lists both through the APIs (application programming interfaces) provided by the development collaboration platforms and by parsing the platforms' web pages, and taking the union of the discovered project sets as the final open-source project list.
4. The method as claimed in claim 1, wherein the data extraction in step B creates a local copy of the remote repository via the git clone command, and extracts all Git objects in the cloned open-source projects in bulk via Libgit2.
5. The method as claimed in claim 4, wherein the data extraction specifically employs a cluster of 36 nodes, each node having a 16-core Intel E5-2670 CPU and 256GB of memory and running 16 threads; Git's C-language interface Libgit2 is used to list all Git objects in a project, classify them by object type, and extract the content of each object.
6. The code library design method for rapid full-network code traceability detection as claimed in claim 1, wherein in step C the cache database and the content database of each type may specifically be divided into 128 parts for parallel use;
specifically, parallelization by the SHA1 value divides each type of database into 128 parts using the last 7 bits of the first byte of the SHA1 value of the Git object, so that each of the four Git object types has 128 cache databases and 128 content databases, and the commit and tree objects each additionally have 128 random-lookup key-value databases, i.e. 128 × (4 × 2 + 2) databases in total; these databases may be placed on one server to accelerate parallel processing.
7. The code library design method for rapid full-network code traceability detection as claimed in claim 6, wherein the size of a single content database ranges from 20MB for the tag objects to 0.8TB for the blob objects, and the largest single cache database is that of the tree objects, reaching 2GB.
8. The code library design method for rapid full-network code traceability detection as claimed in claim 1, wherein step D constructs the relational mappings with commit at the center, specifically comprising:
constructing the mutual mapping between commits and projects, the relational mappings from commit to author and from commit to time, the relational mapping from author to commit, the mutual mapping between commits and code files (blobs), and the mutual mapping between commits and filenames;
determining the list of projects containing a code file (blob) by combining the blob-to-commit and commit-to-project relationships; determining the creation time of a code file (blob) by combining the blob-to-commit and commit-to-time relationships; determining the author of a code file (blob) by combining the blob-to-commit and commit-to-author relationships;
then constructing the mutual mapping between code files and filenames to support the tracing of specific code fragments;
storing the relational mappings in chunks using the TokyoCabinet database, specifically dividing each type of relational mapping into 32 sub-databases; for commits and code files (blobs), the last 5 bits of the first byte of the SHA1 value are used for partitioning; for authors, projects and filenames, the last 5 bits of the first byte of the FNV-1 hash are used for partitioning.
9. The code library design method for rapid full-network code traceability detection as claimed in claim 1, wherein step E recompiles the Libgit2 library after modifying git fetch by method b, and acquires the newly added Git object data of the remote repository; the specific steps are:
E1. initialize an empty Git repository;
E2. extract the SHA1 values and contents of all branch references of a repository from the built code library and fill them into the empty Git repository;
E3. create a file named after the branch in the .git/refs/heads folder of the empty Git repository, and write the SHA1 value of the commit referenced by the branch into the file;
the data of each newly added Git object is spliced directly into the corresponding content database according to its type and SHA1 value, and its SHA1 value together with its offset and size in the content file is recorded in the cache database, thereby updating the corresponding relational mapping database.
10. The code library design method for rapid full-network code traceability detection as claimed in claim 9, wherein in step E2 the filling is performed as follows:
construct the header of the branch-referenced object, in the format: object type + space + object content length + one null byte;
then splice the header and the original data together, and compress the spliced content using zlib's compress function;
finally, create a subdirectory named after the first 2 characters of the SHA1 value in the .git/objects folder of the empty Git repository, create a file named after the last 38 characters of the SHA1 in that subdirectory, and write the compressed content into this file.
CN202110278117.6A 2021-03-10 2021-03-10 Code base design method and detection method for rapid full-network code traceability detection Active CN112988217B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110278117.6A CN112988217B (en) 2021-03-10 2021-03-10 Code base design method and detection method for rapid full-network code traceability detection


Publications (2)

Publication Number Publication Date
CN112988217A true CN112988217A (en) 2021-06-18
CN112988217B CN112988217B (en) 2023-11-17

Family

ID=76335615

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110278117.6A Active CN112988217B (en) 2021-03-10 2021-03-10 Code base design method and detection method for rapid full-network code traceability detection

Country Status (1)

Country Link
CN (1) CN112988217B (en)



Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107315689A (en) * 2017-07-04 2017-11-03 上海爱数信息技术股份有限公司 The Automation regression testing method of granularity is retrieved based on Git code files
CN108563444A (en) * 2018-03-22 2018-09-21 福州瑞芯微电子股份有限公司 A kind of Android system firmware source code restoring method and storage medium
CN109697362A (en) * 2018-12-13 2019-04-30 西安四叶草信息技术有限公司 Network hole detection method and device
CN109800018A (en) * 2019-01-10 2019-05-24 郑州云海信息技术有限公司 A kind of code administration method and system based on Gerrit
CN110334326A (en) * 2019-09-02 2019-10-15 宁波均胜普瑞智能车联有限公司 A kind of method and system for identifying recipe file and being converted into XML file
CN111753149A (en) * 2020-06-28 2020-10-09 深圳前海微众银行股份有限公司 Sensitive information detection method, device, equipment and storage medium
CN111813412A (en) * 2020-06-28 2020-10-23 中国科学院计算机网络信息中心 Method and system for constructing test data set for evaluating binary code comparison tool
CN111813378A (en) * 2020-07-08 2020-10-23 北京迪力科技有限责任公司 Code base construction system, method and related device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI SUO; WU YIJIAN; ZHAO WENYUN: "Code Provenance Analysis Method Based on Code Clone Detection", Computer Applications and Software, no. 02 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113468529A (en) * 2021-06-30 2021-10-01 建信金融科技有限责任公司 Data searching method and device
CN113468529B (en) * 2021-06-30 2022-08-09 建信金融科技有限责任公司 Data searching method and device
CN113590192A (en) * 2021-09-26 2021-11-02 北京迪力科技有限责任公司 Quality analysis method and related equipment
CN113721978A (en) * 2021-11-02 2021-11-30 北京大学 Method and system for detecting open source component in mixed source software
CN113721978B (en) * 2021-11-02 2022-02-11 北京大学 Method and system for detecting open source component in mixed source software
CN115640324A (en) * 2022-12-23 2023-01-24 深圳开源互联网安全技术有限公司 Information query method, device, terminal equipment and storage medium

Also Published As

Publication number Publication date
CN112988217B (en) 2023-11-17

Similar Documents

Publication Publication Date Title
RU2740865C1 (en) Methods and device for efficient implementation of database supporting fast copying
CN112988217B (en) Code base design method and detection method for rapid full-network code traceability detection
US11475034B2 (en) Schemaless to relational representation conversion
CN111046034B (en) Method and system for managing memory data and maintaining data in memory
US11030187B1 (en) Distributed database systems and structures
CN106663056B (en) Metadata index search in a file system
US11423001B2 (en) Technique of efficiently, comprehensively and autonomously support native JSON datatype in RDBMS for both OLTP and OLAP
US9646030B2 (en) Computer-readable medium storing program and version control method
KR101127304B1 (en) Hsm two-way orphan reconciliation for extremely large file systems
JPH10505440A (en) Programming language-computer-based information access method and apparatus enabling SQL-based manipulation of concrete data files
EP3120261A1 (en) Dependency-aware transaction batching for data replication
WO2017070247A1 (en) Parallel execution of queries with a recursive clause
US11461333B2 (en) Vertical union of feature-based datasets
Kvet et al. Master index access as a data tuple and block locator
Liu et al. CloudETL: scalable dimensional ETL for hadoop and hive
US11907217B2 (en) Database object validation for reusing captured query plans
Aljarallah Comparative study of database modeling approaches
KR20130078594A (en) Apparatus and method for text search using index based on hash function
RU2785613C2 (en) Methods and device for effective implementation of database supporting fast copying
US11803511B2 (en) Methods and systems for ordering operations on a file system having a hierarchical namespace
Low et al. Git Is For Data.
Penuela et al. DELTA: A Modular, Transparent and Efficient Synchronization of DLTs and Databases
US11599520B1 (en) Consistency management using query restrictions in journal-based storage systems
Le Roux et al. DSaaS: A cloud service for persistent data structures
le Roux et al. A Cloud Service for Persistent Data Structures

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant