CN111966718B - System and method for data propagation tracking of application systems

Info

Publication number: CN111966718B
Application number: CN202010938679.4A
Authority: CN
Inventors: 吴云广; 王杰; 王丹; 周刚
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2020-09-09
Filing date: 2020-09-09
Publication date: 2024-03-15
Anticipated expiration: 2040-09-09
Also published as: CN111966718A

Abstract

Embodiments of the present specification provide a system and method for data propagation tracking for application systems. In the system, a code compiling device compiles the program source code of an application system to obtain a code compiling result. A code modeling apparatus performs code modeling using the code compiling result to construct the element information required for taint analysis, the element information including a pollution start point, a pollution end point, and a program entry point. A taint analysis device then performs taint analysis on the code compiling result using the constructed element information to obtain data propagation path information of the application system, the data propagation path information indicating the data flow direction relationship between the pollution start point and the pollution end point.

Description

System and method for data propagation tracking of application systems

Technical Field

Embodiments of the present disclosure relate generally to the field of security, software engineering, software compilation, or program analysis, and more particularly, to systems and methods for data propagation tracking for application systems.

Background

In recent years, industry demand for static taint analysis techniques has grown, particularly for taint analysis tools with high scalability and accuracy. Taint analysis can help the industry track data propagation links and thereby address data problems in many complex scenarios, such as privacy disclosure, asset analysis, change management, and data consistency. How to realize data propagation tracking in an application system has therefore become an urgent problem.

Disclosure of Invention

In view of the foregoing, embodiments of the present disclosure provide a data propagation tracking system and method for an application system. By using the data propagation tracking system and method, taint analysis of the inter-procedural calls of the application system can be performed and the data propagation path information of the accessed data can be obtained, thereby realizing data propagation tracking of the accessed data.

According to one aspect of embodiments of the present specification, there is provided a system for data propagation tracking of an application system, comprising: a code compiling device for compiling the program source code of the application system to obtain a code compiling result; a code modeling apparatus for performing code modeling using the code compiling result to construct the element information required for taint analysis, the element information including a pollution start point, a pollution end point, and a program entry point; and a taint analysis device for performing taint analysis on the code compiling result using the constructed element information to obtain data propagation path information of the application system, wherein the data propagation path information indicates the data flow direction relationship between the pollution start point and the pollution end point.

Optionally, in one example of the above aspect, the data propagation path information is a data flow direction relationship between pairs of fields, and the fields include code fields or database fields.

Optionally, in one example of the above aspect, the system further comprises: a data storage device for storing the data propagation path information of the application system in a database.

Optionally, in one example of the above aspect, the stored data propagation path information is constructed as a data flow graph.

Optionally, in one example of the above aspect, the application system includes a plurality of application systems, and the data flow graph includes a data flow graph across application systems constructed by linking data propagation path information of the plurality of application systems.

Optionally, in one example of the above aspect, the system further comprises: a path information query device for performing, in response to a data propagation path information query request, a data propagation path information query in the database and providing a data propagation path information query result.

Optionally, in one example of the above aspect, the path information query device includes: a path information query interface used by a user to input a path information query request; and a visual presentation unit that visually presents the queried data propagation path information to the user.

Optionally, in one example of the above aspect, the system further comprises: a distributed scheduling device for performing distributed scheduling of the taint analysis tasks of the application system.

Optionally, in one example of the above aspect, the code compiling device further performs supplementary packaging on the code compiling result.

Optionally, in one example of the above aspect, the application framework of the application system is a Sofa framework, and the code modeling apparatus includes: a configuration file scanning unit for scanning the configuration files of the code compiling result to obtain SQL configuration files and class files, and for organizing the class files according to a topological structure to obtain an SOA model topology; an SQL conversion unit for converting the SQL-like statements in the SQL configuration files into parseable SQL statements; an SQL parsing unit for parsing the parseable SQL statements in the converted SQL configuration files into tables and fields; an element construction unit for performing code analysis on the data access layer and the application framework to construct the element information of the code layer; and an association mapping unit for performing association mapping between the element information constructed based on the data access layer and the fields in the parsed SQL statements in the SQL configuration files.

Optionally, in one example of the above aspect, when the application system is a Java-based application system, the code modeling apparatus further includes: a bytecode transformation unit for transforming the bytecode of the code compiling result.

Optionally, in one example of the above aspect, the taint analysis device includes: a control flow graph generation unit for generating a control flow graph from a call relation graph, the call relation graph being constructed from the application layer code in the program code of the application system by using a first call relation construction algorithm; a taint analysis unit for traversing the program code of the application system using the control flow graph to perform taint analysis; an edge relation expanding unit for expanding, by using a second call relation construction algorithm, the edge relation for a call statement in the call relation graph and the control flow graph when the taint analysis result indicates that the call statement has no edge relation in the call relation graph; and a data propagation path information determining unit for determining the data propagation path information of the application system according to the expanded control flow graph.

According to another aspect of embodiments of the present specification, there is provided a method for data propagation tracking of an application system, comprising: compiling the program source code of an application system to obtain a code compiling result; performing code modeling using the code compiling result to construct the element information required for taint analysis, wherein the element information comprises a pollution start point, a pollution end point, and a program entry point; and performing taint analysis on the code compiling result using the constructed element information to obtain data propagation path information of the application system, wherein the data propagation path information indicates the data flow direction relationship between the pollution start point and the pollution end point.

Optionally, in one example of the above aspect, the method further comprises: storing the data propagation path information of the application system in a database.

Optionally, in one example of the above aspect, the data propagation path information is constructed as a data flow graph.

Optionally, in one example of the above aspect, the method further comprises: in response to the data propagation path information query request, performing a data propagation path information query in the database, and providing a data propagation path information query result.

Optionally, in one example of the above aspect, before performing the taint analysis on the code compiling result, the method further includes: performing distributed scheduling of the taint analysis tasks of the application system.

Optionally, in one example of the above aspect, the method further comprises: performing supplementary packaging on the code compiling result before constructing the element information required for taint analysis from the code compiling result.

Optionally, in one example of the above aspect, constructing the element information required for the taint analysis from the code compiling result includes: scanning the configuration files of the code compiling result to obtain SQL configuration files and class files, and organizing the class files according to a topological structure to obtain an SOA model topology; converting the SQL-like statements in the SQL configuration files into parseable SQL statements; parsing the parseable SQL statements in the converted SQL configuration files into tables and fields; performing code analysis on the data access layer and the application framework to construct the element information of the code layer; and performing association mapping between the element information constructed based on the data access layer and the fields in the parsed SQL statements in the SQL configuration files.

Optionally, in one example of the above aspect, performing the taint analysis on the code compiling result using the constructed element information includes: generating a control flow graph from a call relation graph, wherein the call relation graph is constructed from the application layer code in the program code of the application system by using a first call relation construction algorithm; traversing the program code of the application system using the control flow graph to perform taint analysis; when the taint analysis result indicates that a call statement has no edge relation in the call relation graph, expanding the edge relation for the call statement in the call relation graph and the control flow graph by using a second call relation construction algorithm; and determining the data propagation path information of the application system according to the expanded control flow graph.

According to another aspect of embodiments of the present specification, there is provided an electronic device including: at least one processor, and a memory coupled to the at least one processor, the memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the data propagation tracking method as described above.

According to another aspect of embodiments of the present description, there is provided a machine-readable storage medium storing executable instructions that, when executed, cause the machine to perform a data propagation tracking method as described above.

Drawings

A further understanding of the nature and advantages of the present description may be realized by reference to the following drawings. In the drawings, similar components or features may have the same reference numerals.

Fig. 1 shows an example schematic diagram of a private data disclosure process.

FIG. 2 illustrates an example block diagram of a system for implementing data propagation tracking for an application system in accordance with an embodiment of this specification.

FIG. 3 shows a block diagram of one implementation example of a code modeling apparatus according to an embodiment of the present specification.

FIG. 4 shows an example flow chart of a code modeling process according to an embodiment of the present description.

Fig. 5 shows a block diagram of one implementation example of a taint analysis device according to an embodiment of the present disclosure.

FIG. 6 illustrates an example flow chart of a process for data propagation analysis of code compilation results according to an embodiment of the present description.

Fig. 7 shows an example schematic diagram of a process for taint analysis of program code of an application according to an embodiment of the present description.

Fig. 8 shows an exemplary schematic diagram of data propagation path information according to an embodiment of the present specification.

FIG. 9 illustrates an example schematic diagram of a dataflow graph that spans an application system according to an embodiment of the present description.

FIG. 10 illustrates an example flow chart of a method for implementing data propagation tracking for an application system according to an embodiment of this specification.

Fig. 11 shows a schematic diagram of an electronic device for implementing data propagation tracking for an application system according to an embodiment of the present description.

Detailed Description

The subject matter described herein will now be discussed with reference to example embodiments. It should be appreciated that these embodiments are discussed only to enable a person skilled in the art to better understand and thereby practice the subject matter described herein, and are not limiting of the scope, applicability, or examples set forth in the claims. Changes may be made in the function and arrangement of elements discussed without departing from the scope of the disclosure as set forth in the specification. Various examples may omit, replace, or add various procedures or components as desired. For example, the described methods may be performed in a different order than described, and various steps may be added, omitted, or combined. In addition, features described with respect to some examples may be combined in other examples as well.

As used herein, the term "comprising" and variations thereof are open-ended terms meaning "including, but not limited to". The term "based on" means "based at least in part on". The terms "one embodiment" and "an embodiment" mean "at least one embodiment". The term "another embodiment" means "at least one other embodiment". The terms "first," "second," and the like, may refer to different or the same object. Other definitions, whether explicit or implicit, may be included below. Unless the context clearly indicates otherwise, the definition of a term is consistent throughout this specification.

In industrial applications, there are a large number of inter-procedural calls within a single application (e.g., service layer methods call DAO layer interfaces to obtain data from databases), and there are service calls between applications (e.g., service calls made via RPC or REST); that is, data in a single application may be propagated to other applications by way of service calls. If the data is used illegally by the caller, data security problems such as privacy leakage and asset damage may result. Data propagation during the procedure calls of an application system therefore needs to be tracked and analyzed to obtain the data propagation path information of the accessed data, so that data security risks can be found and dealt with in time.

Fig. 1 shows an example schematic diagram of a private data disclosure process. As shown in fig. 1, it is assumed that the data column IDCard in the database owned by the application app_1 is labeled as private data. In response to a remote procedure call from application app_2, the private data of the data column IDCard is retrieved from the database and sent to application app_2 after passing through several conversion layers (POJO conversion layers). The application app_2 further exposes the private data to other applications. Finally, the application app_n obtains the private data, stores it as the data column IDinfo in its own database, and shows it to the user. In this case, if the user of application app_n does not know that the data column IDinfo originates from the private data IDCard, there is a security risk that the private data will be abused.

Taint analysis is widely used for data propagation tracking analysis. It refers to a technique for analyzing how data propagates in a program. Taint analysis is an important means of analyzing privacy disclosure and code vulnerabilities in the field of data security, and is very widely applied in security and software engineering. The taint analysis process mainly comprises three aspects: pollution source marking, pollution propagation rule specification, and taint propagation. Pollution sources are untrusted data, such as user-sensitive data and untrusted external inputs. Pollution propagation rules are inference rules that specify how polluted data spreads based on the semantics of program instructions and functions. For example, given a = source, b = a, and sink = b, pollution of the variable source propagates through a and b and affects the data of sink. Taint analysis techniques include static taint analysis and dynamic taint analysis.
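The assignment example above can be sketched in a few lines of Python. This is only an illustration of explicit propagation through direct assignment; the function name `propagate_taint` and the representation of a program as (target, value) pairs are assumptions made for the sketch and are not part of the described system.

```python
def propagate_taint(assignments, sources):
    """Apply (target, value) assignments in order and return the tainted names.

    A target becomes tainted when the value assigned to it is tainted, and
    becomes clean again when it is overwritten with untainted data.
    """
    tainted = set(sources)
    for target, value in assignments:
        if value in tainted:
            tainted.add(target)
        else:
            tainted.discard(target)
    return tainted


# The example from the text: a = source, b = a, sink = b
program = [("a", "source"), ("b", "a"), ("sink", "b")]
print("sink" in propagate_taint(program, {"source"}))
```

The sketch handles direct assignment only; propagation through calls and aliases would need additional rules.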

Taint analysis involves three elements: a pollution start point (Source), a pollution end point (Sink), and a program analysis entry (Entry Point). In the taint analysis process, a call graph (Call Graph) of the calls between procedures (functions) needs to be constructed starting from the program analysis entry. The Call Graph represents the call relationships between procedures (functions) in a computer program. The nodes in the Call Graph are methods in the program code, and the edges in the Call Graph represent the call relationships between methods. Examples of taint analysis tools include the static taint analysis tools FlowDroid and P/Taint (based on Doop). In FlowDroid-based taint analysis of an application system, the object of the taint analysis is the source code or an intermediate representation of the program, whereby the static analysis of explicit flows in taint propagation can be translated into an analysis of static data dependencies in the program.
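As a rough illustration of the Call Graph structure described above, the following Python sketch stores methods as nodes and call relationships as directed edges, and computes the methods reachable from a program entry point. The class `CallGraph` and the method names used in the example are hypothetical.

```python
from collections import defaultdict, deque


class CallGraph:
    """Nodes are method names; a directed edge caller -> callee is a call."""

    def __init__(self):
        self.edges = defaultdict(set)

    def add_call(self, caller, callee):
        self.edges[caller].add(callee)

    def reachable_from(self, entry_point):
        """Return every method reachable from the entry point (inclusive)."""
        seen = {entry_point}
        queue = deque([entry_point])
        while queue:
            method = queue.popleft()
            for callee in self.edges.get(method, ()):
                if callee not in seen:
                    seen.add(callee)
                    queue.append(callee)
        return seen


# Hypothetical methods: a service method calls a DAO method.
graph = CallGraph()
graph.add_call("UserService.query", "UserDAO.selectById")
graph.add_call("UserDAO.selectById", "SqlClient.execute")
graph.add_call("AdminService.purge", "UserDAO.deleteAll")
print(sorted(graph.reachable_from("UserService.query")))
```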

When performing taint analysis, a call graph (Call Graph) is first constructed for all program code of the application program according to the function call relationships in the program. Specific taint analysis is then performed between functions or within a function, according to the program characteristics involved. Examples of explicit taint propagation may include, but are not limited to, propagation through direct assignment, propagation through function (procedure) calls, and propagation through aliases (pointers).

The term "taint analysis" in a narrow sense refers to taint analysis of specific data of interest. In this specification, the term "taint analysis" should be interpreted broadly as taint analysis of all data referred to by the program code, or of all accessed data. Furthermore, in this specification, the term "pollution" may be used interchangeably with "data dissemination". In addition, in the present specification, the term "application system" may also be understood as "application", "application program", or "system in which the application program is installed".

A system and method for implementing data propagation tracking for an application system according to embodiments of the present specification will be described in detail below with reference to the accompanying drawings.

Fig. 2 shows an example block diagram of a system (hereinafter referred to as a "data propagation tracking system") 200 for implementing data propagation tracking for an application system according to an embodiment of the present specification.

As shown in fig. 2, the data propagation tracking system 200 includes a code compiling device 210. The code compiling device 210 is configured to compile the program source code of an application system (e.g., held in a code repository) to obtain a code compiling result. In the present specification, the data propagation path information is constructed by performing static taint analysis on the program code of the application system. The object of the static taint analysis is the code compiling result (for example, intermediate code) obtained by compiling the program source code of the application system. For example, for program code implemented in Java, the code compiling result is a compiled jar package, which contains the bytecode obtained by compiling the Java source code. In static taint analysis based on the FlowDroid framework, the object of the static taint analysis is Jimple code, an intermediate code between the source code and the bytecode. In the FlowDroid framework, Java bytecode is converted into Jimple code by using the Soot framework.

Further, optionally, in one example, when the program source code of the application system is compiled, supplementary packaging may additionally be performed on the code compiling result, so as to add to the code compiling result program code that the static taint analysis requires. In one example, the compiling behavior of the code compiling device 210 may be modified so that the jar packages required for the static taint analysis are added to the code compiling result. For example, static taint analysis may find that a call targets the CE (the underlying container of the Sofa framework), which is not normally packaged into the jar; supplementary packaging is therefore needed to force such strongly depended-on code into the jar package, thereby preventing breaks in the data flow analysis.

Further, optionally, in one example, the code compiling device 210 may perform on-demand compilation to convert program code into intermediate code or bytecode. Here, the term "on-demand compilation" means that the compilation object of the code compiling device 210 is specified on demand; that is, when the code compiling device 210 performs code compilation, it is specified which system, which branch of that system, and which submitted program source code (identified by COMMITID) is to be compiled.

The data propagation tracking system 200 further comprises a code modeling apparatus 220. The code modeling apparatus 220 is configured to perform code modeling using the code compiling result to construct the element information required for taint analysis, including a pollution start point (Source), a pollution end point (Sink), and a program entry point. All the element information required for the taint analysis can be constructed using the code modeling apparatus 220.

In one example of the present specification, the input data may be considered a contamination start point and the output data may be considered a contamination end point. Examples of input data may include: input parameters of the program entry point, return values of the remote procedure call, fields retrievable by the database. Examples of output data may include: the return value of the program entry point, the parameters of the remote procedure call, and the fields that can be saved to the database. In one example of the present specification, the fields include code fields or database fields. Further, in one example, after the processing by the code modeling apparatus 220, an association mapping relationship is established between the element information of the code field type and the element information of the database field type.
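The classification of input data as pollution start points and output data as pollution end points can be sketched as follows. This is a minimal illustration under assumed names: `ElementInfo`, `build_element_info`, and the `#params` / `#return` suffixes are hypothetical conventions chosen for the sketch, not identifiers from the described system.

```python
from dataclasses import dataclass, field


@dataclass
class ElementInfo:
    sources: set = field(default_factory=set)
    sinks: set = field(default_factory=set)
    entry_points: set = field(default_factory=set)


def build_element_info(entry_points, rpc_calls, db_read_fields, db_write_fields):
    """Collect sources (input data) and sinks (output data) as described above."""
    info = ElementInfo()
    for method in entry_points:
        info.entry_points.add(method)
        info.sources.add(f"{method}#params")   # input parameters of the entry point
        info.sinks.add(f"{method}#return")     # return value of the entry point
    for call in rpc_calls:
        info.sources.add(f"{call}#return")     # return value of a remote call
        info.sinks.add(f"{call}#params")       # parameters passed to a remote call
    info.sources.update(db_read_fields)        # fields read from the database
    info.sinks.update(db_write_fields)         # fields saved to the database
    return info


info = build_element_info(
    entry_points=["UserFacade.query"],
    rpc_calls=["RiskClient.check"],
    db_read_fields=["user.id_card"],
    db_write_fields=["audit.id_info"],
)
print(sorted(info.sources))
```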

Fig. 3 shows a block diagram of one implementation example of a code modeling apparatus 300 according to an embodiment of the present specification. As shown in fig. 3, the code modeling apparatus 300 includes a configuration file scanning unit 310, an SQL conversion unit 320, an SQL parsing unit 330, an element construction unit 340, and an association mapping unit 350.

The configuration file scanning unit 310 is configured to perform configuration file scanning on the code compiling result to obtain SQL configuration files and class files, and to organize the class files according to a topological structure to obtain an SOA model topology. For example, under the Sofa framework based on a Java implementation, the configuration file scanning unit 310 may be configured to scan all XML configuration files and annotations in the code compiling result, thereby yielding all SQLmap configuration files and all Java beans (i.e., class files). The resulting Java beans may be organized together topologically, for example by using BeanToplogy, resulting in a Service Oriented Architecture (SOA) model topology. An SOA is a component model that splits an application into different functional units (called services) and links these services through well-defined interfaces and protocols. The resulting SQLmap configuration files and Java beans may be loaded into memory for subsequent analysis. In one example of the present specification, the application framework of the application system may be a Sofa framework. The Sofa framework is an enhancement of the Spring framework and is backward compatible with the Spring framework.

The SQL conversion unit 320 is configured to convert the SQL-like statements in the SQL configuration file into parseable SQL statements. In embodiments of the present description, the term "SQL-like statement" refers to an SQL statement that cannot be parsed or executed directly.

Optionally, in one example, under the Sofa framework based on a Java implementation, the iBATIS/MyBatis API may be used to convert the SQLmap configuration file into iBATIS/MyBatis in-memory objects. Dynamic tag features in the SQLmap configuration file, such as if tags, are then processed to extract the variables that are used. The objects through which parameters are passed to the data access layer (DAO layer) are then transformed to build a parameter set, thereby converting the SQL-like statements in the SQLmap configuration file into SQL statements that can be directly executed or parsed by the database.
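The conversion step can be illustrated with a simplified Python sketch. It does not use the iBATIS/MyBatis API mentioned above; instead it assumes, purely for illustration, that a regular expression is enough to drop dynamic tags and to replace iBATIS-style `#name#` and MyBatis-style `#{name}` placeholders with `?`. The function name `to_parseable_sql` is hypothetical.

```python
import re

# Opening or closing XML tags such as <isNotEmpty property="name"> or </if>.
TAG = re.compile(r"</?[A-Za-z][^>]*>")
# MyBatis-style #{name} or iBATIS-style #name# parameter placeholders.
PLACEHOLDER = re.compile(r"#\{\s*([\w.]+)[^}]*\}|#([\w.]+)#")


def to_parseable_sql(sql_like):
    """Return (sql, params): plain SQL with '?' markers and the variables used.

    Dynamic tags are removed but their bodies are kept, so every conditional
    fragment is included (an over-approximation suitable for static analysis).
    """
    params = []

    def replace(match):
        params.append(match.group(1) or match.group(2))
        return "?"

    sql = TAG.sub(" ", sql_like)
    sql = PLACEHOLDER.sub(replace, sql)
    return " ".join(sql.split()), params


statement = (
    'SELECT id_card FROM user WHERE id = #id# '
    '<isNotEmpty property="name">AND name = #name#</isNotEmpty>'
)
print(to_parseable_sql(statement))
```

Keeping the body of every dynamic tag means the resulting statement covers all fields that any runtime variant could touch, at the cost of possibly producing SQL that no single execution would issue.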

After the SQL conversion as above, the SQL parsing unit 330 is configured to parse the resolvable SQL statements in the converted SQL configuration file into tables and fields.
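A deliberately small sketch of the parsing step is given below. It assumes single-table SELECT, INSERT, and UPDATE statements and uses regular expressions rather than a full SQL parser; the function name `parse_tables_and_fields` and the "read"/"write" labels are assumptions of the sketch.

```python
import re


def parse_tables_and_fields(sql):
    """Return (access, table, fields) for a simple single-table statement."""
    text = " ".join(sql.split())
    match = re.match(r"select (.+?) from (\w+)", text, re.IGNORECASE)
    if match:
        fields = [f.strip() for f in match.group(1).split(",")]
        return "read", match.group(2), fields
    match = re.match(r"insert into (\w+) ?\((.+?)\)", text, re.IGNORECASE)
    if match:
        fields = [f.strip() for f in match.group(2).split(",")]
        return "write", match.group(1), fields
    match = re.match(r"update (\w+) set (.+?)(?: where .*)?$", text, re.IGNORECASE)
    if match:
        fields = [a.split("=")[0].strip() for a in match.group(2).split(",")]
        return "write", match.group(1), fields
    raise ValueError(f"unsupported statement: {sql}")


print(parse_tables_and_fields("SELECT id_card, name FROM user WHERE id = ?"))
```

Fields that are read can then serve as pollution start points, and fields that are written as pollution end points.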

The element construction unit 340 is configured to perform code analysis on the data access layer and the application framework, and to construct the element information of the code layer. For example, under the Sofa framework based on a Java implementation, the DAO layer code may be analyzed to obtain the taint source/sink of the DAO layer code. Further, the services published and the services referenced in the Sofa framework are scanned (a published service is called a sofa service, and a referenced service is called a sofa reference) to obtain the program Entry Point and the source/sink of the interface layer code (called by upstream application systems) and of the call layer code (which calls downstream application systems), thereby constructing the element information of the code layer.

The association mapping unit 350 is configured to perform association mapping on element information constructed based on the data access layer and fields in the parsed SQL statement in the SQL configuration file, thereby establishing an association relationship between the DAO layer source/sink field and the database field.
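One way to picture the association mapping is as a lookup from DAO layer object properties to the table columns obtained from the parsed SQL. The sketch below assumes a property-to-column result map is available; the function name `associate` and the example property and column names are hypothetical.

```python
def associate(dao_properties, result_map, table, columns):
    """Map DAO-layer properties to database fields.

    dao_properties: property names of the DAO-layer data object
    result_map:     property name -> column name (assumed to be known)
    table, columns: table and fields obtained from the parsed SQL statement
    """
    mapping = {}
    for prop in dao_properties:
        column = result_map.get(prop)
        if column in columns:
            mapping[prop] = f"{table}.{column}"
    return mapping


print(associate(
    ["idCard", "name", "age"],
    {"idCard": "id_card", "name": "name"},
    "user",
    ["id_card", "name"],
))
```

Properties without a matching column (here `age`) are simply left unmapped.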

Further, optionally, when the application system is a Java-based application system, the code modeling apparatus 300 may further include a bytecode transformation unit (not shown). The bytecode transformation unit transforms the bytecode of the code compiling result, thereby avoiding breaks in the data flow analysis.

FIG. 4 illustrates an example flow diagram of a code modeling process 400 according to an embodiment of this specification.

As shown in fig. 4, at 410, the code compilation results are scanned for configuration files, resulting in SQL configuration files and class files, and the class files are organized by topology to obtain a model topology.

At 420, the SQL-like statements in the SQL configuration file are converted into parseable SQL statements; and at 430, the parseable SQL statements in the converted SQL configuration file are parsed into tables and fields.

At 440, code analysis is performed on the data access layer and the application framework to construct element information for the code layer. Then, at 450, element information constructed based on the data access layer is mapped in association with fields in the parsed SQL statement in the SQL configuration file, thereby establishing an association relationship between the code DAO layer source/sink field and the database field.

Returning to FIG. 2, the data propagation tracking system 200 also includes a taint analysis device 230. The taint analysis device 230 is configured to perform taint analysis on the code compiling result using the constructed element information, thereby obtaining data propagation path information of the application system. Here, the data propagation path information is used to indicate the data flow direction relationship between the pollution start point and the pollution end point. In one example of the present specification, the data propagation path information is a data flow direction relationship between pairs of fields, and the fields include code fields or database fields. A code field is a field in an object of the program code and is composed of "app_name", "service_name", "method_signature", "class_name", and "field_name", e.g. a.service1.method1.requestclass. A database field is composed of "app_name", "db_name", "table_name", and "column_name", e.g., c.db.table1.Id. In the example shown in fig. 1, the data propagation path information may be, for example, db.idcard→do.idcard→bo.idcard.
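Because the path information is a set of field pairs, the paths of several application systems can be linked into one graph simply by merging the pairs: a field that is a pollution end point in one application and a pollution start point in another becomes a shared node. The following Python sketch illustrates this under assumed names; `build_data_flow_graph`, `downstream_fields`, and the field identifiers are hypothetical.

```python
from collections import defaultdict, deque


def build_data_flow_graph(*per_app_paths):
    """Merge (source_field, sink_field) pairs from several applications."""
    graph = defaultdict(set)
    for paths in per_app_paths:
        for src_field, dst_field in paths:
            graph[src_field].add(dst_field)
    return graph


def downstream_fields(graph, start_field):
    """Return every field that data from start_field can flow into."""
    seen = set()
    queue = deque([start_field])
    while queue:
        current = queue.popleft()
        for nxt in graph.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical field pairs for two applications linked by a service call.
app_1_paths = [("app_1.db.user.IDCard", "app_1.UserService.query.UserDO.idCard")]
app_2_paths = [("app_1.UserService.query.UserDO.idCard", "app_2.db.info.IDinfo")]
graph = build_data_flow_graph(app_1_paths, app_2_paths)
print(sorted(downstream_fields(graph, "app_1.db.user.IDCard")))
```

A query such as "where does the private column IDCard end up" then reduces to a graph traversal from the corresponding database field.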

In one example of the present specification, the taint analysis may be implemented using a FlowDroid-based taint analysis tool. In another example, the taint analysis may be implemented using a modified FlowDroid-based taint analysis method.

Fig. 5 shows a block diagram of one implementation example of a taint analysis device 500 according to an embodiment of the present disclosure. As shown in fig. 5, the taint analysis device 500 includes a control flow graph generation unit 510, a taint analysis unit 520, an edge relationship expansion unit 530, and a data propagation path information determining unit 540.

The control flow graph generation unit 510 is configured to generate a control flow graph from a call relation graph that is constructed from the application layer code in the program code of the application system by using a first call relation construction algorithm. A control flow graph is an abstract representation of a procedure, typically used in compilers and static analysis, that represents all paths a program may traverse during execution. In embodiments of the present description, the control flow graph may also include inter-procedural control flows, such as call flows and return flows. The nodes in the control flow graph may be statements or basic blocks in the program code, and the edges represent the flow of control between nodes. In addition, when the first call relation construction algorithm is selected, only the precision of the algorithm needs to be considered; for example, a high-precision algorithm such as the Spark algorithm can be selected, and the performance of the algorithm need not be a concern.

The taint analysis unit 520 is configured to use the control flow graph to traverse the program code of the application system for taint analysis.

In performing the taint analysis, an inter-procedural control flow graph (ICFG) is first constructed based on the initial Call Graph. Subsequently, the taint propagation (i.e., the data propagation) is computed based on the ICFG. When a call statement is encountered, it is checked whether the call statement has an edge relationship in the initial Call Graph. If an edge relationship exists, the computation continues downward.

The edge relationship expansion unit 530 is configured to, when a call statement is encountered that has no edge relationship in the Call Graph, expand an edge relationship for that call statement in both the Call Graph and the control flow graph using a second call relationship construction algorithm. The second call relationship construction algorithm is less precise than the first call relationship construction algorithm, but performs better. An example of the second call relationship construction algorithm is the CHA (class hierarchy analysis) algorithm.
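The on-demand expansion described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the `CallGraph` class, the `cha_targets` and `resolve_call` helpers, and the dictionary-based class hierarchy (`subclasses`, `declares`) are hypothetical, and the CHA step is reduced to "every subtype of the declared receiver type that declares the method is a possible target" (a full CHA also handles inherited implementations).

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class CallGraph:
    """Call-site -> callee edges, initially built by the precise algorithm."""

    def __init__(self) -> None:
        self.edges: Dict[str, Set[str]] = defaultdict(set)

    def add_edge(self, call_site: str, callee: str) -> None:
        self.edges[call_site].add(callee)

    def callees(self, call_site: str) -> Set[str]:
        return set(self.edges.get(call_site, ()))


def cha_targets(declared_type: str, method: str,
                subclasses: Dict[str, Iterable[str]],
                declares: Dict[str, Set[str]]) -> Set[str]:
    """Simplified CHA: walk the subtype tree of the declared receiver type."""
    targets: Set[str] = set()
    seen: Set[str] = set()
    worklist = [declared_type]
    while worklist:
        t = worklist.pop()
        if t in seen:
            continue
        seen.add(t)
        if method in declares.get(t, set()):
            targets.add(f"{t}.{method}")
        worklist.extend(subclasses.get(t, ()))
    return targets


def resolve_call(cg: CallGraph, call_site: str, declared_type: str, method: str,
                 subclasses: Dict[str, Iterable[str]],
                 declares: Dict[str, Set[str]]) -> Set[str]:
    """Use existing precise edges if present; otherwise expand with CHA."""
    existing = cg.callees(call_site)
    if existing:
        return existing
    for target in cha_targets(declared_type, method, subclasses, declares):
        cg.add_edge(call_site, target)  # the missing edge is repaired in place
    return cg.callees(call_site)
```

The design point this illustrates is that the cheaper, less precise algorithm only runs for call sites the precise graph failed to cover, so precision is kept where it exists and recall is recovered where it does not.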

The data propagation path information determining unit 540 is configured to determine the data propagation path information of the application system according to the expanded control flow graph. In the embodiments of the present specification, a data propagation path is a path from a contamination start point to a contamination end point, such as the data propagation path x.f = source() → sink(b.f) shown in FIG. 7.

FIG. 6 illustrates an example flow diagram of a process 600 for data propagation analysis of code compilation results, according to an embodiment of this disclosure.

As shown in FIG. 6, at 610, an initial call relationship graph (Call Graph) is constructed from the application layer code in the program code of the application system using a first call relationship construction algorithm. Subsequently, at 620, a control flow graph is generated from the initial call relationship graph.

After the control flow graph is generated as above, the control flow graph is used to traverse the program code of the application system for taint analysis at 630. At 640, when the taint analysis result indicates that the call statement does not have an edge relationship in the call relationship graph, a second call relationship construction algorithm is used to expand edge relationships for the call statement in the call relationship graph and the control flow graph.

Then, at 650, data propagation path information for the application system is determined from the expanded control flow graph.

In FIG. 7, the leftmost diagram is an initial Call Graph constructed based on main() and foo(). In the Call Graph, main(), foo(), source(), and sink() are nodes, and a line between two nodes represents an edge. As shown in FIG. 7, there are edge relationships between main() and foo(), between main() and sink(), and between foo() and source().

The middle diagram is a control flow graph, also known as an inter-procedural control flow graph (ICFG). In this example, x = new X(), x.f = source(), return x, b = foo(a), and sink(b.f) are nodes. There are edge relationships between b = foo(a) and each of x = new X(), return x, and sink(b.f), as well as between x = new X() and x.f = source(), and between x.f = source() and return x.

The rightmost diagram is a data flow graph of the procedure calls of the application system. In one example of the present specification, the nodes in the data flow graph are fields, and the edges are data flows between fields, i.e., data propagation directions. In one example, the fields may include code fields or database fields.
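To make the FIG. 7 example concrete, the following toy sketch propagates taint along the statements x = new X(), x.f = source(), return x (into b), and sink(b.f). It is a deliberately simplified, straight-line propagation over access paths and is not the ICFG-based algorithm used by FlowDroid or by the embodiments; the statement encoding, the `propagate` function, and the modeling of "return x into b" as an assignment b = x are assumptions made for illustration.

```python
from typing import List, Set, Tuple

Stmt = Tuple[str, ...]


def is_tainted(path: str, tainted: Set[str]) -> bool:
    """A path is tainted if it, or an object prefix of it, is tainted."""
    return any(path == t or path.startswith(t + ".") for t in tainted)


def propagate(stmts: List[Stmt]) -> Tuple[Set[str], List[str]]:
    """Walk statements in order; return (tainted access paths, sinks reached)."""
    tainted: Set[str] = set()
    hits: List[str] = []
    for stmt in stmts:
        op = stmt[0]
        if op == "source":            # ("source", "x.f") models x.f = source()
            tainted.add(stmt[1])
        elif op == "assign":          # ("assign", "b", "x") models b = x
            dst, src = stmt[1], stmt[2]
            for t in list(tainted):
                if t == src or t.startswith(src + "."):
                    # Rebase the tainted path from src onto dst, e.g. x.f -> b.f.
                    tainted.add(dst + t[len(src):])
        elif op == "sink":            # ("sink", "b.f") models sink(b.f)
            if is_tainted(stmt[1], tainted):
                hits.append(stmt[1])
        # Other statements (e.g. "new") do not affect taint in this toy model.
    return tainted, hits


# FIG. 7 flattened along the call and return edges of the ICFG.
FIG7 = [("new", "x"), ("source", "x.f"), ("assign", "b", "x"), ("sink", "b.f")]
```

Running `propagate(FIG7)` reports that sink(b.f) is reached by data originating at x.f = source(), which corresponds to the x.f → b.f data flow between fields. The sketch never removes taint on reassignment, so it over-approximates in longer programs.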

With the above taint analysis method, a small-scale Call Graph is constructed only for the application layer code and the necessary part of the library code in the program code, and taint analysis is performed on this small-scale Call Graph. This greatly reduces the workload of the taint analysis and safeguards its performance. An efficient and highly accurate taint analysis scheme can thus be provided for large-scale enterprise applications, especially where the use of native methods, libraries, and frameworks introduces a large number of implicit dependencies.

In addition, in the above taint analysis method, the Call Graph is constructed by applying the high-precision first call relationship construction algorithm to the application layer code, which improves the overall accuracy of the constructed Call Graph. Furthermore, the second call relationship construction algorithm, which has relatively lower precision but better performance, is used to expand edge relationships in the Call Graph and the control flow graph, so that missing edges can be repaired efficiently and the recall rate is preserved.

FIG. 8 shows an exemplary schematic diagram of data propagation path information according to an embodiment of the present specification. The data propagation path information numbered 001 is obtained by performing taint analysis on application system A, that numbered 002 is obtained by performing taint analysis on application system B, and that numbered 003 is obtained by performing taint analysis on application system C.

Further optionally, in one example, the data propagation tracking system 200 may also include a data storage device 240. The data storage device 240 stores the data propagation path information of the application system in a database. Further optionally, in one example, the stored data propagation path information may be constructed as a data flow graph.

For example, after the data propagation path information of each individual application system is obtained as described above, it is stored in a relational database. It is then synchronized to an offline data warehouse and finally synchronized to a graph database, thereby obtaining the data flow graph. The data flow graph is graph data composed of data propagation path information. When the application system includes a plurality of application systems, the data flow graph includes a cross-application data flow graph constructed by linking the data propagation path information of the plurality of application systems. The resulting cross-application data flow graph can be applied to scenarios such as data leakage, change management, and data consistency checking.

FIG. 9 illustrates an example schematic diagram of a data flow graph across application systems according to an embodiment of the present specification. After the data propagation path information for application systems A, B, and C in FIG. 8 is obtained, the individual pieces of data propagation path information may be linked, thereby obtaining a data flow graph across the application systems. This graph reveals how data flows among the respective application systems.

In the data flow graph example shown in FIG. 9, four nodes N₁ to N₄ are included, where node N₁ represents "A.service1.method1.RequestClass.buyerid", node N₂ represents "B.service2.method2.RequestClass.priplialid", node N₃ represents "C.service3.method3.RequestClass.id", and node N₄ represents "c.db.table1.Id".
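Linking per-application path information into a cross-application graph can be sketched as a merge of edge lists keyed by field identifier. The function name `link_paths`, the in-memory adjacency dictionary (standing in for the graph database), and the specific edges between N1 to N4 used in the usage note are assumptions; the text does not enumerate the edges of FIG. 8 or FIG. 9.

```python
from collections import defaultdict
from typing import Dict, List, Set


def link_paths(per_app_paths: Dict[str, List[List[str]]]) -> Dict[str, Set[str]]:
    """Merge each application's propagation paths into one adjacency map.

    Paths from different applications become linked wherever they share a
    field identifier, which is what joins the per-application results.
    """
    graph: Dict[str, Set[str]] = defaultdict(set)
    for paths in per_app_paths.values():
        for path in paths:
            for src, dst in zip(path, path[1:]):
                graph[src].add(dst)
    return dict(graph)
```

For instance, with hypothetical inputs `{"A": [["N1", "N2"]], "B": [["N2", "N3"]], "C": [["N3", "N4"]]}`, the shared nodes N2 and N3 connect the three applications into a single chain from N1 to N4.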

Further optionally, in one example, the data propagation tracking system 200 may also include a path information querying device 250. The path information query means 250 is configured to perform a data propagation path information query in the database in response to a data propagation path information query request, and to provide a data propagation path information query result, for example, to a user who issued the query request.

Optionally, in one example, the path information query apparatus 250 may include a path information query interface and a visual presentation unit. The path information query interface is used by a user to input a path information query request. For example, the path information query interface may be implemented as an API interface. The visual presentation unit is configured to present the queried data propagation path information to a user in a visual manner.
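A path information query of the kind described above can be illustrated as a reachability search over the stored data flow graph. This is only a sketch: the function name `downstream`, the adjacency-dictionary representation, and the choice of breadth-first search are assumptions, and a real deployment would presumably issue an equivalent query against the graph database.

```python
from collections import deque
from typing import Dict, List, Set


def downstream(graph: Dict[str, Set[str]], start: str) -> List[str]:
    """Return every field reachable from `start`, in breadth-first order."""
    seen = {start}
    order: List[str] = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        # Sort neighbours so the result is deterministic.
        for nxt in sorted(graph.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order
```

Given a hypothetical graph `{"N1": {"N2"}, "N2": {"N3"}, "N3": {"N4"}}`, a query starting at N2 lists N3 and N4 as the fields that data in N2 can flow into, which is the information a visual presentation unit could then render.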

Further optionally, in one example, the data propagation tracking system 200 may also include a distributed scheduler 260. The distributed scheduler 260 is configured to perform distributed scheduling of the taint analysis tasks of the application system.

Optionally, in one example, the distributed scheduler 260 may employ a two-layer distributed scheduling policy, in which the first layer is the application and the second layer is the slice. For example, suppose the number of application systems to be analyzed is 1000. The distributed scheduler 260 first collects N applications for first-layer distribution and distributes them to N servers, ensuring as far as possible that each server receives one application. Each server then runs the code modeling process. At the final stage of the code modeling process, slicing is performed. The purpose of slicing is to divide the code into multiple parts according to its complexity so as to minimize hot spots. Static taint analysis is very memory-intensive, and once a hot spot occurs, i.e., the path tracking of a few fields is very complex, the server can easily run out of memory. Different application systems may therefore be cut into different numbers of slices according to their respective code complexity. The distributed scheduler 260 then performs second-layer distribution, dispatching the slices to N servers and ensuring as far as possible that each server receives one slice. Then, the taint analysis device 230 is started to perform field-based static taint analysis and static code tracking. When a server completes its analysis, the distributed scheduler 260 schedules the next slice in the same manner, and this continues until all analyses are completed.

In one example, two queues may be maintained, an application queue and a slice queue. The applications and slices to be analyzed are loaded into the application queue and the slice queue in FIFO order, and the distributed scheduler 260 takes applications and slices from these queues for distributed scheduling. Further, the distributed scheduling of the distributed scheduler 260 may employ a pipeline mechanism.
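The two-layer policy with FIFO queues can be sketched as a small single-process simulation. Everything concrete here is an assumption made for illustration: the function names `slice_app` and `schedule`, the use of an integer "complexity" score, the `max_per_slice` threshold that decides how many slices an application gets, and round-robin assignment as one way to spread work "as evenly as possible".

```python
from collections import deque
from typing import Dict, List, Tuple


def slice_app(app: str, complexity: int, max_per_slice: int) -> List[str]:
    """Cut an application into ceil(complexity / max_per_slice) slices (at least one)."""
    count = max(1, -(-complexity // max_per_slice))
    return [f"{app}#slice{i}" for i in range(count)]


def schedule(apps: Dict[str, int], n_servers: int,
             max_per_slice: int = 100) -> Tuple[List[Tuple[int, str]], List[Tuple[int, str]]]:
    """Return (first-layer, second-layer) assignments as (server index, task) pairs."""
    app_queue = deque(apps.items())      # FIFO queue of (application, complexity)
    slice_queue: deque = deque()         # FIFO queue of slices produced by modeling
    modeling: List[Tuple[int, str]] = []
    i = 0
    while app_queue:                     # layer 1: one application per server, round-robin
        app, complexity = app_queue.popleft()
        modeling.append((i % n_servers, app))
        slice_queue.extend(slice_app(app, complexity, max_per_slice))
        i += 1
    analysis: List[Tuple[int, str]] = []
    j = 0
    while slice_queue:                   # layer 2: one slice per server, round-robin
        analysis.append((j % n_servers, slice_queue.popleft()))
        j += 1
    return modeling, analysis
```

In this sketch a more complex application simply yields more slices, so its analysis is spread over more scheduling rounds instead of concentrating memory pressure on one server.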

Further optionally, in one example, the distributed scheduler 260 adopts a policy of distributing applications and slices to the servers as evenly as possible. Nevertheless, two applications or two slices are sometimes dispatched to the same server. Because the analysis is very memory- and CPU-intensive, a server cannot be allowed to run two tasks at the same time, so a single-machine lock mechanism is adopted in the distributed scheduling process: if a new analysis task is distributed to a server whose current analysis has not finished, the server skips the new task, and the skipped application or slice is placed in a cache queue. As a result, many tasks may be skipped without undergoing taint analysis. In view of this, a replay thread may also be added to the distributed scheduling process. Specifically, if a server has been idle for a period of time (i.e., the distributed scheduler has not distributed an analysis task to the idle server), the replay thread is actively invoked to pull the skipped tasks out of the cache queue one by one and analyze them, until the cache queue is empty.
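The single-machine lock and replay behavior can be sketched with a single-threaded simulation. The `Server` class and the `dispatch` and `replay` functions are hypothetical names, and the replay "thread" is modeled as a plain function call made once the server is idle; real code would need actual locking and a timer to detect idleness.

```python
from collections import deque
from typing import Deque, List, Optional


class Server:
    """A server that may run at most one analysis task at a time."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.current: Optional[str] = None
        self.done: List[str] = []

    def try_start(self, task: str) -> bool:
        if self.current is not None:     # single-machine lock: already busy
            return False
        self.current = task
        return True

    def finish(self) -> None:
        if self.current is not None:
            self.done.append(self.current)
            self.current = None


def dispatch(server: Server, task: str, cache_queue: Deque[str]) -> bool:
    """Start the task, or skip it into the cache queue if the server is busy."""
    if server.try_start(task):
        return True
    cache_queue.append(task)
    return False


def replay(server: Server, cache_queue: Deque[str]) -> None:
    """Drain skipped tasks one by one; intended to run when the server is idle."""
    while cache_queue:
        task = cache_queue.popleft()
        if not server.try_start(task):
            cache_queue.appendleft(task)  # still busy: put it back and stop
            break
        server.finish()                   # the simulation completes the task at once
```

The cache queue ensures that a skipped task is deferred rather than lost, which is the gap the replay mechanism is meant to close.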

A data propagation tracking system according to an embodiment of the present specification is described above with reference to FIGS. 1 to 9. By using the data propagation tracking system, taint analysis of the inter-procedural calls of the application system can be performed, and the data propagation path information of the accessed data can be obtained, so that the data flow of the accessed data can be tracked.

Fig. 10 shows an example flowchart of a method 1000 for implementing data propagation tracking for an application system (hereinafter referred to as a "data propagation tracking method") according to an embodiment of the present specification.

As shown in FIG. 10, at 1010, the program source code of an application system is compiled to obtain a code compilation result.

At 1020, code modeling is performed using the code compilation results to construct element information required for the taint analysis, including a contamination start point, a contamination end point, and a program entry point.

At 1030, taint analysis is performed on the code compilation result using the constructed element information to obtain data propagation path information of the application system, the data propagation path information being used to indicate a data flow direction relationship between a contamination start point and a contamination end point. Optionally, in one example, the data propagation path information is a data flow relationship between pairs of fields, and the fields include code fields or database fields.

At 1040, the data propagation path information for the application system is stored in a database. In one example, the stored data propagation path information may be constructed as a dataflow graph. Where the application system includes a plurality of application systems, the constructed dataflow graph may include a dataflow graph that spans the application systems constructed by linking data propagation path information of the plurality of application systems.

At 1050, in response to the data propagation path information query request, a data propagation path information query is performed in the database and data propagation path information query results are provided.

Further, optionally, before the taint analysis is performed on the code compilation result, the data propagation tracking method may further include: performing distributed scheduling of the taint analysis tasks of the application system.

Further, optionally, before the element information required for the taint analysis is constructed according to the code compilation result, the data propagation tracking method may further include: performing package supplementing processing on the code compilation result.

Further, it is noted that what is shown in fig. 10 is merely an example embodiment, and in other embodiments of the present description, one or both of the operations of 1040 and 1050 may not be included.

The data propagation tracking method and the data propagation tracking apparatus according to the embodiments of the present specification have been described above with reference to FIGS. 1 to 10. The above data propagation tracking apparatus may be implemented in hardware, in software, or in a combination of hardware and software.

FIG. 11 shows a schematic diagram of an electronic device 1100 for implementing data propagation tracking of an application system according to an embodiment of the present specification. As shown in FIG. 11, the electronic device 1100 may include at least one processor 1110, a storage (e.g., non-volatile storage) 1120, a memory 1130, and a communication interface 1140, which are connected together via a bus 1160. The at least one processor 1110 executes at least one computer-readable instruction (i.e., the elements described above as being implemented in software) stored or encoded in the memory.

In one embodiment, computer-executable instructions are stored in the memory that, when executed, cause the at least one processor 1110 to: compile the program source code of an application system to obtain a code compilation result; perform code modeling using the code compilation result to construct element information required for taint analysis, the element information including a contamination start point, a contamination end point, and a program entry point; and perform taint analysis on the code compilation result using the constructed element information to obtain data propagation path information of the application system, the data propagation path information being used to indicate a data flow direction relationship between a contamination start point and a contamination end point.

It should be appreciated that the computer-executable instructions stored in the memory, when executed, cause the at least one processor 1110 to perform the various operations and functions described above in connection with fig. 1-10 in various embodiments of the present specification.

According to one embodiment, a program product such as a machine-readable medium (e.g., a non-transitory machine-readable medium) is provided. The machine-readable medium may have instructions (i.e., elements described above implemented in software) that, when executed by a machine, cause the machine to perform the various operations and functions described above in connection with fig. 1-10 in various embodiments of the specification. In particular, a system or apparatus provided with a readable storage medium having stored thereon software program code implementing the functions of any of the above embodiments may be provided, and a computer or processor of the system or apparatus may be caused to read out and execute instructions stored in the readable storage medium.

In this case, the program code itself read from the readable medium may implement the functions of any of the above-described embodiments, and thus the machine-readable code and the readable storage medium storing the machine-readable code form part of the present invention.

Examples of readable storage media include floppy disks, hard disks, magneto-optical disks, optical disks (e.g., CD-ROMs, CD-R, CD-RWs, DVD-ROMs, DVD-RAMs, DVD-RWs), magnetic tapes, nonvolatile memory cards, and ROMs. Alternatively, the program code may be downloaded from a server computer or cloud by a communications network.

It will be appreciated by those skilled in the art that various changes and modifications can be made to the embodiments disclosed above without departing from the spirit of the invention. Accordingly, the scope of the invention should be limited only by the attached claims.

It should be noted that not all the steps and units in the above flowcharts and the system configuration diagrams are necessary, and some steps or units may be omitted according to actual needs. The order of execution of the steps is not fixed and may be determined as desired. The apparatus structures described in the above embodiments may be physical structures or logical structures, that is, some units may be implemented by the same physical entity, or some units may be implemented by multiple physical entities, or may be implemented jointly by some components in multiple independent devices.

In the above embodiments, the hardware units or modules may be implemented mechanically or electrically. For example, a hardware unit, module, or processor may include permanently dedicated circuitry or logic (e.g., a dedicated processor, FPGA, or ASIC) to perform the corresponding operations. The hardware unit or processor may also include programmable logic or circuitry (e.g., a general purpose processor or other programmable processor) that may be temporarily configured by software to perform the corresponding operations. The particular implementation (mechanical, permanently dedicated circuitry, or temporarily configured circuitry) may be determined based on cost and time considerations.

The detailed description set forth above in connection with the appended drawings describes exemplary embodiments, but does not represent all embodiments that may be implemented or that fall within the scope of the claims. The term "exemplary" used throughout this specification means "serving as an example, instance, or illustration," and does not mean "preferred" or "advantageous over other embodiments." The detailed description includes specific details for the purpose of providing an understanding of the described technology. However, the techniques may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the concepts of the described embodiments.

The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the scope of the disclosure. Thus, the disclosure is not limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A system for data propagation tracking of an application system, comprising:

a code compiling device that compiles the program source code of the application system to obtain a code compilation result;

a code modeling device that performs code modeling using the code compilation result to construct element information required for taint analysis, the element information including a contamination start point, a contamination end point, and a program entry point; and

a taint analysis device that performs taint analysis on the code compilation result using the constructed element information to obtain data propagation path information of the application system, the data propagation path information being used to indicate a data flow direction relationship between a contamination start point and a contamination end point,

wherein the code modeling device includes:

a model topology creating unit that scans the configuration file of the code compilation result to obtain an SQL configuration file and a class file, and organizes the class file according to a topological structure to obtain an SOA model topology;

an SQL conversion unit that converts the SQL-like statements in the SQL configuration file into parsable SQL statements;

an SQL parsing unit that parses the parsable SQL statements in the converted SQL configuration file into tables and fields;

an element construction unit that performs code analysis on the data access layer and the application framework to construct element information of the code layer; and

an association mapping unit that performs association mapping between the element information constructed based on the data access layer and the fields in the parsed SQL statements in the SQL configuration file, and

wherein the taint analysis device includes:

a control flow graph generation unit that generates a control flow graph from a call relationship graph, the call relationship graph being constructed from the application layer code in the program code of the application system by using a first call relationship construction algorithm;

a taint analysis unit that uses the control flow graph to traverse the program code of the application system for taint analysis;

an edge relationship expansion unit that, when the taint analysis result indicates that a call statement does not have an edge relationship in the call relationship graph, expands an edge relationship for the call statement in the call relationship graph and the control flow graph by using a second call relationship construction algorithm; and

a data propagation path information determining unit that determines the data propagation path information of the application system according to the expanded control flow graph.

2. The system of claim 1, wherein the data propagation path information is a data flow relationship between pairs of fields, and the fields include code fields or database fields.

3. The system of claim 1 or 2, further comprising:

a data storage device that stores the data propagation path information of the application system in a database.

4. The system of claim 3, wherein the stored data propagation path information is constructed as a data flow graph.

5. The system of claim 4, wherein the application system comprises a plurality of application systems, and the dataflow graph comprises a dataflow graph that spans application systems that is constructed by linking data propagation path information of the plurality of application systems.

6. The system of claim 3, further comprising:

a path information query device that, in response to a data propagation path information query request, performs a data propagation path information query in the database and provides a data propagation path information query result.

7. The system of claim 6, wherein the path information query means comprises:

a path information query interface used by a user to input a path information query request; and

a visual presentation unit that visually presents the queried data propagation path information to the user.

8. The system of claim 1, further comprising:

a distributed scheduling device that performs distributed scheduling of the taint analysis tasks of the application system.

9. The system of claim 1, wherein the code compiling device further performs package supplementing processing on the code compilation result.

10. The system of claim 1, wherein when the application system is implemented in Java, the code modeling device further comprises:

a bytecode transformation unit that transforms the bytecode of the code compilation result.

11. A method for data propagation tracking of an application system, comprising:

compiling the program source code of an application system to obtain a code compilation result;

performing code modeling using the code compilation result to construct element information required for taint analysis, the element information including a contamination start point, a contamination end point, and a program entry point; and

performing taint analysis on the code compilation result using the constructed element information to obtain data propagation path information of the application system, the data propagation path information being used to indicate a data flow direction relationship between a contamination start point and a contamination end point,

wherein constructing the element information required for taint analysis according to the code compilation result includes:

scanning the configuration file of the code compilation result to obtain an SQL configuration file and a class file, and organizing the class file according to a topological structure to obtain an SOA model topology;

converting the SQL-like statements in the SQL configuration file into parsable SQL statements;

parsing the parsable SQL statements in the converted SQL configuration file into tables and fields;

performing code analysis on the data access layer and the application framework to construct element information of the code layer; and

performing association mapping between the element information constructed based on the data access layer and the fields in the parsed SQL statements in the SQL configuration file, and

wherein performing taint analysis on the code compilation result using the constructed element information includes:

generating a control flow graph from a call relationship graph, the call relationship graph being constructed from the application layer code in the program code of the application system by using a first call relationship construction algorithm;

traversing the program code of the application system using the control flow graph for taint analysis;

when the taint analysis result indicates that a call statement does not have an edge relationship in the call relationship graph, expanding an edge relationship for the call statement in the call relationship graph and the control flow graph by using a second call relationship construction algorithm; and

determining the data propagation path information of the application system according to the expanded control flow graph.

12. The method of claim 11, wherein the data propagation path information is a data flow relationship between pairs of fields, and the fields include code fields or database fields.

13. The method of claim 11 or 12, further comprising:

storing the data propagation path information of the application system in a database.

14. The method of claim 13, wherein the data propagation path information is constructed as a data flow graph.

15. The method of claim 14, wherein the application system comprises a plurality of application systems, and the dataflow graph comprises a dataflow graph that spans application systems that is constructed by linking data propagation path information of the plurality of application systems.

16. The method of claim 13, further comprising:

in response to the data propagation path information query request, performing a data propagation path information query in the database, and providing a data propagation path information query result.

17. The method of claim 11, further comprising, prior to performing the taint analysis on the code compilation result:

performing distributed scheduling of the taint analysis tasks of the application system.

18. The method of claim 11, further comprising:

performing package supplementing processing on the code compilation result before constructing the element information required for taint analysis according to the code compilation result.

19. An electronic device, comprising:

at least one processor, and

a memory coupled to the at least one processor, the memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the method of any of claims 11 to 18.

20. A machine readable storage medium storing executable instructions that when executed cause the machine to perform the method of any of claims 11 to 18.