CN118092885B - Code framework method based on a front-end and back-end plug-in architecture - Google Patents

Code framework method based on a front-end and back-end plug-in architecture

Info

Publication number
CN118092885B
Authority
CN
China
Prior art keywords
code
data
source code
execution
source
Prior art date
Legal status
Active
Application number
CN202410467526.4A
Other languages
Chinese (zh)
Other versions
CN118092885A (en)
Inventor
张煇
李龙
杨勇
成志伟
Current Assignee
Changhe Information Co., Ltd.
Beijing Changhe Digital Intelligence Technology Co., Ltd.
Original Assignee
Changhe Information Co., Ltd.
Beijing Changhe Digital Intelligence Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Changhe Information Co., Ltd. and Beijing Changhe Digital Intelligence Technology Co., Ltd.
Priority to CN202410467526.4A
Publication of CN118092885A
Application granted
Publication of CN118092885B
Legal status: Active
Anticipated expiration


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a code framework method based on a front-end and back-end plug-in architecture, which relates to the field of code development and comprises the following steps: generating source code; submitting the source code to the distributed version control system Git for storage and management; slicing the source code with a program slicing method; storing the sliced source code and the corresponding metadata in a relational database; mirroring the metadata from the relational database into an in-memory database; constructing a sandboxed execution engine based on a virtualization method, loading the source code, and obtaining the corresponding metadata with a least-recently-used (LRU) algorithm; and, after assembling the obtained source code and metadata, executing the source code in the sandboxed execution engine. The sandboxed execution engine adopts the Apache Spark distributed computing framework, and a checkpoint mechanism and a data-replica strategy protect the data and the execution state during execution. Aimed at the problem of low code-development efficiency in the prior art, the application improves code-development efficiency.

Description

Code framework method based on a front-end and back-end plug-in architecture
Technical Field
The application relates to the technical field of code development, and in particular to a code framework method based on a front-end and back-end plug-in architecture.
Background
In recent years, the software industry has continued to evolve rapidly, and enterprises place ever higher demands on software delivery speed. However, the traditional code-development model suffers from low efficiency, long development cycles, and a lack of effective coordination in multi-person collaboration, making it difficult to meet enterprises' need for rapid iteration. How to improve software development efficiency has become a pressing problem.
Conventional software development processes are often burdened by cumbersome operations, inefficient tools, and complex management systems, resulting in low code-development efficiency. Developers spend a great deal of time on repetitive, mechanical work: manually writing large amounts of repetitive code, handling version conflicts, building and maintaining independent development environments, and so on, all of which seriously affect the efficiency and quality of software development.
In the related art, Chinese patent document CN115878095A provides a low-code development method, apparatus, device, and medium based on logic arrangement. The method relates to the field of application development and includes: when the component state is "design" and a script-editor expansion instruction is received, displaying a script editor comprising a preset grammar-prompt tree, a grammar help area, and a code editor in a preset interface, and storing the logic-programming script generated from the logic-programming code received by the code editor; when the component state is "running", determining a target caller based on the logic-programming code, using the target caller to invoke the target engine corresponding to the logic-programming script, generating a syntax tree from the logic-programming script, executing the syntax tree with the target engine, and returning the execution result to the target caller, where the target engine is a back-end or front-end script engine. Although logic arrangement lowers the programming barrier, implementing complex logic may require a large number of drag-and-connect operations, and a graphical programming style may not be intuitive for expressing complex logic, which reduces development efficiency.
Disclosure of Invention
1. Technical problem to be solved
Aiming at the problem of low code-development efficiency in the prior art, the application provides a code framework method based on a front-end and back-end plug-in architecture, which generates source code through a script editor, performs storage and management with a distributed version control system, and so on, thereby improving code-development efficiency.
2. Technical solution
The aim of the application is achieved by the following technical solution.
The embodiments of this specification provide a code framework method based on a front-end and back-end plug-in architecture, comprising the following steps: developing code with a script editor to generate source code; submitting the generated source code to a distributed version control system for storage and management; slicing the source code with a program slicing method; loading the sliced source code into a relational database for storage; storing the metadata corresponding to the source code in the relational database; and mirroring the metadata from the relational database into an in-memory database. The relational database and the in-memory database synchronize data through asynchronous, log-shipping replication, as follows: when metadata in the relational database changes, the change log is transmitted asynchronously to the in-memory database; after receiving the metadata change log, the in-memory database updates the metadata in memory; when metadata in the in-memory database changes, the change log is transmitted asynchronously to the relational database; after receiving the metadata change log, the relational database updates the metadata in the database.
A sandboxed execution engine is constructed based on a virtualization method; it loads source code from the relational database and obtains the corresponding metadata from the in-memory database with a least-recently-used (LRU) algorithm, specifically: the sandboxed execution engine batch-loads a certain amount of source code locally from the relational database according to the position of the source code to be executed; when executing the source code, it obtains the metadata corresponding to the source code from the in-memory database via the LRU algorithm, using the metadata index in the source code; if the metadata corresponding to the source code is not found in the in-memory database, the metadata is loaded from the relational database and cached in the in-memory database, evicting the least recently used metadata there. After the source code loaded from the relational database and the metadata obtained from the in-memory database are assembled, the code is executed in the sandboxed execution engine. The sandboxed execution engine adopts the Apache Spark distributed computing framework, in which a checkpoint mechanism and a data-replica strategy protect the data and execution state during execution: during Spark task execution, intermediate results of RDD datasets are periodically persisted to reliable storage through the checkpoint mechanism; multiple replicas of the partition data in the RDD datasets are created and distributed to different nodes to prevent data loss from single-point failures; and the task execution state is periodically persisted to reliable storage through the checkpoint mechanism so that execution can resume from a checkpoint when a task fails.
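By way of illustration, the LRU metadata cache described above can be sketched in a few lines of Python. This is a minimal sketch, not the patent's implementation; the load_from_db callback is a hypothetical stand-in for the relational-database loader.

from collections import OrderedDict

class MetadataLRUCache:
    # LRU cache over metadata entries: hits move to the most recently
    # used end; misses are loaded from the relational database and the
    # least recently used entry is evicted.
    def __init__(self, capacity, load_from_db):
        self.capacity = capacity
        self.load_from_db = load_from_db
        self.cache = OrderedDict()               # key -> metadata, oldest first

    def get(self, metadata_key):
        if metadata_key in self.cache:
            self.cache.move_to_end(metadata_key)    # cache hit
            return self.cache[metadata_key]
        metadata = self.load_from_db(metadata_key)  # cache miss
        self.cache[metadata_key] = metadata
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)          # evict the LRU entry
        return metadata

cache = MetadataLRUCache(capacity=2, load_from_db=lambda k: {"id": k})
print(cache.get("fragment-001"))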
A distributed version control system is a code version management system that allows multiple developers to work on the same project in different code repositories and to synchronize their respective modifications with one another. Git is one such system. In this application, managing source code with a distributed version control system eases multi-person collaborative development: code modifications by different developers can be shared and integrated quickly and transparently, and redundant backups of the code repository can be maintained. The distributed version control system may also be Mercurial or Bazaar.
Metadata is data that describes data; it records characteristics, attributes, contents, and other information about the data. In code management, metadata is information describing source code, such as function names, parameters, and variable types. In this application, metadata makes code easier to read and its purpose easier to understand. Metadata must be loaded when executing code in order to parse the code. Metadata also facilitates retrieval and management of code, such as looking up code by function name, and can be used for code reuse, such as finding code fragments with particular semantics.
Specifically, mirroring the metadata from the relational database into the in-memory database may proceed as follows. A timed task queries the latest metadata from the relational database at a fixed interval (for example, every 10 minutes), writes it into the relevant table or key of the in-memory database, and overwrites the old data there. The in-memory database stores metadata as key-value pairs, where the key may be the name or identifier of the metadata and the value is the corresponding metadata content. A trigger may also be established that automatically synchronizes changes to the in-memory database when metadata in the relational database changes. Alternatively, a message queue may be used: when the relational-database metadata changes, a message is sent, and a consumer obtains the change from the queue and updates the in-memory database. The caching mechanism of the in-memory database speeds up metadata retrieval, and if the in-memory database goes down, the metadata can be reloaded directly from the relational database for recovery.
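For concreteness, the timed-task variant of this mirroring can be sketched as follows. The connection settings and the metadata table schema are illustrative assumptions (the patent fixes neither); redis-py and psycopg2 are assumed as client libraries.

import json
import time
import redis
import psycopg2

SYNC_INTERVAL_SECONDS = 600      # the 10-minute interval mentioned above

def mirror_metadata_once(pg_conn, r):
    # Query the latest metadata from the relational database and
    # overwrite the corresponding keys in the in-memory database.
    with pg_conn.cursor() as cur:
        cur.execute("SELECT name, content FROM metadata")   # assumed table
        for name, content in cur.fetchall():
            r.set("meta:" + name, json.dumps(content))      # key -> content

def run_timed_mirror():
    pg_conn = psycopg2.connect("dbname=codeframe")          # assumed DSN
    r = redis.Redis(host="localhost", port=6379)
    while True:
        mirror_metadata_once(pg_conn, r)
        time.sleep(SYNC_INTERVAL_SECONDS)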
A virtualization method is a technique for creating a virtual version of something instead of a real one. In a computer system, virtualization can present virtualized physical hardware resources to the user. Common virtualization techniques include hardware-assisted virtualization and operating-system-level virtualization. In this application, the sandboxed execution engine is built with a virtualization method so that the complete runtime environment for code execution is simulated by virtualization rather than running the code directly on a physical machine.
A sandbox is a virtual, controlled, isolated environment. The sandboxed execution engine uses the sandbox mechanism to run code in an isolated environment, controlling and limiting its execution and preventing erroneous or malicious code from damaging the system. In this application, it creates a secure, controllable environment for executing untrusted code; it isolates code execution's access to computing resources, the file system, and so on; when a problem occurs, the sandbox can be destroyed without affecting the system; and it provides a degree of controllable resource limits on CPU, memory, and the like.
Specifically, asynchronous log-shipping replication is a technique for synchronizing data between databases. The source database records a log of data changes, and the target database periodically pulls the changes recorded in the log and applies them, thereby replicating data from the source database. The process is asynchronous: transaction commits in the source database do not wait for replication to complete. Here, the relational database records a data change log, and the in-memory database periodically requests and pulls the log, parses it, and applies the logged data, replicating the data from the relational database. The synchronization of the two databases is asynchronous and does not affect the transaction performance of the relational database. Data synchronization means keeping two or more data sets consistent so that the data in all copies stays updated in step: changes to metadata in the relational database are synchronized to the in-memory database, changes to the source code in the relational database are synchronized to the in-memory database's cache, and modifications of metadata in the in-memory database are synchronized back to the relational database.
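The incremental, log-pulling side of this replication can be sketched in the same spirit, reusing the connections from the previous sketch. The metadata_change_log table with a monotonically increasing log_id is an illustrative assumption; the point is that the consumer pulls only entries past its high-water mark and applies them asynchronously.

import json

def pull_and_apply_changes(pg_conn, r, last_applied_id):
    # Pull change-log entries newer than last_applied_id, apply them
    # to the in-memory database, and return the new high-water mark.
    with pg_conn.cursor() as cur:
        cur.execute(
            "SELECT log_id, meta_name, new_content "
            "FROM metadata_change_log WHERE log_id > %s ORDER BY log_id",
            (last_applied_id,),
        )
        for log_id, meta_name, new_content in cur.fetchall():
            r.set("meta:" + meta_name, json.dumps(new_content))
            last_applied_id = log_id
    return last_applied_id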
Specifically, in this application, front-end plug-in development means developing code with a script editor, generating source code, and submitting it to the distributed version control system Git for storage and management. This embodies the plug-in nature of front-end development: independent development, version management, and collaborative work on the front-end code are achieved through the script editor and the version control system.
Back-end plug-in execution: a sandboxed execution engine is constructed based on a virtualization method; it loads source code from the relational database, obtains the corresponding metadata from the in-memory database with the LRU algorithm, and executes the source code in the sandboxed environment. This embodies the plug-in nature of back-end execution: secure isolation, dynamic loading, and execution of back-end code are achieved through the sandboxed execution engine and database storage.
Code slicing and assembly: the source code is sliced with a program slicing method; the sliced source code is stored in the relational database, and the metadata corresponding to the source code is stored in the relational database and the in-memory database. At execution time, the source code and metadata are loaded from the databases, assembled, and executed in the sandboxed execution engine. This embodies plug-in management of code: modularity, reusability, and dynamic composition of code are achieved through slicing and assembly.
Distributed computing framework: the sandboxed execution engine adopts the Apache Spark distributed computing framework, using distributed computation and fault-tolerance mechanisms to improve the performance and reliability of code execution. This embodies the fusion of the distributed computing framework with the code framework: high concurrency, high availability, and scalability of code execution are achieved by introducing the distributed computing framework.
Database plug-in: the source code and metadata are stored in a relational database, and the metadata is mirrored into an in-memory database, achieving both persistence and fast access. Distributed storage and load balancing for the in-memory database are achieved through a Redis cluster and a consistent-hashing algorithm. This embodies the plug-in nature of the database layer: the combination of the relational database and the in-memory database provides diverse, high-performance data storage.
Further, multiple code repositories are provided, each corresponding to one source-code project; multiple version-management engine instances are provided, each handling version management for one code repository; and a mirror synchronization mechanism is set up between the code repositories for bidirectional data synchronization, specifically: a source-code project is submitted to the primary code repository; the primary repository synchronizes newly submitted source code to the other mirror repositories; after a mirror repository receives the synchronized source code, it updates its local source-code version; when the source code of a mirror repository is modified, the modification is synchronized back to the primary repository; after the primary repository receives a source-code modification from a mirror repository, it updates its local source-code version and synchronizes the modification to the other mirror repositories. The version-management engine instances cooperate in a distributed primary-standby mode: one primary version-management engine instance and several standby instances are provided; source code is submitted to the primary instance for version management; the primary instance synchronizes the version-management log to the standby instances; and when the primary instance fails, a standby instance is promoted to become the new primary and continues to provide version-management service.
A code repository is a repository for storing and managing source code. Each repository may contain all the source code and resource files of a project, and it is an integral part of the version control system. In this application, each code repository corresponds to one source-code project, achieving isolation between projects. Developers submit source code to the corresponding repository according to project requirements. Combined with the version control system, the repository manages the project's development history and multiple development branches. Repositories can be synchronized bidirectionally through the mirror mechanism, forming redundant code backups. They make parallel development of multiple projects possible, improving source-code development efficiency, and because they are divided by project, code access rights can be managed per project. The repositories provide an atomic management unit for slicing, storing, and executing the source code; they play the role of project-level source-code management in this scheme, making code management more modular and efficient.
An engine instance is a single runtime process of an engine in a software system. In a distributed system architecture, multiple engine instances can be started to perform tasks together, improving performance and availability. In this application, each version-management engine instance manages version control for one code repository, providing functions such as committing versions, maintaining history, and checking out code. Multiple engine instances can run in parallel, divided by repository, improving version-management efficiency. Each independent instance is responsible only for its own repository, decoupling tasks and isolating risk: when a single instance fails, the others are unaffected, improving system availability. The number of instances can be scaled elastically on demand, scaling version-management capacity horizontally. Engine instances thus decouple and parallelize distributed version-management tasks, improving the performance, availability, and scalability of version management.
The mirror synchronization mechanism maintains real-time data consistency between two or more data sets through a mirroring relationship: when one data set is updated, the change is propagated synchronously to its mirrors, and mirroring can be bidirectional. In this application, the code repositories are arranged in a mirroring relationship, and one repository can have several mirror repositories. When the source code in one repository changes, the change is automatically synchronized to the mirror repositories, whose source code stays consistent with that of the original repository. Synchronization can run in both directions, with the repositories backing each other up, and it is incremental, synchronizing only the changed content, which improves efficiency. Mirror synchronization provides redundant backup of the repositories, improving their availability and fault tolerance, and a mirror repository can also serve as a load-sharing target for code-store accesses. The mechanism thus synchronizes data between repositories and helps improve the reliability and performance of code management.
The primary-standby (active-standby) mode is a common high-availability solution. In this mode, the system starts one primary node and at least one standby node at the same time; the primary node handles all tasks, while the standby node only synchronizes the primary node's data changes. If the primary node fails, the standby node can quickly take over and continue to provide service, achieving fast failover. In this application, each code repository deploys one primary engine instance responsible for version management and at least one corresponding standby instance that synchronizes the primary's data. When the primary instance fails, the standby instance immediately takes over the version-management task. The primary and standby instances stay consistent through data synchronization and can be deployed on different physical machines to improve fault tolerance, with dynamic monitoring and automatic switchover of the primary and standby roles. The primary-standby mode thus provides high availability for version-management tasks, avoiding single points of failure and improving the reliability of the system.
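A toy sketch of this failover logic follows; the EngineInstance stub and its method names are hypothetical, since the patent does not specify the engine's API.

class EngineInstance:
    # Toy stand-in for a version-management engine instance.
    def __init__(self, name):
        self.name, self.log, self.alive = name, [], True
    def apply(self, change):
        self.log.append(change)
    def receive_log(self, entries):
        self.log = list(entries)            # synchronize the primary's log
    def is_alive(self):
        return self.alive

class VersionEngineCluster:
    # Writes go to the primary, which ships its version-management log
    # to every standby; on failure a standby is promoted to primary.
    def __init__(self, primary, standbys):
        self.primary, self.standbys = primary, list(standbys)
    def commit(self, change):
        self.primary.apply(change)
        for standby in self.standbys:
            standby.receive_log(self.primary.log)
    def check_failover(self):
        if not self.primary.is_alive() and self.standbys:
            self.primary = self.standbys.pop(0)   # promote a standby

cluster = VersionEngineCluster(EngineInstance("primary"),
                               [EngineInstance("standby-1")])
cluster.commit("commit 1: add module A")
cluster.primary.alive = False
cluster.check_failover()
print(cluster.primary.name, cluster.primary.log)   # standby took over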
Further, the mirror synchronization mechanism between code repositories performs bidirectional data synchronization as follows. Synchronization flow of repository A: obtain the timestamp (timestamp A) of source code A in repository A; obtain the hash value (hash A) of the current version of repository A; send timestamp A and hash A to repository B; receive timestamp B and hash B of source code B sent by repository B; compare the received timestamp B and hash B with the local timestamp A and hash A; when the timestamps are the same but the hash values differ, conclude that the source code has been modified and request repository B to send the modified source code B; receive the modified source code B sent by repository B; and update source code A in repository A with the received source code B.
Synchronization flow of repository B: obtain the timestamp (timestamp B) of source code B in repository B; obtain the hash value (hash B) of the current version of repository B; receive timestamp A and hash A of source code A sent by repository A; compare the received timestamp A and hash A with the local timestamp B and hash B; when the timestamps are the same but the hash values differ, conclude that the source code has been modified and send the modified source code B to repository A; on receiving repository A's request for the modified source code B, send it to repository A; receive timestamp A and hash A of source code A sent by repository A; compare them again with the local timestamp B and hash B; when the timestamps differ, or the hash values differ, conclude that the source code has been modified and request repository A to send the modified source code A; receive the modified source code A sent by repository A; and update source code B in repository B with the received source code A.
Through the above flow, bidirectional synchronization of source code is achieved by comparing timestamps and hash values between repository A and repository B. The specific data flow is: repository A sends timestamp A and hash A to repository B; repository B sends timestamp B and hash B to repository A; when repository A determines that the source code has been modified, it requests repository B to send the modified source code B, and repository B sends it; when repository B determines that the source code has been modified, it either actively sends the modified source code B to repository A or requests repository A to send the modified source code A, which repository A then sends. Through this bidirectional data exchange and synchronization, repository A and repository B always keep their source-code content consistent.
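The two comparison rules in this flow reduce to small predicates, sketched below with illustrative field names.

from dataclasses import dataclass

@dataclass
class RepoVersion:
    # Version snapshot exchanged between mirrored repositories.
    timestamp: int
    content_hash: str

def modified_first_pass(local, remote):
    # First comparison: the same timestamp but a different hash is
    # treated as a modification of the source code.
    return (local.timestamp == remote.timestamp
            and local.content_hash != remote.content_hash)

def modified_second_pass(local, remote):
    # Later comparison on repository B's side: any timestamp or hash
    # difference counts as a modification.
    return (local.timestamp != remote.timestamp
            or local.content_hash != remote.content_hash)

a = RepoVersion(100, "abc123")
b = RepoVersion(100, "def456")
print(modified_first_pass(a, b))    # True -> request the modified code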
Specifically, in a distributed version control system, each code repository computes hash values for the source-code files and directories it manages, in order to uniquely identify the source-code content. The hash values are computed as follows. Hashing a source-code file: the content of the file is read and hashed with a hash algorithm (such as SHA-1 or SHA-256) to obtain the file's hash value, typically a fixed-length string that uniquely identifies the file content. Hashing a source-code directory: all files and subdirectories under the directory are traversed, the hash values of the files are computed, and the hash values of everything under the directory are combined in a defined order to obtain the directory's hash value, likewise a fixed-length string that uniquely identifies the directory's content and structure. Hashing the whole code repository: the repository's root directory is treated as a special directory and hashed, yielding the hash value of the whole repository, which represents the state and content of the entire repository. In the repository, each commit operation generates a new version and computes a hash value for it, typically based on: the hash values of the committed source-code files and directories; the commit's metadata (committer, commit time, commit message, and so on); and the hash value of the parent version (the previous commit's hash). In this way each version has a unique hash value that can be used to identify and track the version's change history.
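This file/directory/commit hashing scheme can be sketched directly: a simplified, Git-like tree hash using SHA-256, not Git's exact object format.

import hashlib
from pathlib import Path

def hash_file(path):
    # Hash one source file's content.
    return hashlib.sha256(path.read_bytes()).hexdigest()

def hash_directory(path):
    # Combine entry hashes in sorted (defined) order so the result
    # uniquely reflects both content and structure.
    h = hashlib.sha256()
    for entry in sorted(path.iterdir()):
        h.update(entry.name.encode())
        if entry.is_dir():
            h.update(hash_directory(entry).encode())
        else:
            h.update(hash_file(entry).encode())
    return h.hexdigest()

def hash_commit(repo_root, metadata, parent_hash):
    # Version hash from the tree hash, the commit metadata, and the
    # parent version's hash - the three inputs listed above.
    h = hashlib.sha256()
    h.update(hash_directory(Path(repo_root)).encode())
    h.update(metadata.encode())
    h.update(parent_hash.encode())
    return h.hexdigest()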
Further, slicing the source code with the program slicing method includes: constructing a control-flow graph of the source code through syntax analysis, the control-flow graph representing the execution flow of the source code; traversing the control-flow graph and inserting global-variable monitoring code at the position where each global-scope variable first appears in the source code, where a global-scope variable is one defined in the global scope of the source code, not confined to any code block or function, and accessible or modifiable from any code position throughout the program's run, and the monitoring code records changes in the variable's value during execution; executing the instrumented source code and obtaining, at run time, the value changes of all program variables and the control-flow paths taken, where the value-change data includes variable names, values, types, and scopes, and the control-flow path data includes the branches, loops, and function calls through which execution passes; analyzing the dependency relationships among the variables in the source code from the value-change data and the control-flow path data, where the dependencies include data dependencies (the value of one variable is affected by the value of another) and control dependencies (the value of a variable is affected by the outcome of a conditional statement on the control-flow path); using these dependencies to slice the source code into multiple independent, reusable code fragments, each containing a set of interdependent variables and statements, with as few dependencies as possible between different fragments so as to improve their independence and reusability; and storing the sliced code fragments in a relational database. The slicing pipeline is as follows: the source code is parsed to generate control-flow-graph data representing its execution flow; the control-flow-graph data guides where the global-variable monitoring code is inserted; the instrumented source code is executed to generate variable value-change data and control-flow path data; these are analyzed to produce the dependency data among the variables; the dependency data guides the slicing of the source code into independent, reusable code fragments; and the fragment data is stored in the relational database.
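The global-variable monitoring step can be illustrated with Python's tracing facility; real instrumentation would be inserted into the source itself, so this is only a behavioral sketch.

import sys

value_trace = []    # (line, variable, value) records for later analysis

def make_tracer(watched_globals):
    last_seen = {}
    def tracer(frame, event, arg):
        if event == "line":
            for name in watched_globals:
                if name in frame.f_globals:
                    value = frame.f_globals[name]
                    if last_seen.get(name) != value:
                        # Record a value change of a global variable.
                        value_trace.append((frame.f_lineno, name, value))
                        last_seen[name] = value
        return tracer
    return tracer

source = "counter = 0\nfor i in range(3):\n    counter += i\n"
sys.settrace(make_tracer({"counter"}))
exec(compile(source, "<sliced>", "exec"), {})
sys.settrace(None)
print(value_trace)   # the value-change data fed to dependency analysis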
A control-flow graph is a directed graph representing a program's runtime flow and structure. Each node represents a basic block, a series of sequentially executed instructions in the program, and the directed edges represent the transfer of control between basic blocks. In this application, the graph represents the execution logic and paths of the code; it is used to analyze the code's entry and exit points, to find loop structures in the code, and, via the control flows, to analyze the dependencies between different code fragments. Control-flow graphs can also be used for code optimization, program analysis, and so on.
A global-scope variable is defined in the global scope of the source code in a programming language; it is not confined to any code block or function scope and can be accessed throughout the program's run. First-appearance position: the code location where the global variable is first defined or declared in the source. Global-variable monitoring code: a code fragment that tracks and records changes in a global variable's value during program execution, typically inserted at the variable's first appearance. A variable is a named symbol under which a computer program stores a data value; the value can be modified by the program. In this application, the defining position of each global variable is found by analyzing the source code's control-flow graph, monitoring code is inserted at its first appearance to record value changes, and the monitoring code outputs the variable's value-change information for analysis.
A control-flow path is the code path actually traversed during program execution. Because of conditional branches, loops, and the like, the actual path traversed may differ between execution instances. In this application, the actual code path taken is tracked and recorded while the code executes; the control-flow path reflects the specific logical branches taken by a running instance. Compared with the set of possible paths (the control-flow graph), code coverage can be measured, and the path data can be used for code optimization, error detection, and so on. Combined with global-variable monitoring, the changes of variable values along different code paths can be analyzed.
Specifically, the key data-flow paths in the code can be analyzed from the source code's dependency graph. Code modules with strong dependencies have close logical relationships and should be placed in one slice; modules with weak dependencies are loosely connected and can be divided into different slices. Reachability analysis is performed over the dependencies, and strongly dependent code is consolidated to reduce interaction between slices. The slice granularity is chosen according to actual requirements, ensuring cohesion while drawing correct boundaries.
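One simple realization of this grouping, sketched under the assumption that the dependency analysis yields variable pairs, is to take connected components of the dependency graph as slices.

from collections import defaultdict

def slice_by_dependencies(dependencies):
    # dependencies: (var_a, var_b) pairs from the data/control
    # dependency analysis; each connected component becomes a slice,
    # so strongly related code stays together and cross-slice
    # dependencies are minimized.
    graph = defaultdict(set)
    for a, b in dependencies:
        graph[a].add(b)
        graph[b].add(a)
    slices, seen = [], set()
    for start in list(graph):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:                      # iterative DFS
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        slices.append(component)
    return slices

print(slice_by_dependencies([("x", "y"), ("y", "z"), ("a", "b")]))
# -> two slices: {x, y, z} and {a, b}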
Further, constructing the control-flow graph includes: analyzing the source code with a recursive-descent algorithm and obtaining the source code's syntax structures and their corresponding code offset addresses, where the syntax structures include sequential structures, decision structures, loop structures, and so on, and a code offset address gives the position of a syntax structure in the source code; mapping each syntax structure as a key and its offset address as a value into segment-tree nodes to build a segment tree, an efficient interval-query data structure through which the syntax structure at any position in the source code can be located quickly; locating the corresponding syntax-tree node by interval query on the segment tree using a code offset address, where the syntax-tree nodes represent the syntax structure of the source code and yield its sequential, decision, and loop structures; extracting the sequential, decision, and loop structures according to the syntax-tree nodes and mapping their statements into basic blocks, where a basic block is a branch-free set of sequentially executed statements serving as the basic unit of the control-flow graph, each block containing one or more statements with no branch jumps between them; analyzing the execution-precedence relationships and conditional-jump relationships between the basic blocks, the former giving the execution order of the blocks and the latter the branch-jump conditions between them; and generating, from the basic blocks, the execution-precedence relationships, and the conditional-jump relationships, a control-flow graph representing the source code's control flow. The control-flow graph is a directed graph whose nodes are basic blocks and whose edges are the execution-precedence and conditional-jump relationships between them; it describes the flow and logical structure of the source code's execution.
The control-flow graph is constructed as follows: the source code is analyzed by the recursive-descent algorithm to generate syntax-structure data and code-offset-address data; these are mapped into segment-tree nodes to build the segment tree; a code offset address is used for an interval query on the segment tree to locate the corresponding syntax-tree node; the syntax-tree node data is used to extract the sequential, decision, and loop structures in the source code; the extracted structures are mapped into basic blocks; the basic-block data is used to analyze the execution-precedence and conditional-jump relationships between blocks; and the basic blocks together with these relationships are used to generate the control-flow graph.
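A heavily reduced sketch of this pipeline follows, using Python's ast module in place of a hand-written recursive-descent parser and handling only straight-line code and if statements.

import ast

def build_cfg(source):
    # Returns (blocks, edges): blocks are lists of AST statements,
    # edges are (from_index, to_index) pairs covering both execution
    # precedence and conditional jumps.
    tree = ast.parse(source)
    blocks, edges = [], []

    def new_block(stmts):
        blocks.append(stmts)
        return len(blocks) - 1

    def walk(body, pred):
        run = []
        for stmt in body:
            if isinstance(stmt, ast.If):
                if run:
                    b = new_block(run)
                    edges.append((pred, b))
                    pred, run = b, []
                cond = new_block([stmt.test])        # decision block
                edges.append((pred, cond))
                then_end = walk(stmt.body, cond)     # true branch
                else_end = walk(stmt.orelse, cond)   # false branch
                join = new_block([])                 # join point
                edges.append((then_end, join))
                edges.append((else_end, join))
                pred = join
            else:
                run.append(stmt)
        if run:
            b = new_block(run)
            edges.append((pred, b))
            pred = b
        return pred

    entry = new_block([])
    walk(tree.body, entry)
    return blocks, edges

blocks, edges = build_cfg(
    "x = 1\nif x > 0:\n    y = 2\nelse:\n    y = 3\nz = y\n")
print(len(blocks), edges)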
Recursive descent is a parsing algorithm: it recognizes the syntax structure of the source code by recursively invoking procedures that match the input against grammar rules. The application uses a recursive-descent algorithm to analyze the code and generate a parse tree. The syntax structure of the source code must be obtained to support the subsequent static analysis, and the code offset address of each syntax structure is recorded to identify the constituent structure of the source. Analyzing source code with a recursive-descent algorithm to obtain syntax structures and their offset addresses yields a structured representation of the source that supports static analysis, compilation, code transformation, and other processes, with wide application in source-code editors, code analysis, and similar scenarios.
A segment tree is a binary-search-tree data structure for storing interval information. It divides an interval into a series of disjoint subintervals, each corresponding to a leaf node of the tree. In this application, the source code's syntax structures serve as keys and their code-offset-address intervals as values, stored in the segment tree's leaf nodes; through the tree it can be determined quickly whether an offset address lies in the interval of a given syntax structure. An interval query searches a data set for whether an element lies within an interval, or counts the elements in an interval. Here, a segment tree representing the source code's syntax structure has been built; a code offset address is given as input; the tree is queried to determine which interval the offset address falls in; and the query identifies the syntax-structure node corresponding to the address. The segment tree supports efficient interval queries with O(log n) time complexity.
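Because the syntax-structure intervals are disjoint, the interval query itself can be illustrated with a sorted list and binary search, which gives the same O(log n) bound as the segment tree it stands in for.

import bisect

class SyntaxIntervalIndex:
    def __init__(self, intervals):
        # intervals: (start_offset, end_offset, structure_name) tuples
        self.intervals = sorted(intervals)
        self.starts = [s for s, _, _ in self.intervals]

    def lookup(self, offset):
        # Find the rightmost interval starting at or before the offset.
        i = bisect.bisect_right(self.starts, offset) - 1
        if i >= 0:
            start, end, name = self.intervals[i]
            if start <= offset <= end:
                return name
        return None

idx = SyntaxIntervalIndex([(0, 14, "sequence"), (15, 60, "if-statement"),
                           (61, 120, "while-loop")])
print(idx.lookup(42))    # -> "if-statement"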
A sequential structure is code executed line by line in order, with no jumps. A decision structure selects between code branches according to the result of a condition, such as if-else. A loop structure repeatedly executes a piece of code, such as for and while loops. In this application, the basic syntax structures in the source code are identified by analyzing the obtained syntax tree: the sequential parts execute in order, the decision structures control the execution flow by condition, and the loop structures iterate code repeatedly. Having identified these three kinds of structure, the control flow of the code can be analyzed further. A basic block is a code fragment consisting of sequentially executed statements with no internal branch or jump statements. Here, the statements of a sequential structure map to one basic block; the statements of the different branch paths of a decision structure map to different basic blocks; and the statements inside a loop body map to a basic block. Each basic block consists of a series of sequential statements that can execute in order; the blocks are connected by branch and jump statements; and the whole program flow graph can be represented as basic blocks and the flow relationships between them.
The execution-precedence relationship gives the execution order among basic blocks: which block executes directly after another finishes. The conditional-jump relationship is the relationship in which a basic block jumps directly to other blocks according to the result of a condition. In this application, the execution order of the basic blocks is analyzed to build the basic control-flow structure: the successor block each block can directly execute is determined; the basic blocks corresponding to each conditional statement and to its different outcomes are found; and a complete control-flow relationship graph between the basic blocks is constructed. The control-flow graph reflects the various paths the program may take at run time.
Specifically, each basic block serves as a node of the control-flow graph, and the execution-precedence relationships between blocks are represented as its directed edges. For conditional-jump relationships, branch edges are drawn from the block that evaluates the condition to the different successor blocks. This process is repeated to connect the precedence and conditional-jump relationships of every basic block; for a loop structure, a directed edge is added from the loop exit back to the loop entry. The result is a directed graph covering the entire program control flow, visually reflecting the program's possible execution paths. Further program analysis, such as data-flow analysis and program slicing, can then be performed on top of the control-flow graph, and combining the graph with the program's syntax tree gives a comprehensive picture of program structure and behavior.
Further, the segment tree adopts a self-balancing binary search tree (AVL tree), as follows: the syntax structures and code offset addresses are mapped into AVL tree nodes to build an AVL tree; the AVL tree is a self-balancing binary search tree that keeps the tree balanced and improves query efficiency; each node contains a syntax structure, a code offset address, and balance-factor information; when a new syntax structure and offset address are inserted, the tree's balance is maintained by rotation operations; when an insertion unbalances the tree, the structure is rebalanced by left, right, or double rotations; the balance factor is the difference between the heights of a node's left and right subtrees and is used to judge the tree's balance; the AVL tree's lookup operation locates the syntax structure corresponding to a code offset address quickly, with O(log n) time complexity, where n is the number of nodes in the tree; comparing the query offset address with a node's offset address determines the search direction until the target node is found or the search fails; and the syntax-structure information of the found node is used for the subsequent control-flow-graph construction.
The procedure for using an AVL tree as the segment tree is: the syntax-structure data and code-offset-address data are mapped into AVL tree nodes to build the tree; new syntax-structure and offset-address data are inserted into the tree, with rotations maintaining balance; a code offset address is used for a lookup in the tree to locate the corresponding syntax structure; the found node yields the corresponding syntax-structure information; and that information is used to construct the control-flow graph.
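A compact AVL sketch keyed by code offset address, with the syntax structure as the payload; the standard insertion rotations keep the tree balanced.

class AVLNode:
    def __init__(self, offset, structure):
        self.offset, self.structure = offset, structure
        self.left = self.right = None
        self.height = 1

def height(n):
    return n.height if n else 0

def balance(n):
    return height(n.left) - height(n.right)    # the balance factor

def update(n):
    n.height = 1 + max(height(n.left), height(n.right))

def rotate_right(y):
    x = y.left
    y.left, x.right = x.right, y
    update(y)
    update(x)
    return x

def rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x
    update(x)
    update(y)
    return y

def insert(node, offset, structure):
    if node is None:
        return AVLNode(offset, structure)
    if offset < node.offset:
        node.left = insert(node.left, offset, structure)
    else:
        node.right = insert(node.right, offset, structure)
    update(node)
    b = balance(node)
    if b > 1 and offset < node.left.offset:      # left-left case
        return rotate_right(node)
    if b > 1:                                    # left-right case
        node.left = rotate_left(node.left)
        return rotate_right(node)
    if b < -1 and offset >= node.right.offset:   # right-right case
        return rotate_left(node)
    if b < -1:                                   # right-left case
        node.right = rotate_right(node.right)
        return rotate_left(node)
    return node

def find(node, offset):
    # O(log n) lookup of the syntax structure at a given offset.
    while node:
        if offset == node.offset:
            return node.structure
        node = node.left if offset < node.offset else node.right
    return None

root = None
for off, s in [(10, "sequence"), (20, "if"), (30, "while")]:
    root = insert(root, off, s)    # 10,20,30 triggers a left rotation
print(find(root, 20))              # -> "if"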
Further, mapping the statements of the obtained sequential, decision, and loop structures into basic blocks proceeds as follows. For a sequential structure: its statement sequence maps into a basic block G1, whose statements execute strictly in order with no branch jumps; the sequential-structure data is thus converted into basic-block G1 data representing a sequentially executed statement sequence. For a decision structure: it is split into a condition part and an execution body; the condition maps into a basic block G2, which contains the condition's evaluation statement and determines the execution flow according to whether the condition is true or false; the execution body maps into a basic block G3, which contains the statement sequence executed selectively according to the determined flow; the decision-structure data is converted into condition data and execution-body data, the condition data mapping into basic-block G2 data (the condition's evaluation statement) and the execution-body data into basic-block G3 data (the selectively executed statement sequence). For a loop structure: it is split into a loop head and a loop body; the loop head maps into a basic block G4, which contains the evaluation statement of the loop condition and determines whether the loop continues according to whether that condition is true or false; the loop body maps into a basic block G5, which contains the statement sequence in the loop body, executed repeatedly until the loop condition is false; the loop-structure data is converted into loop-head data and loop-body data, mapping into basic-block G4 data (the loop condition's evaluation statement) and basic-block G5 data (the statements in the loop body).
The process of mapping the sequential, decision, and loop structures into basic blocks is: the sequential-structure data maps into basic-block G1 data and flows into the subsequent control-flow-graph construction; the decision-structure data is converted into condition data and execution-body data, which map into basic-block G2 and G3 data respectively and flow into the construction; the loop-structure data is converted into loop-head and loop-body data, which map into basic-block G4 and G5 data respectively and flow into the construction; and the data of basic blocks G1 through G5 flows into the control-flow-graph construction process to generate the complete control-flow-graph data.
Furthermore, the relational database adopts MySQL or PostgreSQL to store the sliced code-fragment data, which includes the fragment's identifier, code content, dependency relationships, and other information. The fragment data is structured into the table form of the relational database and inserted via SQL statements into the MySQL or PostgreSQL database for persistent storage. The in-memory database adopts Redis to store the code fragments' metadata and index information: the metadata includes the fragment's identifier, length, checksum, and so on; the index information includes the fragment's keywords, tags, dependency relationships, and so on. The metadata and index information are serialized into the key-value format supported by Redis and stored into the Redis database with SET or HASH commands, enabling fast access and queries.
The storage process of the code-fragment data in the relational and in-memory databases is: the sliced fragment data is structured and converted into the table structures and SQL statements supported by the relational database; the fragment data is inserted via SQL into the MySQL or PostgreSQL database for persistent storage; the fragments are organized and managed as tables in the relational database, supporting queries and operations through SQL; the fragments' metadata and index information are serialized into Redis's key-value format; they are stored into the Redis database with Redis SET or HASH commands for fast access and queries; and they are organized and managed in Redis as key-value pairs, supporting efficient reads and writes through Redis commands.
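A storage sketch combining both sides follows; the code_fragment schema and the Redis key layout are illustrative assumptions, with psycopg2 and redis-py as assumed client libraries.

import hashlib
import json
import redis
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS code_fragment (
    fragment_id TEXT PRIMARY KEY,
    content     TEXT NOT NULL,
    depends_on  TEXT[]
)
"""

def store_fragment(pg_conn, r, fragment_id, content, depends_on, tags):
    # Persist the fragment itself in the relational database.
    with pg_conn.cursor() as cur:
        cur.execute(DDL)
        cur.execute(
            "INSERT INTO code_fragment (fragment_id, content, depends_on) "
            "VALUES (%s, %s, %s) ON CONFLICT (fragment_id) DO UPDATE "
            "SET content = EXCLUDED.content, depends_on = EXCLUDED.depends_on",
            (fragment_id, content, depends_on),
        )
    pg_conn.commit()
    # Metadata (identifier, length, checksum) goes into a Redis HASH.
    r.hset("frag:" + fragment_id, mapping={
        "length": len(content),
        "checksum": hashlib.sha256(content.encode()).hexdigest(),
    })
    # Index information (keywords/tags, dependencies) as a plain key.
    r.set("frag:" + fragment_id + ":index",
          json.dumps({"tags": tags, "depends_on": depends_on}))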
Further, the in-memory database adopts Redis as follows: a cluster of multiple Redis nodes is built, providing distributed storage and highly available access to the data; each Redis node is an independent Redis instance running on a different server; the nodes synchronize and back up data through master-slave replication, ensuring data consistency and reliability; the master node handles write requests and the slave nodes handle read requests, separating reads from writes; when the master node fails, a slave node can be switched automatically to become the new master, keeping the cluster highly available. When a data-access request is sent to the cluster, a consistent-hashing algorithm determines the target node for the request: the algorithm computes a hash value from the request's key and maps it to a Redis node's hash slot; each node is responsible for the data requests within its hash-slot range, balancing the load of requests; and when a node fails, its hash slots automatically migrate to other available nodes, keeping the data accessible. Each Redis node sets different data expiration times (TTLs), achieving automatic data eviction and memory optimization: different TTLs are set for different types of data, distinguished by importance and access frequency; when data has been stored longer than its TTL, Redis automatically deletes it and frees the memory; and setting reasonable TTLs prevents excessive data in Redis from affecting the performance and stability of the system.
The Redis-cluster-based storage and access process for code fragments is: the fragments' metadata and index information are serialized into Redis's key-value format; the serialized data is mapped to a target Redis node by the consistent-hashing algorithm; write requests go to the master node, which writes the data into memory and synchronizes it to the slave nodes; read requests go to a slave node, which reads the data from memory and returns it to the requester; when data outlives its TTL, the Redis node automatically deletes it and frees the memory; and when a node fails, the hash slots it was responsible for migrate automatically to other available nodes, keeping the data accessible.
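The routing step can be illustrated with a consistent-hash ring with virtual nodes; note that this sketches the general algorithm the text names, while Redis Cluster itself partitions keys across 16384 fixed hash slots.

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        # vnodes virtual points per node smooth out the key distribution.
        self.ring = []
        for node in nodes:
            for i in range(vnodes):
                h = self._hash(node + "#" + str(i))
                bisect.insort(self.ring, (h, node))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # The first ring point clockwise from the key's hash owns the
        # key; removing a failed node remaps only its keys to the next
        # node on the ring.
        h = self._hash(key)
        i = bisect.bisect(self.ring, (h, ""))
        return self.ring[i % len(self.ring)][1]

ring = ConsistentHashRing(["redis-a:6379", "redis-b:6379", "redis-c:6379"])
print(ring.node_for("frag:42"))    # a key always routes to the same node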
Further, in the executing process, the data and the executing state are stored as checkpoint checkpoints according to a preset period; the preset period can be set according to the scale and the data volume of the task, for example, check points are created after the completion of data processing at intervals of a certain number; the check point data comprise the current data processing result and the execution state information of the task, such as the blood relationship of RDD, the progress of the task and the like; after being serialized, the check point data is stored in a distributed file system HDFS, so that the data is stored in a lasting manner; creating multiple copies of the checkpoint data and storing the copies in different nodes in a distributed manner; specifying a number of copies of the checkpoint data by setting a copy factor (replication factor); spark automatically copies check point data to different HDFS data nodes according to the setting of the copy factors, and redundant storage of the data is realized; the distributed storage of the copies can improve the availability and fault tolerance of the data, and even if part of nodes fail, the data can be recovered from the copies of other nodes; when the node fails or the task fails, recovering the execution state from the latest check point, and recovering the data from the copy of the failed node to continue executing the task; spark monitors the state of nodes in the cluster, and when detecting node failure or task failure, automatically triggers a fault-tolerant recovery mechanism; restoring the execution state of the task according to the latest check point data, wherein the execution state comprises the blood relationship of RDD, the progress of the task and the like; reading check point data from the copy of the failure node, recovering a data processing result, and continuously executing tasks from the breakpoint; fault tolerance recovery is performed through checkpoints and copy mechanisms.
The data protection process of the checkpoint and copy mechanism in Spark is as follows: during task execution, Spark saves the current data processing results and execution state as checkpoint data according to the preset period; after serialization, the checkpoint data is converted into a format suitable for storage and written into the checkpoint directory of HDFS; Spark copies the checkpoint data to different HDFS data nodes according to the replication factor, creating multiple copies; the copy data is distributed across different nodes of the HDFS cluster, realizing redundancy and high availability of the data; when a node fails or a task fails, Spark automatically triggers the fault-tolerant recovery mechanism; Spark reads the latest checkpoint data from the checkpoint directory of HDFS and restores the execution state of the task; Spark reads the checkpoint data from the copies of the failed node and restores the data processing results; according to the recovered execution state and data, Spark continues executing the task from the breakpoint until the task completes.
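A minimal PySpark sketch of the checkpoint side of this mechanism follows; the application name and HDFS checkpoint path are assumptions, and the number of copies is delegated to the HDFS replication factor as described above:

    from pyspark import SparkContext

    sc = SparkContext(appName="checkpoint-demo")
    # Hypothetical HDFS directory; checkpoint files written here are
    # replicated across data nodes per the configured replication factor.
    sc.setCheckpointDir("hdfs:///checkpoints/code-exec")

    rdd = sc.parallelize(range(1000000)).map(lambda x: x * x)
    # checkpoint() truncates the RDD lineage and persists the partitions
    # to the checkpoint directory when the RDD is next materialized.
    rdd.checkpoint()
    print(rdd.count())  # triggers the computation and the checkpoint write

    # On a later node or task failure, Spark restores lost partitions from
    # the checkpoint files instead of recomputing the whole lineage.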
3. Advantageous effects
Compared with the prior art, the application has the advantages that:
A script editor is adopted for code development, and the source code is submitted to the distributed version control system Git for storage and management, realizing version control and collaborative development of the code; by setting multiple code bins and version management engine instances, distributed management and high availability of the code bins are realized;
The source code is sliced using a program slicing method; dependency analysis of the source code and extraction of code fragments are realized by constructing a control flow graph and inserting global variable monitoring code.
In the process of constructing the control flow graph, a recursive descent algorithm is adopted to parse the source code, and a segment tree is used to map and query code offset addresses, improving the efficiency of control flow graph construction.
Metadata corresponding to the source code is stored in a relational database and mirrored into a memory database, with data synchronization performed through log-shipping-based asynchronous replication, realizing persistent storage and fast access of the metadata. The use of the memory database improves the query and access performance of the metadata.
A sandboxed execution engine is constructed based on a virtualization method; source code is loaded from the relational database, and the corresponding metadata is obtained from the memory database using the least recently used (LRU) algorithm, realizing safe execution and resource isolation of the code. Meanwhile, the adoption of the Apache Spark distributed computing framework improves the parallelism and performance of code execution;
In the Apache Spark distributed computing framework, the checkpoint mechanism and a data copy strategy are adopted to protect the data and execution state during execution, improving the fault tolerance and reliability of code execution. By periodically saving checkpoint data and creating data copies, fault recovery and data protection are achieved.
In the use of the memory database Redis, a cluster of multiple Redis nodes is constructed, and data synchronization and backup are performed through master-slave replication, improving the availability and fault tolerance of the data. Meanwhile, a consistent hash algorithm is used to balance the load of data access requests, improving the concurrent processing capability of the system.
Drawings
FIG. 1 is an exemplary flow diagram of a code framing method based on a front-end and back-end plug-in architecture, according to some embodiments of the present description;
FIG. 2 is an exemplary flow chart of bi-directional synchronization of code bin data according to some embodiments of the present description;
FIG. 3 is an exemplary flow chart for slicing source code according to some embodiments of the present description;
FIG. 4 is an exemplary flow diagram of building a control flow graph according to some embodiments of the present description.
Detailed Description
The method and system provided in the embodiments of the present specification are described in detail below with reference to the accompanying drawings.
FIG. 1 is an exemplary flow diagram of a code framework method based on a front-end and back-end plug-in architecture, according to some embodiments of the present description, including: adopting a script editor to develop code and generate source code; submitting the generated source code to the distributed version control system Git for storage and management; slicing the source code using a program slicing method; storing the sliced source code in a relational database; storing metadata corresponding to the source code in the relational database; mirroring the metadata from the relational database into the memory database; the relational database and the memory database perform data synchronization through log-shipping-based asynchronous replication; constructing a sandboxed execution engine based on a virtualization method, loading source code from the relational database, and obtaining the corresponding metadata from the memory database using the least recently used (LRU) algorithm; after assembling the obtained source code and metadata, executing the source code in the sandboxed execution engine; wherein the sandboxed execution engine adopts the Apache Spark distributed computing framework; in the Apache Spark distributed computing framework, the checkpoint mechanism and a data copy policy are employed to protect the data and execution state during execution.
The developer writes the code using an editor (e.g., VS Code) to generate the source code files. The source code is submitted to the Git version control system for management. A code analysis flow is triggered automatically, and the source code files are sliced using the program slicing technique. The sliced code blocks are stored in a code table of the MySQL relational database. Metadata of the source code files (e.g., file path, function name) is stored in a metadata table. The MySQL data is periodically synchronized to the Redis in-memory database using log shipping. A virtual machine sandboxed execution environment is created in the Spark cluster. The Spark task loads the source code data from the code table and obtains the corresponding metadata from the LRU cache in Redis. After the source code and metadata are assembled, the program is executed in the sandbox and the results are recorded. The source code execution results are fed back and stored in MySQL.
FIG. 2 is an exemplary flow chart of bi-directional synchronization of code bin data according to some embodiments of the present description. Multiple Linux servers are prepared, each with Git installed and configured. A Git repository is created on each server to host an independent source code item: for example, a Git repository for item X is created on server A, and a Git repository for item Y is created on server B. Each Git repository is initialized, with a master branch created and other setup performed. A Git version management engine instance, which may specifically be a Git daemon, is started on each server. The Git repository paths on each server are configured into the local Git engine instance, so that the Git instance can perform version management operations on the corresponding local repository. Developers interact with the Git instance on the designated server through the Git protocol to read or modify the corresponding code repository. In this way, multiple source code items are distributed across different Git repository servers, and each repository is served by a local Git instance for version management. Mirror synchronization between the Git instances can also be configured to realize synchronous backup of repository data. For example, a Git instance A is deployed on code repository server A, and a Git instance B on server B. An SSH key is generated on instance A and the public key is configured into the authorized_keys file of instance B to enable password-free login; the same public key configuration is performed symmetrically on instance B. Code repositories repo_a and repo_b are created on A and B, respectively. A remote endpoint is added on instance A: git remote add repo_b_mirror git@B:/path/to/repo_b. And on instance B: git remote add repo_a_mirror git@A:/path/to/repo_a. A cron job periodically executes on instance A: git push repo_b_mirror master; and on instance B: git push repo_a_mirror master. In this way, bidirectional mirror synchronization of the code repositories is achieved between Git instances A and B. If either instance fails, synchronized redundant data exists on the other instance, improving availability.
Two servers are prepared, and a Git version management engine instance (such as a Git daemon) is deployed on each. The instance on one server is configured as the master instance (Master), and the other as the standby instance (Slave). The master instance provides the external access service and processes clients' Git requests. The standby instance does not serve external requests; it only synchronizes and mirrors the data of the master instance. Real-time data mirror synchronization is established between the master and standby instances; GlusterFS or another distributed file system may be used. After receiving a client request, the master instance executes the corresponding operation and synchronizes the latest data to the standby instance. When the master instance fails, the standby instance provides failover and takes over external service as the master instance; client requests can transparently switch to the new master instance. After the original master instance is restored, it gradually synchronizes the new master instance's data and resumes the standby role. In this way, high availability of the version management service is achieved through the master-standby and data mirroring mechanisms. Suppose there are a code bin A and a code bin B, running on different Git servers. On code bin A, a remote endpoint named repo_b is configured, pointing to code bin B; on code bin B, a remote endpoint named repo_a is configured, pointing to code bin A. SSH keys are generated in both repositories, public keys are exchanged, and password-free access is configured. On code bin A, the command git push repo_b master pushes the master branch of code bin A to code bin B. On code bin B, the command git pull repo_a master pulls the latest master branch content from code bin A. The push and pull commands are executed periodically through tasks such as cron, achieving automated synchronization between the two code bins: when the master branch of either repository is updated, the update is eventually synchronized to the other repository. A short synchronization interval can also be configured to achieve near real-time mirroring of code bin data.
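A small driver script along these lines could run the periodic push and pull from cron; it is a sketch under the assumption that the repo_b_mirror and repo_a_mirror remotes have been configured as described above:

    import subprocess

    # Remote name follows the configuration described above and is an
    # illustrative assumption.
    PEER_REMOTE = "repo_b_mirror"   # on instance A; use repo_a_mirror on B

    def sync_master(remote):
        # Pull the peer's master first so the local history is current,
        # then push the local master branch to the mirror remote.
        subprocess.run(["git", "pull", remote, "master"], check=True)
        subprocess.run(["git", "push", remote, "master"], check=True)

    if __name__ == "__main__":
        sync_master(PEER_REMOTE)  # scheduled periodically, e.g. via cron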
On the code bin A server, the latest commit record of a source code file is obtained through a git command: git log -1 filename. The commit time of the latest version of the file, i.e., timestamp A, can be parsed from the commit record, and the hash value hash A of that version of the source code can also be obtained. The timestamp A and the hash value hash A are saved to a file or database. A script is written to periodically read the timestamp and hash value information and send it to code bin B over a network connection; a simple Socket connection may be used, or the information may be submitted to a message queue such as Kafka. A corresponding program is configured on the code bin B server to receive the timestamp and hash value from code bin A, and the received timestamp A and hash value hash A are stored locally for subsequent comparison. Code bin B receives the source code file's timestamp A and hash value hash A sent by code bin A, and extracts the timestamp B and hash value hash B of the same source code file in its own repository. Timestamp A and timestamp B are compared; if they are equal, the versions correspond to the same commit time. Hash value hash A and hash value hash B are then compared; if they are not equal, the source code has been modified. In this case, code bin B needs to send the latest source code in its own repository to code bin A. Code bin B may generate a patch package via the Git command: git diff > patch. The patch package patch is then sent to code bin A. After code bin A receives the patch package, the patch may be applied with the command: git apply patch, thereby synchronizing the modifications of code bin B into code bin A.
After code bin A receives the patch package sent by code bin B, it executes: git apply patch, updating itself with code bin B's changes and realizing one-way synchronization from B to A. Code bin B can also periodically push updates to code bin A with a similar command: git push repo_a master. In this way, the repositories on both sides update each other, realizing bidirectional synchronization. A master-standby mode can also be introduced by adding an intermediate proxy layer between code bins A and B. Code bins A and B both connect to the proxy, one of them holding the master node identity, and the proxy is responsible for forwarding code updates between the nodes. If code bin A goes down, the proxy detects this and B is automatically switched to become the master node. After A is restored, operation smoothly switches back, with A as master and B as standby. Thus, through the master-standby proxy, high availability of the code bins can be ensured.
Code bin B periodically obtains the timestamp B and hash value hash B of the latest commit of a source code file in its repository, and sends them to code bin A. After receiving timestamp B and hash value hash B, code bin A extracts the timestamp A and hash value hash A of the same source code file in code bin A. Timestamp B is compared with timestamp A; if they are equal, the process continues. Hash value hash B is compared with hash A; if they differ, the source code file in code bin B has a newer update. Code bin A then pulls the latest source code from code bin B using instructions such as git fetch, and merges the pulled source code into the corresponding branch of code bin A. This completes source code synchronization from code bin B to code bin A. This series of synchronization operations is performed periodically at a certain frequency, and both parties can initiate synchronization, realizing bidirectional synchronization.
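The comparison step might be sketched as follows; the file path and the transport of the peer's (timestamp, hash) pair are assumptions, since the text allows either a plain socket or a queue such as Kafka:

    import subprocess

    def latest_commit_info(path):
        # %ct = commit time (epoch seconds), %H = commit hash, i.e. the
        # timestamp and hash value described above.
        out = subprocess.run(
            ["git", "log", "-1", "--format=%ct %H", "--", path],
            capture_output=True, text=True, check=True,
        ).stdout.split()
        return int(out[0]), out[1]

    def modified_on_peer(local, peer):
        ts_a, hash_a = local
        ts_b, hash_b = peer
        # Equal timestamps but differing hashes: the peer's copy was
        # modified and should be pulled and merged.
        return ts_a == ts_b and hash_a != hash_b

    local_info = latest_commit_info("src/main.py")  # hypothetical file
    # peer_info would arrive over the network from the other code bin.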
FIG. 3 is an exemplary flow chart for slicing source code according to some embodiments of the present description. Parsing the source code and constructing an abstract syntax tree (AST): the source code text is partitioned into a token stream using lexical analysis from compilation theory; the token stream is then parsed to identify syntax structures in the code, such as statements and expressions. While parsing, an abstract syntax tree is constructed to represent the source code structure. The abstract syntax tree stores the syntax information of the program in a tree structure: each node in the tree represents a syntax structure, and its child nodes represent the syntax elements that make up that structure. For example, the child nodes of an if-statement node include the conditional expression, the then block, and the else block. In this way, an abstract syntax tree representation corresponding to the source code text is obtained. Subsequent processing such as semantic analysis, binding of definitions and references, and data flow analysis can be performed on the basis of the AST.
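For illustration, Python's standard ast module performs exactly this lexing-plus-parsing step and exposes the resulting tree (the indent argument requires Python 3.9+); the sample source is an assumption:

    import ast

    source = (
        "x = 1\n"
        "if x > 0:\n"
        "    y = x + 1\n"
        "else:\n"
        "    y = -x\n"
    )

    tree = ast.parse(source)          # lexical + syntax analysis in one call
    print(ast.dump(tree, indent=2))   # the If node carries test, body (then
                                      # block) and orelse (else block) children

    # Each node records its source position, analogous to the offset
    # information used later when constructing the control flow graph.
    if_node = tree.body[1]
    print(if_node.lineno, if_node.col_offset)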
In the recursive traversal of the AST, the type of each node is determined. If the node type is an if statement, while loop, for loop, switch statement, or goto statement, it is marked as a control node. For an if node, its conditional expression is analyzed: if true, its outgoing edge connects to the first statement of the then block; if false, to the first statement of the else block. For a while node, the outgoing edge connects to the first statement of the loop body. For a switch node, outgoing edges connect to the first statement of each case clause. For a goto node, the outgoing edge connects to the jump target statement. The determined outgoing edges are added to the corresponding control nodes. If a statement is not the target of any control node's outgoing edge, an edge is added from it to the next statement. The process is repeated, eventually forming a control flow graph. Starting from the root node of the AST, the entire syntax tree is traversed using a depth-first or breadth-first algorithm. During traversal, the type of each node is determined: for a control node, its outgoing edges are determined according to the method above; for an ordinary statement node, no special processing is required. The outgoing edge information of the current node is recorded, and the child nodes are traversed until the entire AST has been visited. Eventually, each control node records its outgoing edge information. All control nodes and statements are taken as graph nodes, and the control flow transfer relations as edges, forming a directed graph that shows all possible control flow transfer paths of the whole program. Saving this directed graph yields the complete control flow graph (CFG) of the program. Each node in the CFG represents a statement or a control point in the source code: control statements such as if/while are added to the CFG as nodes, and ordinary code statements are also added as nodes. In the CFG, the outgoing edges of a node represent the statements to which control may transfer directly from that node. If there is a direct control transfer relationship between two statements, a directed edge is added between them; if there is no control transfer relationship but they execute sequentially in the program, a directed edge is also added between them. This represents the flow control relationships between the statements of the program. Finally, the CFG is a directed graph containing all possible control flow transfer paths of the program. By adding node and edge relations on the basis of the AST, a CFG representing the program control flow is constructed.
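A compressed sketch of the traversal, using Python AST nodes and line numbers as stand-ins for CFG nodes, might look like this; a full CFG construction would also add the fall-through edges between sequential statements:

    import ast

    class CFGEdges(ast.NodeVisitor):
        # Records (from_line, to_line) pairs for control nodes only.
        def __init__(self):
            self.edges = []

        def visit_If(self, node):
            # True branch: edge to the first statement of the then block.
            self.edges.append((node.lineno, node.body[0].lineno))
            # False branch: edge to the first statement of the else block.
            if node.orelse:
                self.edges.append((node.lineno, node.orelse[0].lineno))
            self.generic_visit(node)

        def visit_While(self, node):
            # Edge from the loop condition into the loop body.
            self.edges.append((node.lineno, node.body[0].lineno))
            self.generic_visit(node)

    code = "i = 0\nwhile i < 3:\n    if i % 2:\n        i += 2\n    else:\n        i += 1\n"
    v = CFGEdges()
    v.visit(ast.parse(code))
    print(v.edges)  # [(2, 3), (3, 4), (3, 6)]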
While constructing the CFG, each time a statement node is traversed, the variables defined and used in it are recorded. When a variable is defined, the variable and the statement are marked as a definition position and added to the definition set Def; when a variable is used, the variable and the statement are marked as a use position and added to the use set Use. The Def and Use sets of each variable are collected, yielding the sets of statements that define and use it. If a variable's definition statement appears only in the global scope, with no function scope restriction, the variable is determined to be a global variable. For a variable determined to be global, the node corresponding to its definition statement is found in the CFG, and an attribute is added to that node, marking it as a global variable definition statement. Thus, through the data flow analysis performed jointly with CFG construction, the definition statement nodes of global variables can be accurately located. When constructing the AST, for each variable definition statement, the variable name it defines is recorded. After AST construction completes, the entire syntax tree is traversed recursively from the root node; when a variable definition statement is reached, it is checked whether the variable it defines has global scope. If the variable is global, the node of its definition statement is marked. When constructing the CFG, a mapping is maintained between each statement node in the CFG and the statement nodes in the AST, so the CFG statement node corresponding to a global variable definition statement marked in the AST can be found and likewise marked as a global variable definition. Thus, the definition positions of global variables are found by first locating the global variable definitions in the AST and then determining the corresponding statement nodes in the CFG. Alternatively, starting from the entry node of the CFG, the entire control flow graph is traversed; for each node, it is checked whether it is a variable definition statement, and if so, whether the variable is global. A marker attribute is added to each statement node that defines a global variable, and the remaining nodes of the CFG are traversed with the same judging and marking process until all nodes have been visited and the definition statements of all global variables in the CFG are marked. After CFG construction completes, the marker attributes can be accessed directly to find the definition statements of all global variables. This provides the precise locations of global variable definition statements in the CFG and supplies reference information for subsequently inserting, before each definition statement, code that records global variable value changes.
When building the CFG, the statement nodes defining global variables have been identified. The CFG is traversed to find each statement node marked as a global variable definition. The predecessor node of this statement node, i.e., the executable statement immediately before the definition statement, is taken, and a print-log statement is inserted at its end. The print-log statement takes the form print(global variable name, value of global variable name), which outputs the current name and value of the global variable. This is repeated for each global variable definition statement in the CFG: a print statement is inserted at the end of the executable statement preceding each definition statement. Through this process, log printing is inserted before all global variable definition statements in the CFG, and the source code so modified outputs the changes of global variables when executed. The code with the inserted print statements is compiled; after compilation passes, the generated executable program is run. During execution, the names and values of global variables are output at the predefined positions, and the logs printed over the whole run of the program are collected. The name and value of each global variable can be seen in the log. From the position of each print statement, it can be determined before which statement's execution the variable value was printed; combined with the changes in variable values, it is possible to analyze when each variable's value changed, and, from the code execution flow, to determine the execution path corresponding to each value change. Finally, the value changes of the global variables and the control flow information of the program execution can be obtained by analyzing the log.
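A toy version of this instrumentation can be written with the ast module; the sample program, and printing before each redefinition rather than via CFG predecessors, are simplifying assumptions (single-line statements only):

    import ast

    source = "count = 0\ncount = count + 1\nprint('done')\n"
    lines = source.splitlines()
    tree = ast.parse(source)

    instrumented, seen = [], set()
    for node in tree.body:
        # Module-level Assign nodes define global-scope variables; emit a
        # log line before each redefinition so the old value is recorded.
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    if target.id in seen:
                        instrumented.append(f"print('{target.id} =', {target.id})")
                    seen.add(target.id)
        instrumented.append(lines[node.lineno - 1])

    program = "\n".join(instrumented)
    print(program)   # shows the inserted monitoring statement
    exec(program)    # prints: count = 0, then done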
When traversing the AST, the definition statement and the use statements of each variable are recorded: definition statements are put into the definition set Def, and use statements into the use set Use. For each variable, and for each statement node in its Use set, tracing proceeds backward from that node, adding each statement that affects it to the slice in turn; when the statement defining the variable is reached, the backtrace stops. The process is repeated until every use statement of every variable has been traced back to its definition statement. Finally, a dependency graph is constructed between each variable and its definition and use statements. For global variables, their definition statements and use statements are looked up directly in the dependency graph. Thus, a dependency graph between variables is obtained through data flow analysis. In the variable dependency graph, the definition statement of a global variable is taken as the starting point: from the definition statement, reverse slicing is performed, tracing all use statements that depend on that definition. When a use statement is reached, it is taken as the end point, and the slice from definition to use is extracted as a segment. This is repeated for each use position of the global variable until all slice segments have been acquired. Finally, multiple slice segments are cut out, bounded by the definition of the global variable and each of its use statements. Each slice segment forms a program slice that can be executed independently and is logically complete. These slice segments are output, constituting the set of sliced program segments. Modular decomposition of the code is thus achieved by means of the program slicing technique, through reverse slicing bounded at the definition statements.
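The def-use bookkeeping and a backward slice can be sketched in a few lines; the sample program is an assumption, and control dependencies are omitted for brevity:

    import ast
    from collections import defaultdict

    source = "a = 1\nb = a + 2\nc = b * a\nprint(c)\n"
    tree = ast.parse(source)

    def_line = {}                  # Def set: variable -> defining line
    depends_on = defaultdict(set)  # line -> defining lines whose values it uses

    for stmt in tree.body:
        for n in ast.walk(stmt):   # Use set: every loaded Name
            if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load):
                if n.id in def_line:
                    depends_on[stmt.lineno].add(def_line[n.id])
        for n in ast.walk(stmt):   # Def set: every stored Name
            if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store):
                def_line[n.id] = stmt.lineno

    def backward_slice(line):
        # Trace from a use back through its definitions, transitively.
        result = {line}
        for dep in depends_on[line]:
            result |= backward_slice(dep)
        return result

    print(sorted(backward_slice(4)))  # slice for the use of c: [1, 2, 3, 4]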
FIG. 4 is an exemplary flow chart of building a control flow graph according to some embodiments of the present description. A parse function parse(x) is defined that receives the source code string and the current offset as input parameters. In the parse(x) function, the character at the current parsing position is obtained from the offset position. According to the grammar rules, the syntax structures that may match the current character are determined. If the match succeeds, a syntax tree node for that syntax structure is created and the starting offset is recorded. parse(x) is then invoked recursively to parse the subsequent substructures according to the grammar. After the substructures have been parsed, the ending offset is computed, completing the parsing of the current node. The current node is returned and the offset is updated to the next position. Finally, all syntax structures are parsed into a syntax tree, and each node stores the offset information of its corresponding code fragment. The syntax structures of the source code and the offset address mapping are thus obtained through a recursive descent algorithm. Segment tree nodes are defined, containing key-value pairs <syntax structure, offset address>. The insertion, deletion, and query operations of an AVL tree are implemented to maintain a balanced tree structure. While parsing the source code, for each syntax structure generated: a corresponding node is created; the node's key is set to the syntax structure and its value to the corresponding offset address interval; and the node is inserted into the AVL tree. The AVL tree guarantees query efficiency through self-balancing. When querying, an offset address is input and the node containing that offset address is searched for on the AVL tree; the corresponding syntax structure can be obtained quickly from the node's value field. Finally, the syntax structure and offset address mapping are stored in the AVL segment tree.
In the segment tree, each node stores an offset address interval and the corresponding syntax structure. For a query offset address offset, the query interval [offset, offset] is constructed, and an interval query on the segment tree finds all nodes intersecting this interval. Among the intersecting nodes, the node whose offset address interval entirely contains offset is selected; from that node's syntax structure field, the syntax structure corresponding to the offset is obtained. Since the interval query time complexity on the segment tree is O(log n), the syntax tree node corresponding to an offset can be located rapidly through segment tree interval queries: even if the source code size is n, the query time is bounded by log n. The syntax tree is then traversed from the root node, and the syntax structure category of each node is determined: if the node is a sequential structure, it is added to the sequential structure list; if it is a judgment structure such as if-else, it is added to the judgment structure list; if it is a loop structure such as while or for, it is added to the loop structure list. The traversal continues recursively until all structures have been extracted. While traversing the syntax tree, the offset interval of each structure in the source code is recorded. Finally, a sequential structure list, a judgment structure list, and a loop structure list are obtained; each list contains the extracted syntax structures of the corresponding type, and each structure contains its syntax tree nodes and source code offset information.
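As a simplified stand-in for the balanced-tree query, a linear stabbing query over the recorded intervals shows the lookup semantics; the interval values are assumptions, and the AVL segment tree described above achieves the same lookup in O(log n) rather than O(n):

    # Each entry: (start_offset, end_offset, syntax_structure), produced
    # by the recursive-descent parse described above.
    intervals = [
        (0, 14, "sequential structure"),
        (15, 80, "judgment structure (if-else)"),
        (30, 55, "then block"),
    ]

    def structures_at(offset):
        # Stabbing query: every structure whose [start, end] contains offset.
        return [s for (lo, hi, s) in intervals if lo <= offset <= hi]

    print(structures_at(40))  # ['judgment structure (if-else)', 'then block']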
For the sequential structure: the statements in the sequential structure are traversed, a new basic block G1 is created, and the statements are added to G1. For the judgment structure: the judgment condition statement is acquired, a basic block G2 is created, and the judgment condition statement is added; the executable statements are acquired, a basic block G3 is created, and the executable statements are added. For the loop structure: the loop condition statement is acquired, a basic block G4 is created, and the loop condition statement is added; the loop body statements are acquired, a basic block G5 is created, and the loop body statements are added. The offset interval of each basic block in the source code is recorded. Finally, the different structures are mapped into one or more basic blocks: the statements within a basic block execute sequentially, and the blocks are connected through condition judgments. As each basic block is mapped, its start and end offsets in the source code are recorded. For the sequential block G1, the statements in it execute in order, and the offset interval of G1 is recorded. For the judgment structure, the condition block G2 executes first, G3 executes conditionally according to the result, and the offset intervals of G2 and G3 are recorded. For the loop structure, the condition block G4 executes first, the loop body G5 executes repeatedly while the condition holds, and the offset intervals of G4 and G5 are recorded. The execution relations among the basic blocks are added to the CFG: the statements in G1 execute in order; G2 executes and determines whether G3 executes; G4 executes and determines whether G5 executes repeatedly. The statements in each basic block execute sequentially, and the different basic blocks are connected through judgment conditions. The source code structures and the CFG basic blocks are mapped to each other using the offset intervals.
While traversing the syntax tree to map basic blocks, the order of appearance of each basic block in the source code is recorded. For a judgment block, it is checked whether it has the two branches of if-else; if there is an else branch, the judgment block is marked as a conditional jump, and its jump target is recorded as the basic block corresponding to the else branch. For a loop structure block, its loop-exit conditional jump is marked, and the jump target of the loop block is recorded as the basic block corresponding to the loop head. When constructing the CFG, one node is created for each basic block, and directed edges between the basic block nodes are connected according to the recorded order of appearance, representing the execution order between the basic blocks. For the judgment blocks and loop blocks marked as conditional jumps, two branch edges are added at the exit of the basic block, pointing to the different successor basic blocks. According to the execution order of the basic blocks, all basic blocks are connected to form a skeleton: each basic block serves as a node, connected by directed edges in execution order, and the conditional jump branch edges are added onto this skeleton. The whole program control flow graph is thus constructed: nodes represent basic blocks, directed edges represent execution order, and branch edges represent conditional jumps. The control flow graph comprehensively represents the statement sequences within the basic blocks, the control flow of their execution order, and the conditional jumps between basic blocks, i.e., it reflects the control flow and execution logic of the source code.
A database is created in MySQL or PostgreSQL for storing metadata. A table is created containing fields such as metadata name, metadata value, and creation time. Database interfaces are provided for inserting, deleting, modifying, and querying metadata. During code parsing and compilation, the interfaces are called to insert metadata such as variable names and function signatures; when metadata is queried and used, the interfaces read it from the database. Prepared statements (PreparedStatement) are used to implement the insert, delete, and modify interfaces, improving efficiency, and indexes are used to optimize query efficiency. A connection pool manages the database connections so that connections are reused and overhead is reduced. In this way, the metadata generated by code parsing is stored persistently in the relational database. Multiple Redis instances are configured as distinct nodes, with a master-slave replication relationship among them: one master node with several slave nodes. The master node handles write operations, and the data is synchronized to the slave nodes; the slave nodes handle read operations, improving read speed. A Redis Cluster is used to realize automatic sharding, partitioning the data across different nodes. Clients connect to the cluster and operate on the data through the Redis Cluster interface: client write operations are routed to the master node, and read operations may be handled by slave nodes. When a master node fails, the slave nodes elect a new master, ensuring high availability. Through master-slave replication and clustering, the scalability and high availability of the Redis memory database are realized, storing the hot-spot metadata produced by code parsing and accelerating multi-system access.
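A sketch of the relational side using parameterized (prepared) statements with the mysql-connector-python driver follows; the connection parameters and table layout are assumptions matching the fields named above:

    import mysql.connector

    # Hypothetical connection parameters.
    conn = mysql.connector.connect(
        host="localhost", user="dev", password="secret", database="metadata_db"
    )
    cur = conn.cursor(prepared=True)

    cur.execute(
        "CREATE TABLE IF NOT EXISTS metadata ("
        " name VARCHAR(255), value TEXT,"
        " created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)"
    )

    # The prepared insert is parsed once and reused for each call,
    # which is the efficiency gain referred to above.
    cur.execute("INSERT INTO metadata (name, value) VALUES (%s, %s)",
                ("function_signature", "def main() -> int"))
    conn.commit()

    cur.execute("SELECT value FROM metadata WHERE name = %s",
                ("function_signature",))
    print(cur.fetchone())
    conn.close()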
A hash value is calculated for each node in the Redis cluster as the node's identification. Each data object also has a hash value computed, sharing one ring with the node hash values. A data object is mapped to the nearest node in the clockwise direction. For a client request, the hash value is computed and the node to which the request maps is located. When a node is newly added, only the data mapped near the new node is affected. A TreeMap is used to store the node hash values, facilitating ordered queries; when computing a data position, the request hash value is compared with the node hash values. Due to the uniform distribution of the hash algorithm, request load balancing is achieved, and the efficient lookup of consistent hashing quickly locates the target node of a request. When the Redis database is created, the dataset expiration policy is configured as volatile-lru. When data is inserted, an expiration time TTL is specified via the Redis command SETEX. Redis records the last access time of data in milliseconds. A periodic deletion thread scans the database key space, comparing each key's last access time with the current time; if the difference exceeds the expiration time TTL, the key and its corresponding value are deleted, ensuring that expired data is cleared regularly and memory space is released. It is also possible to check whether a value has expired when it is fetched, returning empty directly if it has. Life cycle management of different data is realized by setting different TTL times.
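The ring construction and lookup can be sketched as follows; the node names are assumptions, and Python's sorted list with bisect plays the role of the ordered TreeMap mentioned above:

    import bisect
    import hashlib

    class HashRing:
        # Nodes and keys are hashed onto one ring; a key maps to the
        # nearest node in the clockwise direction.
        def __init__(self, nodes):
            self.ring = sorted((self._hash(n), n) for n in nodes)

        @staticmethod
        def _hash(value):
            return int(hashlib.md5(value.encode()).hexdigest(), 16)

        def node_for(self, key):
            h = self._hash(key)
            idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
            return self.ring[idx][1]

    ring = HashRing(["redis-node-1", "redis-node-2", "redis-node-3"])
    print(ring.node_for("code_seg:42"))  # target node for this request

    # TTL side (hypothetical client call): SETEX stores a value with an
    # expiry, cooperating with the volatile-lru eviction policy above:
    # r.setex("code_seg:42", 3600, payload)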
A JDBC connector is used to establish a connection to the relational database. The source code data in the relational database tables is loaded using the Spark DataFrame API, and the loaded DataFrame is registered as a temporary table in Spark SQL. A Redis Spark connector is used to establish a connection to the Redis cluster; the metadata stored in Redis is loaded and registered as a temporary table. A JOIN operation in Spark SQL joins the source code data and metadata tables together, business logic operations such as filtering and aggregation are performed, and the JOIN result DataFrame is registered as a new table. In Spark jobs, the source code data and metadata are read from the new table, and code parsing, compilation, or other analysis tasks are completed. The Spark job reads the assembled source code data and metadata table, and the data is distributed to different executors using SparkContext parallelized datasets. In the JVM sandbox of each executor, the distributed data is assembled: the compiler interface is invoked and the source code is compiled in the sandbox; the code is linked and the runtime interface is invoked to execute the program; the metadata is used for parameter verification and similar checks during execution. The execution results of the executors are collected and returned to the client that submitted the job. Binlog logging is enabled in MySQL/PostgreSQL, and master-slave asynchronous replication is configured in the Redis cluster: the MySQL master's binlog is asynchronously synchronized to the Redis cluster, and Redis master node data changes are asynchronously replicated to MySQL, realizing bidirectional synchronization between the relational database and the memory database.
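A condensed PySpark sketch of the load-join-execute pipeline follows; the JDBC URL, table names, and the inline metadata rows (standing in for a spark-redis load) are assumptions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("code-exec").getOrCreate()

    # Load sliced source code from the relational database over JDBC;
    # URL, table, and credentials are hypothetical.
    code_df = (spark.read.format("jdbc")
               .option("url", "jdbc:mysql://db-host:3306/codebase")
               .option("dbtable", "code_segments")
               .option("user", "dev").option("password", "secret")
               .load())
    code_df.createOrReplaceTempView("code_segments")

    # Metadata would normally be loaded from Redis via a Spark-Redis
    # connector; inline rows stand in for that here.
    meta_df = spark.createDataFrame(
        [(1, "main.py", "main")], ["segment_id", "file_path", "func_name"])
    meta_df.createOrReplaceTempView("metadata")

    # JOIN the source code with its metadata before distributing to executors.
    joined = spark.sql(
        "SELECT c.segment_id, c.source, m.file_path, m.func_name "
        "FROM code_segments c JOIN metadata m ON c.segment_id = m.segment_id")
    joined.show()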

Claims (8)

1. A code framing method based on front-end and back-end plug-in architecture, comprising:
adopting a script editor to develop codes and generating source codes;
Submitting the generated source code to a distributed version control system Git for storage management;
slicing the source code by adopting a program slicing method;
Storing the sliced source codes to a relational database; storing metadata corresponding to the source codes into a relational database;
mirroring the metadata in the relational database to the memory database; the relational database and the memory database adopt an asynchronous replication mode based on log transmission to carry out data synchronization;
Constructing a sandboxed execution engine based on a virtualization method, loading source codes from a relational database, and acquiring corresponding metadata from a memory database by utilizing a least recently used algorithm LRU;
After the acquired source codes and metadata are assembled, executing the source codes in a sandboxed execution engine;
Wherein, the sandboxed execution engine adopts the Apache Spark distributed computing framework; in the Apache Spark distributed computing framework, adopting a checkpoint mechanism checkpoint and a data copy strategy to protect data and an execution state in the execution process;
Submitting the generated source code to a distributed version control system Git for storage management, wherein the method comprises the following steps:
Setting a plurality of code bins, wherein each code bin corresponds to one source code item;
Setting a plurality of version management engine examples, wherein each engine example processes version management of one code bin;
Setting a mirror image synchronization mechanism between the code bins and the code bins, and synchronizing data between the code bins;
setting a main version management engine instance and a plurality of standby version management engine instances;
the source code is submitted to a main version management engine instance to carry out version management;
The master version management engine instance synchronizes the version management log to the slave version management engine instance;
when the main version management engine instance fails, a standby version management engine instance is acquired as a new main version management engine instance;
Setting a mirror image synchronization mechanism between the code bins, and performing data synchronization between the code bins, wherein the mirror image synchronization mechanism comprises the following steps:
acquiring a time stamp A of a source code A in a code bin A;
Obtaining a hash value hash A of a current version of the code bin A;
Sending the time stamp A and the hash value hash A to a code bin B;
receiving a time stamp B and a hash value hash B of a source code B sent by a code bin B;
comparing the received time stamp B and the hash B with the acquired time stamp A and the hash A;
When the time stamp A and the time stamp B are the same, but the hash value hash A and the hash value hash B are different, judging that the source code is modified, and requesting to send the modified source code B to the code bin B;
Receiving a modified source code B sent by a code bin B;
The source code a in code bin a is updated with the received source code B.
2. The front-end and back-end plug-in architecture based code framing method of claim 1, wherein:
Slicing the source code using a program slicing method, comprising:
constructing a control flow graph of the source code through grammar analysis, wherein the control flow graph represents the execution flow of the source code;
Traversing a control flow graph, and inserting a global variable monitoring code at a position where a global scope variable appears for the first time in a source code; the global scope variable represents a variable which is defined in the global scope of the source code, is not limited by any code block or function, and is accessed or modified by any code position in the whole life cycle of program operation; the global variable monitoring code is used for recording the change of the value of the global scope variable in the source code executing process;
Executing the source code inserted with the global variable monitoring code, and acquiring value change data and control flow path data of all program variables in the source code during operation; the value change data of the program variable comprises a variable name and a variable value; the control flow path data includes branches, loops, or function calls through which the source code executes;
acquiring the dependency relationship among the variables in the source code according to the value change data of the program variables and the control flow path data;
Slicing the source code into a plurality of code segments according to the acquired dependency relationship among the variables;
And storing the sliced code segments into a relational database.
3. The front-end and back-end plug-in architecture based code framing method of claim 2, wherein:
Building a control flow graph, comprising:
analyzing the source code by adopting a recursive descent algorithm, and acquiring a grammar structure of the source code and a corresponding code offset address; the grammar structure comprises a sequence structure, a judging structure and a circulating structure; the code offset address represents the location of the syntax structure in the source code;
using the grammar structure as a key, using the code offset address as a value, mapping the source code into segment tree nodes, and constructing a segment tree;
acquiring corresponding tree nodes according to the code offset addresses by adopting interval query of the segment tree;
acquiring a sequence structure, a judging structure and a circulating structure in a source code according to the acquired tree nodes;
Mapping the obtained sentences in the sequence structure, the judging structure and the circulating structure into basic blocks; the basic block represents a branch-free sequential statement set and is used as a basic unit in a control flow graph;
Acquiring an execution precedence relationship and a conditional jump relationship between basic blocks; wherein, the execution precedence relationship represents the execution sequence of the basic blocks; the conditional jump relationship represents a branch jump condition between basic blocks;
and generating a control flow graph representing the source code control flow according to the acquired basic block, the execution precedence relationship and the conditional jump relationship.
4. A front-end and back-end plug-in architecture based code framing method according to claim 3, wherein:
the segment tree is constructed by adopting a self-balancing binary search tree AVL.
5. The front-end and back-end plug-in architecture based code framing method of claim 4, wherein:
Mapping the obtained sentences in the sequence structure, the judging structure and the circulating structure into basic blocks comprises the following steps:
mapping the sentence sequence in the sequential structure into a basic block G1; the sentences in the basic block G1 are sequentially executed in sequence, and branch jumps are not included;
Converting the judging structure into two parts of judging conditions and an executing body; mapping the judging condition into a basic block G2; mapping the execution body into a basic block G3; the basic block G2 comprises an evaluation statement of the judgment condition, and an execution flow is determined according to the true or false of the judgment condition; the basic block G3 contains a sentence sequence which is selectively executed according to the determined execution flow;
converting the circulation structure into a circulation head part and a circulation body part; mapping the cyclic header into a basic block G4; mapping the cyclic body into a basic block G5; the basic block G4 contains an evaluation statement of the loop head condition, and determines whether to execute the loop according to the true or false of the loop head condition; the basic block G5 contains a sequence of statements in the loop body, which is repeated until the loop condition is false.
6. The front-end and back-end plug-in architecture based code framing method of claim 5, wherein:
the relational database adopts MySQL or PostgreSQL database;
The memory database adopts Redis.
7. The front-end and back-end plug-in architecture based code framing method of claim 6, wherein:
The memory database adopts Redis, which comprises the following steps:
Constructing a cluster of a plurality of Redis nodes; each Redis node corresponds to an independent Redis instance; data synchronization and backup are carried out between Redis nodes by adopting a master-slave replication mode;
When data access requests are sent to clusters of a plurality of Redis nodes, a consistent hash algorithm is adopted to obtain target nodes mapped by the access requests;
Different expiration time TTL of the data are set in each Redis node, and when the storage time of the data exceeds the set expiration time TTL, the corresponding data are deleted.
8. The front-end and back-end plug-in architecture based code framing method of claim 7, wherein:
in the Apache Spark distributed computing framework, protecting data and execution state in the execution process by adopting checkpoint mechanism checkpoint and data copy policy, including:
In the execution process, storing data and an execution state as checkpoint data checkpoint according to a preset period;
creating a plurality of copies of checkpoint data checkpoint, and storing the copies in different nodes in a distributed manner;
When a node fails or a task fails, the execution state is restored from the latest checkpoint, and the data is restored from the copies of the failed node to continue executing the task.
CN202410467526.4A 2024-04-18 2024-04-18 Code frame method based on front-end and back-end plug-in architecture Active CN118092885B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410467526.4A CN118092885B (en) 2024-04-18 2024-04-18 Code frame method based on front-end and back-end plug-in architecture

Publications (2)

Publication Number Publication Date
CN118092885A CN118092885A (en) 2024-05-28
CN118092885B true CN118092885B (en) 2024-07-02

Family

ID=91165398

Country Status (1)

Country Link
CN (1) CN118092885B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110781086A (en) * 2019-10-23 2020-02-11 南京大学 Cross-project defect influence analysis method based on program dependency relationship and symbolic analysis
CN113900629A (en) * 2021-09-06 2022-01-07 浪潮软件股份有限公司 Automatic engine implementation system of computer process

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6513154B1 (en) * 1996-10-21 2003-01-28 John R. Porterfield System and method for testing of computer programs in programming effort
CN114443455A (en) * 2020-10-31 2022-05-06 华为技术有限公司 Method and device for detecting dead circulation
CN116860223A (en) * 2022-09-06 2023-10-10 上海电气集团数字科技有限公司 Cloud-protogenesis-based low-code development and delivery method
CN116483700A (en) * 2023-04-04 2023-07-25 南京航空航天大学 API misuse detection and correction method based on feedback mechanism

Also Published As

Publication number Publication date
CN118092885A (en) 2024-05-28

Similar Documents

Publication Publication Date Title
US11914572B2 (en) Adaptive query routing in a replicated database environment
KR102432302B1 (en) Progressive client synchronization
CN106991113B (en) Table replication in a database environment
US6681382B1 (en) Method and system for using virtual labels in a software configuration management system
US5878434A (en) Transaction clash management in a disconnectable computer and network
US8595381B2 (en) Hierarchical file synchronization method, software and devices
Binnig et al. Distributed snapshot isolation: global transactions pay globally, local transactions pay locally
US11531594B2 (en) Data recovery method and apparatus, server, and computer-readable storage medium
Plattner et al. Extending DBMSs with satellite databases
JP2007526543A (en) System and method for website development including journaling and parent map replacement
CN117112692A (en) Mixed distributed graph data storage and calculation method
Asghar et al. Analysis and implementation of reactive fault tolerance techniques in Hadoop: a comparative study
Kraft et al. Apiary: A DBMS-Integrated Transactional Function-as-a-Service Framework
CN118092885B (en) Code frame method based on front-end and back-end plug-in architecture
Li et al. Replichard: Towards tradeoff between consistency and performance for metadata
US11256602B2 (en) Source code file retrieval
Tavares et al. An efficient and reliable scientific workflow system
Suganuma et al. Distributed and fault-tolerant execution framework for transaction processing
Guo et al. Paxos made parallel
Kraft Abstractions for Scaling Stateful Cloud Applications
Wadkar et al. Hadoop concepts
de la Rúa Martínez et al. The Hopsworks Feature Store for Machine Learning
Gupta Exploration of fault tolerance in Apache Spark
Dou The Computation-Data Duality in Non-Deterministic Systems
Pacheco de Sousa Scaling strongly consistent replicated systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant