CN113190269A - Code reconstruction method based on programming context information - Google Patents

Publication number
CN113190269A
Authority
CN
China
Prior art keywords
vector
context
code
programming
identifier
Prior art date
Legal status
Pending
Application number
CN202110408445.3A
Other languages
Chinese (zh)
Inventor
张静宣
骆君鹏
梁嘉慧
刘思远
Current Assignee
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN202110408445.3A
Publication of CN113190269A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/72 Code refactoring
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a code reconstruction method based on programming context information. According to the context information of a programming project, the method analyzes the project's requirement document, design document, defect reports and code structure to form a programming context model, and uses that model to build a list of reconstruction sources for a given identifier. The method comprises the following steps: (1) segmenting and serializing the code context, the annotations, the requirement document and the design document to form text vectors; (2) converting the text vectors into numerical vectors by a machine learning method and embedding the numerical vectors; (3) clustering the resulting vector spaces by a clustering method; (4) inputting an identifier into the programming context model to form a reconstructed alternative ordered list, and recommending the list to the developer. The invention constructs new reconstruction identifiers more effectively and in a more personalized way, improves software quality, helps avoid program defects and corruption, and assists developers in developing programs.

Description

Code reconstruction method based on programming context information
Technical Field
The invention relates to a code reconstruction method, in particular to a code reconstruction method based on programming context information.
Background
Identifier reconstruction is mainly embodied in two aspects: renaming based on naming conventions and renaming based on inconsistency.
Renaming based on naming conventions: a naming convention is a set of rules that guides programmers in naming software entities, in particular the rules for forming identifiers (including the choice of words as well as "grammar" rules). Naming software entities according to a uniform naming convention improves the readability and maintainability of software applications. Naming conventions are context dependent: different programming languages and organizations define their own. For example, the Java language suggests naming entities according to the camel case convention, while C programmers typically use underscores as separators to connect words.
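The idea of checking an identifier against a project's naming convention can be sketched as follows. This is only an illustration: the helper names (`naming_style`, `split_words`, `violates`) and the regular expressions are assumptions made for this example and are not prescribed by the invention.

```python
import re

# Hypothetical patterns; a real project would define its own convention.
SINGLE_WORD = re.compile(r"[a-z][a-z0-9]*")
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)+")
CAMEL_CASE = re.compile(r"[a-z][a-z0-9]*(?:[A-Z][a-z0-9]*)+")


def naming_style(identifier):
    """Classify an identifier as 'any', 'snake_case', 'camelCase' or 'unknown'."""
    if SINGLE_WORD.fullmatch(identifier):
        return "any"  # a single lower-case word fits both conventions
    if SNAKE_CASE.fullmatch(identifier):
        return "snake_case"
    if CAMEL_CASE.fullmatch(identifier):
        return "camelCase"
    return "unknown"


def split_words(identifier):
    """Split an identifier into lower-case words at underscores and case changes."""
    words = []
    for part in identifier.split("_"):
        words.extend(re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", part))
    return [w.lower() for w in words]


def violates(identifier, project_style):
    """Flag an identifier whose style differs from the project's convention."""
    return naming_style(identifier) not in ("any", project_style)
```

For instance, `violates("to_array", "camelCase")` flags the underscore-separated name as a renaming opportunity in a camel case project, while `split_words` recovers the individual words for later comparison.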
Finding renaming opportunities based on a naming convention is a separate matter from the choice or quality of the convention itself: what matters is not the convention but violations of it. In theory, renaming based on a naming convention must first discover the convention used in the software's context, and then identify the identifiers that violate that convention as renaming opportunities.
Renaming based on inconsistency: checking for inconsistent identifiers in software is another key technique for identifying renaming opportunities. Identifiers convey domain concepts that are useful for program understanding. They capture the application-specific knowledge that programmers have when writing code. An appropriate identifier should not only reflect the role of its entity concisely, but also represent a concept consistently throughout the program. However, most programs suffer from inconsistent identifiers.
Existing inconsistency-based approaches identify renaming opportunities from inconsistencies between identifiers and from inconsistencies between entity names and entity implementations. Approaches that address inconsistency between identifiers are further classified into synonym-based, clone-based and similarity-based methods.
Existing research has focused on identifying renaming opportunities, but has neglected two questions: which part of an identifier needs to be renamed, that is, what the sources for renaming an identifier are, and in what way better identifiers should be recommended to the developer. These questions are as important as identifying renaming opportunities. They even determine the degree of interaction with the developer and how readily the developer accepts the recommended identifiers, and they affect the effectiveness and significance of identifier reconstruction.
At this stage, no research has systematically summarized what the effective recommendation sources for identifier reconstruction are or what the effective means of recommendation are. Research in this area is relatively scarce, and further in-depth study is urgently needed.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to provide a code reconstruction method based on programming context information that constructs new reconstruction identifiers in a personalized way and improves software quality.
The technical scheme is as follows: a code reconstruction method based on programming context information analyzes the requirement document, design document, defect report and code structure of a programming project according to the context information of the programming project, and comprises the following steps:
(1) segmenting and serializing the code context, the annotation, the requirement document and the design document to form a text vector;
(2) converting the text vector into a numerical vector by a machine learning method, and embedding the numerical vector to form a vector space;
(3) for the plurality of vector spaces formed, clustering the vector spaces by a clustering method, so that those belonging to the same programming context model are grouped together;
(4) for a given identifier, forming a reconstructed alternative ranking list through the programming context model, and recommending the list to the developer.
Further, in the step (1), before segmentation serialization is performed, the code context is first converted into an abstract syntax tree; the code context, the annotation, the requirement document and the design document are all marked as text vectors; all code contexts are parsed into abstract syntax trees, and each generated abstract syntax tree is traversed by a depth-first search algorithm to obtain two kinds of tokens: one is the node type and the other is the identifier of the node; this prepares for the vector embedding of the next step.
Further, in the step (2), the text vector generated in the segmentation serialization is converted into a numerical vector, and vector embedding is performed to form a vector space, comprising the following steps:
step 21, descriptive text embedding
A text token sequence is provided through the text vector mode of the step (1):
$$NV_{name} \leftarrow E_{PV}(T_N)$$
wherein $E_{PV}$ is the paragraph vector embedding function, which takes a training set $T_N$ of token sequences as input; the output is the name mapping function $NV_{name}: T_N \rightarrow V_N$, in which $V_N$ is the embedded vector space;
step 22, code context embedding
A two-layer neural network is selected, and the model is constructed as follows:
$$TV_B \leftarrow E_W(T_B)$$
wherein $E_W$ is the token embedding function, which takes a training set $T_B$ of token sequences as input; the output is the token mapping function $TV_B: TW_B \rightarrow VB_W$, wherein $TW_B$ is the vocabulary of tokens and $VB_W$ is the vector space into which the tokens of $TW_B$ are embedded;
after sequence embedding, the code context is finally represented as a two-dimensional vector; a given code context $b$ is represented by the token sequence $T_b = (t_1, t_2, t_3, \ldots, t_k)$, wherein $t_i \in TW_B$, $i = 1, 2, 3, \ldots, k$; $V_b$ is the two-dimensional numerical vector corresponding to $T_b$, and $V_b$ is inferred as follows:
$$V_b \leftarrow \int(T_b, TV_B)$$
wherein $\int$ is the function that converts the token sequence of a code context into a two-dimensional vector based on the token mapping function $TV_B$, thus $V_b = (v_1, v_2, v_3, \ldots, v_k) \in V_B$, wherein $v_i \leftarrow TV_B(t_i)$ and $V_B$ is the set of two-dimensional vectors;
step 23, generating vector space
For a code context, the input is the two-dimensional vector generated in step 22, the output is a vector space corresponding to the code context as a whole, and the mapping function is obtained by the following formula:
$$VV_{body} \leftarrow E_{BV}(V_B)$$
wherein $E_{BV}$ is an embedding function that takes the code context vectors $V_B$ as training data and generates the mapping function $VV_{body}$, which is defined as follows:
$$VV_{body}: V_B \rightarrow V_B'$$
wherein $V_B'$ is the embedded vector space of the code context.
Further, in the step (3), the vector space generated in the step (2) is divided into different subsets by k-means clustering; partitional clustering divides the data objects into non-overlapping subsets, so that each data object is in exactly one subset; k-means is used to find a user-specified number k of clusters, and each generated vector space is regarded as a point in a high-dimensional space to complete the clustering task; the k-means clustering operation is as follows:
first, k initial centroids are selected, where k is a user-specified parameter; each point is assigned to the nearest centroid, and the set of points assigned to one centroid is a cluster;
then, the centroid of each cluster is updated according to the points assigned to the cluster; the assigning and updating steps are repeated until the clusters do not change, or equivalently until the centroids do not change;
a proximity metric is used to quantify which centroid is nearest; for points in space, the Euclidean distance is used to assign each point to the nearest centroid;
for data whose proximity metric is the Euclidean distance, the sum of squared errors SSE is taken as the objective function to measure the cluster quality:
$$SSE = \sum_{i=1}^{k} \sum_{x \in C_i} dist(c_i, x)^2$$
where dist is the standard Euclidean distance between two objects in Euclidean space; $C_i$ is the i-th cluster, $c_i$ denotes the centroid of $C_i$, k is the number of clusters, and x is an object.
Further, in the step (4), the identifier to be reconstructed is input into the clustered programming context model to obtain a classification output, namely the classification cluster label of the programming context model corresponding to the identifier; all identifiers that have the same programming context classification cluster label as the identifier are identified, and the vector similarity between these identifiers is calculated to obtain the context environment to which the identifier belongs; identifiers corresponding to other context environments similar to that of the identifier are attached at the same time, and the corresponding identifiers form a reconstructed alternative ordered list.
Compared with the prior art, the invention has the following remarkable effects: 1. programming context information of the code is used to construct a programming context model, so that new reconstruction identifiers are constructed more effectively and in a more personalized way, software quality is improved, program defects and corruption are avoided, and developers are assisted in developing programs; 2. the identifier to be reconstructed is input into the programming context model and a classification label is output as a latent reconstruction source, so that better renaming recommendations are made to the developer; 3. the concept of renaming recommendation is introduced, and a code reconstruction method based on programming context information is proposed.
Drawings
FIG. 1 is a schematic overall flow diagram of the present invention;
FIG. 2 is a diagram of the generic abstract syntax tree and the accurate abstract syntax tree of a code fragment according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and the detailed description.
Programming context data is an important source of information for correcting identifiers, yet the programming context data available for an identifier is not currently used effectively. It falls into three types: programmer information, programming project information and programming environment information, where programmer information and programming project information are each divided into historical information and field information. This information is diverse, massive, fast-changing and variable. In view of these characteristics, the invention studies the structure and meaning of context information and the techniques for extracting and classifying its features. Of the three types, the programming project has the widest variety of context information and is the focus of the analysis in the present invention. For programming project context information, artifacts such as the requirement document, design document, defect reports and code structure of the project are analyzed, and the association between identifiers and these software artifacts is constructed to form a programming context model. For a given identifier, the identifiers whose programming context is consistent with its own can be found in the programming context model and listed as reconstruction sources for it. This builds a reconstruction source list that is more semantically appropriate and personalized, so that better renaming recommendations can be made to the developer. The overall flow of the invention is shown in FIG. 1.
The implementation of the invention comprises the following steps:
(1) Segmentation serialization: the code context (if the identifier is a method name, the code context is the body of that method), the annotations, the requirement document, the design document and the like are segmented and serialized to form text vectors.
(2) Vector embedding: the text vectors are converted into numerical vectors by machine learning methods such as Paragraph Vector and CNNs (Convolutional Neural Networks), and the numerical vectors are embedded to form a vector space.
(3) Clustering output: for the vector spaces formed, a clustering method is selected (for example, one that measures the similarity of two vectors by the Euclidean distance between them in the vector space) to cluster the vector spaces, so that those belonging to the same programming context model are grouped together.
(4) Inputting identifiers into the programming context model: a given identifier is input into the programming context model, which outputs the classification cluster label corresponding to the identifier; all identifiers that belong to the same programming context model as the given identifier are identified, a reconstructed alternative ordered list is formed by calculating the vector similarity between these identifiers, and the list is recommended to the developer.
The detailed implementation process of the invention is as follows:
step one, segmentation serialization
Before the segmentation serialization is carried out, the code context is preprocessed, namely converted into an Abstract Syntax Tree (AST). In a conventionally generated abstract syntax tree, some leaf nodes labeled SimpleName can interfere with feature learning of code fragments. For example, in Tree a of FIG. 2, the variable node list is denoted as (SimpleName, list) and the method node toArray is denoted as (SimpleName, toArray). It can be challenging to distinguish these two nodes among the leaf nodes of the generic abstract syntax tree. For this reason, the invention refines the syntax tree by subdividing such nodes into different node types, and the refined syntax tree is called the accurate abstract syntax tree.
FIG. 2 shows the generic and the accurate abstract syntax tree of a return statement. First, the accurate abstract syntax tree has a simpler structure. Second, different nodes are easier to distinguish in the accurate abstract syntax tree than in the generic one. The node of type String[] is labeled (ArrayType, String[]), the variable node list is simplified to (Variable, list), and the toArray method is simplified to (Method, toArray). The two nodes are therefore easier to distinguish with the accurate abstract syntax tree, and the similarity of the generated text vectors is also reduced.
In the present invention, code contexts, comments, requirement documents, design documents and the like are all marked as text vectors. All code contexts are parsed into accurate abstract syntax trees, and each generated tree is traversed by the Depth First Search algorithm (DFS) to obtain two kinds of tokens: one is the AST node type and the other is the identifier of the node. For example, the code "double fix" is marked as the four-token text vector (PrimitiveType, double, Variable, fix), and in the same way all code contexts can be converted into text vectors of such tokens. For comments, requirement documents and design documents, a text vector mode (ObjectType, Function, ParaType, Output) is designed. ObjectType is the type of the object being described; for example, if a sentence of a comment describes the function of a method, ObjectType is MethodType. Function is the function being described; for example, if the requirement document asks for code that interchanges two numbers, Function is swap. ParaType is the parameter type of the described object; if a sentence describes the addition of two integers, ParaType is int. Output is the output of the described object, and is null if there is no output. With this text vector mode, comments, requirement documents, design documents and the like can be converted into text vectors. Once all programming contexts have been converted into text vectors, the segmentation serialization is complete and the data is ready for the subsequent vector embedding.
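The depth-first traversal that emits a node type followed by the node's identifier can be sketched as follows. The sketch rests on an assumption: it uses Python's standard `ast` module on Python source as a stand-in for the parser of the target language, so the node type names differ from those in FIG. 2, and the helper names `node_label` and `serialize` are hypothetical.

```python
import ast
from typing import Optional


def node_label(node: ast.AST) -> Optional[str]:
    """Return the identifier carried by a node, if it has one."""
    for field in ("name", "id", "attr", "arg"):
        value = getattr(node, field, None)
        if isinstance(value, str):
            return value
    return None


def serialize(source: str) -> list:
    """Depth-first (pre-order) walk that emits node type, then identifier."""
    tokens = []

    def visit(node: ast.AST) -> None:
        tokens.append(type(node).__name__)
        label = node_label(node)
        if label is not None:
            tokens.append(label)
        for child in ast.iter_child_nodes(node):
            # Load/Store markers carry no naming information, so skip them.
            if isinstance(child, ast.expr_context):
                continue
            visit(child)

    visit(ast.parse(source))
    return tokens


tokens = serialize("def to_array(items):\n    return list(items)\n")
```

The resulting flat token list alternates node types with identifiers wherever a node has one, which is the shape of text vector the embedding step expects.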
Step two, vector embedding
The text vectors generated in the segmentation serialization are converted into numerical vectors, and the vectors are embedded to form a vector space. Descriptive statements such as comments, requirement documents and design documents are embedded into vectors by the Paragraph Vector algorithm. For the code context, the text tokens are first embedded into vectors by the Word2Vec technique, and these embedded vectors are then fed to a convolutional neural network, which embeds the whole code context into the vector space. The detailed steps are as follows:
step 21, descriptive text embedding
Paragraph Vector is an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text, such as sentences and paragraphs. It is therefore well suited to embedding descriptive statements such as comments, requirement documents and design documents. Specifically, a text token sequence is provided by the text vector mode set forth in step one:
$$NV_{name} \leftarrow E_{PV}(T_N) \qquad (1)$$
In formula (1), $E_{PV}$ is the paragraph vector embedding function, which takes a training set $T_N$ of token sequences as input; the output is the name mapping function $NV_{name}: T_N \rightarrow V_N$, in which $V_N$ is the embedded vector space. The Paragraph Vector algorithm can embed an entire text sequence into a single vector.
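The mapping in formula (1) takes token sequences of different lengths and returns vectors of one fixed length. The sketch below illustrates only that interface. It is not Paragraph Vector: it uses feature hashing as a simple, untrained stand-in, and the names `embed_tokens` and `DIM` as well as the dimension 16 are assumptions chosen for illustration.

```python
import hashlib
import math

DIM = 16  # assumed embedding size, chosen only for illustration


def embed_tokens(tokens, dim=DIM):
    """Map a token sequence of any length to a vector of length `dim`.

    Stand-in for a trained paragraph embedding: each token is hashed to
    one bucket with a sign, and the result is scaled to unit length.
    """
    vector = [0.0] * dim
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = digest[0] % dim                      # bucket chosen by the hash
        sign = 1.0 if digest[1] % 2 == 0 else -1.0   # sign chosen by the hash
        vector[index] += sign
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else vector


short = embed_tokens(["MethodType", "swap", "int", "null"])
long = embed_tokens(["MethodType", "add", "int", "int"] * 10)
```

Both `short` and `long` have length `DIM` regardless of how many tokens went in, which is the property the later clustering step relies on.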
Step 22, code context embedding
Word2Vec is a two-layer neural network whose main function is word embedding, that is, converting each word into a numerical vector. The model is constructed as follows:
$$TV_B \leftarrow E_W(T_B) \qquad (2)$$
In formula (2), $E_W$ is the token embedding function, which takes a training set $T_B$ of token sequences as input; the output is the token mapping function $TV_B: TW_B \rightarrow VB_W$, wherein $TW_B$ is the vocabulary of tokens and $VB_W$ is the vector space into which the tokens of $TW_B$ are embedded.
After sequence embedding, the code context is finally represented as a two-dimensional vector. A given code context $b$ is represented by the token sequence $T_b = (t_1, t_2, t_3, \ldots, t_k)$, where $t_i \in TW_B$, $i = 1, 2, 3, \ldots, k$; $V_b$ is the two-dimensional numerical vector corresponding to $T_b$, and $V_b$ is inferred as follows:
$$V_b \leftarrow \int(T_b, TV_B) \qquad (3)$$
In formula (3), $\int$ is the function that converts the token sequence of a code context into a two-dimensional vector based on the token mapping function $TV_B$. Thus $V_b = (v_1, v_2, v_3, \ldots, v_k) \in V_B$, where $v_i \leftarrow TV_B(t_i)$ and $V_B$ is the set of two-dimensional vectors.
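Formula (3) amounts to a per-token lookup: each token of the sequence is replaced by its embedding, giving one row per token. A minimal sketch follows, assuming a toy mapping with made-up three-dimensional values in place of a trained Word2Vec model; the zero-vector fallback `UNKNOWN` for out-of-vocabulary tokens and the name `to_matrix` are also assumptions.

```python
# Toy token mapping TV_B: token -> embedding vector (values are made up).
TV_B = {
    "PrimitiveType": [0.9, 0.1, 0.0],
    "double": [0.7, 0.2, 0.1],
    "Variable": [0.1, 0.8, 0.1],
    "fix": [0.0, 0.3, 0.7],
}
UNKNOWN = [0.0, 0.0, 0.0]  # assumed fallback for out-of-vocabulary tokens


def to_matrix(T_b, mapping=TV_B):
    """Return V_b = (v_1, ..., v_k) with v_i = TV_B(t_i)."""
    return [list(mapping.get(t, UNKNOWN)) for t in T_b]


# The code "double fix" from step one, as a k x 3 two-dimensional vector.
V_b = to_matrix(["PrimitiveType", "double", "Variable", "fix"])
```

Here `V_b` has four rows, one per token, and its number of rows grows with the length of the code context, which is why a further embedding step is needed to reach a fixed-size representation.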
Step 23, generating vector space
Descriptive text is a collection of words, and the Paragraph Vector algorithm already generates numerical vectors of the same length from paragraphs of different lengths; these vectors collectively form a space. Vector space generation is therefore described here using a code context (such as a method body) as the example.
For the code context, an additional mapping function is required, where the input is the two-dimensional vector generated in step 22 and the output is a vector space corresponding to the code context as a whole.
The mapping function is obtained by the following formula:
$$VV_{body} \leftarrow E_{BV}(V_B) \qquad (4)$$
In formula (4), $E_{BV}$ is an embedding function (CNNs) that takes the code context vectors $V_B$ as training data and generates the mapping function $VV_{body}$, which is defined as follows:
$$VV_{body}: V_B \rightarrow V_B' \qquad (5)$$
In formula (5), $V_B'$ is the embedded vector space of the code context.
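The text names CNNs as the embedding function but does not give an architecture. As a rough illustration of how a convolutional layer can turn a k-row two-dimensional vector into one fixed-length vector, the sketch below slides filters along the token axis, applies ReLU and keeps the maximum response of each filter. Everything concrete in it is an assumption: a single layer, randomly initialized filters that are never trained, and the names `make_filters` and `conv_max_pool`.

```python
import random


def make_filters(num_filters, width, dim, seed=0):
    """Create untrained filters of shape (width, dim); a real model learns them."""
    rng = random.Random(seed)
    return [[[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(width)]
            for _ in range(num_filters)]


def conv_max_pool(V_b, filters):
    """Slide each filter over the token axis, apply ReLU, keep the maximum."""
    width = len(filters[0])
    dim = len(filters[0][0])
    # Pad short sequences with zero vectors so at least one window exists.
    rows = list(V_b) + [[0.0] * dim] * max(0, width - len(V_b))
    embedding = []
    for f in filters:
        best = 0.0  # ReLU floor: negative activations are clipped to zero
        for start in range(len(rows) - width + 1):
            activation = sum(f[i][j] * rows[start + i][j]
                             for i in range(width) for j in range(dim))
            best = max(best, activation)
        embedding.append(best)
    return embedding


filters = make_filters(num_filters=4, width=2, dim=3)
body_vector = conv_max_pool([[0.9, 0.1, 0.0], [0.7, 0.2, 0.1], [0.1, 0.8, 0.1]], filters)
```

The output length equals the number of filters, independent of how many token rows the code context has, so method bodies of different sizes become points in the same space.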
At this point, the text vector generated in the segmentation serialization is converted into a numerical vector, and vector embedding is performed, so that a vector space is formed.
Step three, clustering output
Cluster analysis groups data objects based only on information found in the data that describes the objects and their relationships. The goal is that objects within a group are similar (related) to one another, while objects in different groups are different (unrelated). The vector space generated as described above needs to be divided into different subsets (clusters) by a clustering technique. Partitional clustering divides data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. k-means clustering is a prototype-based, partitional clustering technique that attempts to find a user-specified number (k) of clusters. In the invention, each generated vector space is regarded as a point in a high-dimensional space, and the clustering task is then completed by k-means clustering. The operation of k-means clustering is described in detail below.
First, k initial centroids are selected, where k is a user-specified parameter, i.e., the desired number of clusters. Each point is assigned to the nearest centroid, and the set of points assigned to one centroid is a cluster. The centroid of each cluster is then updated according to the points assigned to the cluster. The assigning and updating steps are repeated until the clusters do not change, or equivalently until the centroids do not change.
To assign a point to the nearest centroid, a proximity metric is required to quantify the notion of "nearest" for the data under consideration; for points in space, the Euclidean distance is used.
The centroid can vary depending on the proximity metric of the data and the goal of the clustering. The goal of clustering is usually expressed by an objective function that depends on the proximity of the points to one another or to the cluster centroids. For data whose proximity metric is the Euclidean distance, the sum of squared errors SSE is taken as the objective function for measuring the cluster quality:
$$SSE = \sum_{i=1}^{k} \sum_{x \in C_i} dist(c_i, x)^2 \qquad (6)$$
In equation (6), dist is the standard Euclidean distance between two objects in Euclidean space, $C_i$ is the i-th cluster, $c_i$ denotes the centroid of $C_i$, k is the number of clusters, and x is an object.
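The assign-and-update loop and the SSE objective described above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than a tuned implementation: the initial centroids are drawn at random from the data with a fixed seed, an empty cluster keeps its previous centroid, and the function names are hypothetical.

```python
import math
import random


def dist(a, b):
    """Standard Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, seed=0, max_iter=100):
    """Assign each point to its nearest centroid, update centroids, repeat."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda i: dist(p, centroids[i]))
                      for p in points]
        if new_labels == labels:   # clusters unchanged, so centroids are stable
            break
        labels = new_labels
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:            # an empty cluster keeps its old centroid
                centroids[i] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids


def sse(points, labels, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, k=2)
quality = sse(points, labels, centroids)
```

A lower `quality` value means the points lie closer to their centroids; comparing it across several values of k or several seeds is one common way to choose between clusterings.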
Step four, inputting the identifier into the programming context model
By calculating the similarity between vectors, the generated vector space is clustered into a plurality of dissimilar vector subspaces, and each subspace corresponds to a group of similar context environments. The identifier to be reconstructed is input into the clustered programming context model to obtain a classification output. This output contains not only the context environment to which the identifier belongs but also the identifiers corresponding to other context environments similar to it, and these identifiers form the reconstruction alternative ranking list.
The same vector similarity calculation can also be applied to pairs of vectors: the similarity between each reconstruction candidate identifier and the identifier to be reconstructed is calculated, and the candidates are sorted into a reconstruction ranking list that is recommended to the developer. In this way, developers are offered renamings that are more personalized and that better fit the semantics of the identifier and the environment in which it is located.
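The final ranking step can be sketched as follows, assuming each identifier already has an embedded vector and a cluster label from the previous steps. Similarity is measured here by Euclidean distance, with a smaller distance meaning more similar; the identifiers, vectors and labels are made-up illustration data, and `rank_candidates` is a hypothetical name.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two vectors; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_candidates(target, vectors, labels):
    """Rank identifiers in the target's cluster by distance to the target."""
    same_cluster = [name for name in vectors
                    if name != target and labels[name] == labels[target]]
    return sorted(same_cluster,
                  key=lambda name: euclidean(vectors[name], vectors[target]))


# Made-up embedded vectors and cluster labels for four identifiers.
vectors = {
    "fix": [0.10, 0.90],
    "repair": [0.15, 0.85],
    "patch": [0.30, 0.70],
    "render": [0.90, 0.10],
}
labels = {"fix": 0, "repair": 0, "patch": 0, "render": 1}

ranking = rank_candidates("fix", vectors, labels)  # ["repair", "patch"]
```

Only identifiers that share the cluster label of `fix` are considered, so `render` is excluded, and the remaining candidates are ordered from most to least similar.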

Claims (5)

1. A code reconstruction method based on programming context information is characterized in that a requirement document, a design document, a defect report and a code structure of a programming project are analyzed according to the context information of the programming project, and the method comprises the following steps:
(1) segmenting and serializing the code context, the annotation, the requirement document and the design document to form a text vector;
(2) converting the text vector into a numerical vector by a machine learning method, and embedding the numerical vector to form a vector space;
(3) for the plurality of vector spaces formed, clustering the vector spaces by a clustering method, so that those belonging to the same programming context model are grouped together;
(4) for a given identifier, forming a reconstructed alternative ranking list through the programming context model, and recommending the list to the developer.
2. The method according to claim 1, wherein in step (1), before segmentation serialization is performed, the code context is first converted into an abstract syntax tree; the code context, the annotation, the requirement document and the design document are marked as text vectors; all code contexts are parsed into abstract syntax trees, and each generated abstract syntax tree is traversed by a depth-first search algorithm to obtain two kinds of tokens: one is the node type and the other is the identifier of the node.
3. The method according to claim 1, wherein in the step (2), the text vector generated in the segmentation serialization is converted into a numerical vector, and vector embedding is performed to form a vector space, and the method comprises the following steps:
step 21, descriptive text embedding
A text token sequence is provided through the text vector mode of the step (1):
$$NV_{name} \leftarrow E_{PV}(T_N)$$
wherein $E_{PV}$ is the paragraph vector embedding function, which takes a training set $T_N$ of token sequences as input; the output is the name mapping function $NV_{name}: T_N \rightarrow V_N$, in which $V_N$ is the embedded vector space;
step 22, code context embedding
A two-layer neural network is selected, and the model is constructed as follows:
$$TV_B \leftarrow E_W(T_B)$$
wherein $E_W$ is the token embedding function, which takes a training set $T_B$ of token sequences as input; the output is the token mapping function $TV_B: TW_B \rightarrow VB_W$, wherein $TW_B$ is the vocabulary of tokens and $VB_W$ is the vector space into which the tokens of $TW_B$ are embedded;
after sequence embedding, the code context is finally represented as a two-dimensional vector; a given code context $b$ is represented by the token sequence $T_b = (t_1, t_2, t_3, \ldots, t_k)$, wherein $t_i \in TW_B$, $i = 1, 2, 3, \ldots, k$; $V_b$ is the two-dimensional numerical vector corresponding to $T_b$, and $V_b$ is inferred as follows:
$$V_b \leftarrow \int(T_b, TV_B)$$
wherein $\int$ is the function that converts the token sequence of a code context into a two-dimensional vector based on the token mapping function $TV_B$, thus $V_b = (v_1, v_2, v_3, \ldots, v_k) \in V_B$, wherein $v_i \leftarrow TV_B(t_i)$ and $V_B$ is the set of two-dimensional vectors;
step 23, generating vector space
For a code context, the input is the two-dimensional vector generated in step 22, the output is a vector space corresponding to the code context as a whole, and the mapping function is obtained by the following formula:
$$VV_{body} \leftarrow E_{BV}(V_B)$$
wherein $E_{BV}$ is an embedding function that takes the code context vectors $V_B$ as training data and generates the mapping function $VV_{body}$, which is defined as follows:
$$VV_{body}: V_B \rightarrow V_B'$$
wherein $V_B'$ is the embedded vector space of the code context.
4. The method according to claim 1, wherein in step (3), the vector space generated in step (2) is divided into different subsets by k-means clustering, a partitional clustering that divides the data objects into non-overlapping subsets such that each data object belongs to exactly one subset; k-means is used to find a user-specified number k of clusters, treating each generated vector as a point in a high-dimensional space, to complete the clustering task; the k-means clustering operates as follows:
first, k initial centroids are selected, where k is a user-specified parameter; each point is assigned to the nearest centroid, and the set of points assigned to one centroid is a cluster;
then, the centroid of each cluster is updated according to the points assigned to that cluster; the assignment and update steps are repeated until no point changes cluster, or equivalently until the centroids no longer change;
a proximity measure is used to quantify which centroid is nearest; for points in Euclidean space, the Euclidean distance is used to assign each point to its nearest centroid;
for data in Euclidean space, the sum of squared errors (SSE) is taken as the objective function to measure cluster quality:
$SSE = \sum_{i=1}^{k} \sum_{x \in C_i} dist(c_i, x)^2$
where dist is the standard Euclidean distance between two objects in Euclidean space; $C_i$ is the i-th cluster, $c_i$ is the centroid of $C_i$, k is the number of clusters, and x is an object.
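The assignment and update loop of claim 4, together with the SSE objective, can be sketched as follows. This is a minimal pure-Python sketch, not the patent's implementation: picking the initial centroids by random sampling from the data, the `seed` and `max_iter` parameters, and keeping the old centroid when a cluster becomes empty are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def dist(a: Point, b: Point) -> float:
    """Standard Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def mean(points: List[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(points: List[Point], k: int, seed: int = 0,
           max_iter: int = 100) -> Tuple[List[int], List[List[float]]]:
    rng = random.Random(seed)
    # Assumption: the k initial centroids are sampled from the data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda i: dist(p, centroids[i]))
                      for p in points]
        if new_labels == labels:      # no point changed cluster: converged
            break
        labels = new_labels
        # Update step: recompute each centroid from its assigned points.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:               # assumption: keep old centroid if empty
                centroids[i] = mean(members)
    return labels, centroids


def sse(points: List[Point], labels: List[int],
        centroids: List[List[float]]) -> float:
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(dist(centroids[lab], p) ** 2 for p, lab in zip(points, labels))


points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans(points, k=2)
quality = sse(points, labels, centroids)
```

A lower SSE means tighter clusters. Because k-means only reaches a local optimum, a common practice is to run it with several seeds and keep the run with the smallest SSE.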
5. The code reconstruction method based on programming context information according to claim 1, wherein in step (4), an identifier to be reconstructed is input into the clustered context model, which outputs the cluster label of the programming context corresponding to the identifier; all identifiers whose programming contexts carry the same cluster label are retrieved, and the context to which the identifier belongs is then determined by calculating the vector similarity between these identifiers; the identifiers corresponding to other contexts similar to the context of the identifier are also collected, and these identifiers form an ordered list of reconstruction candidates.
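The recommendation step of claim 5 can be illustrated with the following sketch. It rests on several assumptions: the claim says only "vector similarity", so cosine similarity is a choice made here; the query is classified by nearest centroid; and the function name `recommend`, the `top_n` parameter, and the toy vectors and identifier names are all hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def recommend(query_vec: Sequence[float],
              context_vecs: List[Sequence[float]],
              labels: List[int],
              centroids: List[Sequence[float]],
              names: List[str],
              top_n: int = 3) -> List[str]:
    # 1. Classify the query context into its nearest cluster.
    cluster = min(range(len(centroids)),
                  key=lambda i: math.dist(query_vec, centroids[i]))
    # 2. Rank identifiers whose contexts share that cluster label
    #    by similarity to the query context (assumption: cosine).
    ranked = sorted(
        ((cosine(query_vec, vec), name)
         for vec, lab, name in zip(context_vecs, labels, names)
         if lab == cluster),
        reverse=True,
    )
    # 3. Return the ordered candidate list.
    return [name for _, name in ranked[:top_n]]


# Hypothetical toy data: four embedded contexts in two clusters.
context_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
names = ["count", "total", "name", "label"]
labels = [0, 0, 1, 1]
centroids = [[0.95, 0.05], [0.05, 0.95]]

candidates = recommend([0.8, 0.2], context_vecs, labels, centroids, names)
```

Restricting the similarity search to one cluster keeps the ranking cheap, since only identifiers whose contexts share the query's cluster label are compared.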
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110408445.3A CN113190269A (en) 2021-04-16 2021-04-16 Code reconstruction method based on programming context information

Publications (1)

Publication Number Publication Date
CN113190269A true CN113190269A (en) 2021-07-30

Family

ID=76977269

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110408445.3A Pending CN113190269A (en) 2021-04-16 2021-04-16 Code reconstruction method based on programming context information

Country Status (1)

Country Link
CN (1) CN113190269A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116521133A (en) * 2023-06-02 2023-08-01 北京比瓴科技有限公司 Software function safety requirement analysis method, device, equipment and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559025A (en) * 2013-10-21 2014-02-05 沈阳建筑大学 Software refactoring method through clustering
CN111222847A (en) * 2019-12-29 2020-06-02 东南大学 Open-source community developer recommendation method based on deep learning and unsupervised clustering
CN111832289A (en) * 2020-07-13 2020-10-27 重庆大学 Service discovery method based on clustering and Gaussian LDA
CN112417289A (en) * 2020-11-29 2021-02-26 中国科学院电子学研究所苏州研究院 Information intelligent recommendation method based on deep clustering

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
GABRIELA SERBAN ET AL.: "A New k-means Based Clustering Algorithm in Aspect Mining", 《PROCEEDINGS OF THE EIGHTH INTERNATIONAL SYMPOSIUM ON SYMBOLIC AND NUMERIC ALGORITHMS FOR SCIENTIFIC COMPUTING (SYNASC'06)》 *
JINGXUAN ZHANG ET AL.: "Exploring the Characteristics of Identifiers: A Large-Scale Empirical Study on 5,000 Open Source Projects", 《10.1109/ACCESS.2020.3013694》 *
KUI LIU ET AL.: "Learning to Spot and Refactor Inconsistent Method Names", 《2019 IEEE/ACM 41ST INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE)》 *
WHETHER-OR-NOT: "Machine Learning: Cluster Analysis, or 'Birds of a Feather Flock Together' (Hierarchical Clustering, K-means)", 《HTTPS://BLOG.CSDN.NET/WEIXIN_43462348/ARTICLE/DETAILS/102721496》 *

Similar Documents

Publication Publication Date Title
Dinh et al. Clustering mixed numerical and categorical data with missing values
Liu et al. Learning to spot and refactor inconsistent method names
Buratti et al. Exploring software naturalness through neural language models
CN109446338B (en) Neural network-based drug disease relation classification method
CN109344250B (en) Rapid structuring method of single disease diagnosis information based on medical insurance data
US7801924B2 (en) Decision tree construction via frequent predictive itemsets and best attribute splits
CN109726120B (en) Software defect confirmation method based on machine learning
CN111222318A (en) Trigger word recognition method based on two-channel bidirectional LSTM-CRF network
CN105122208A (en) Source program analysis system, source program analysis method, and recording medium on which program is recorded
CN114913938B (en) Small molecule generation method, equipment and medium based on pharmacophore model
CN115438709A (en) Code similarity detection method based on code attribute graph
Ilyas et al. Extracting syntactical patterns from databases
CN101576850A (en) Method for testing improved host-oriented embedded software white box
Bogatu et al. Towards automatic data format transformations: data wrangling at scale
Machanavajjhala et al. Collective extraction from heterogeneous web lists
Babur et al. Towards statistical comparison and analysis of models
Bogatu et al. Towards automatic data format transformations: Data wrangling at scale
CN113190269A (en) Code reconstruction method based on programming context information
US20220156271A1 (en) Systems and methods for determining the probability of an invention being granted a patent
CN115713970A (en) Transcription factor identification method based on Transformer-Encoder and multi-scale convolutional neural network
Efremova et al. A hybrid disambiguation measure for inaccurate cultural heritage data
CN111045716A (en) Related patch recommendation method based on heterogeneous data
Bui Efficient framework for learning code representations through semantic-preserving program transformations
Seca et al. Hierarchical Qualitative Clustering: clustering mixed datasets with critical qualitative information
Chuang et al. Integrating web query results: holistic schema matching

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20210730