CN110928764B - Automated evaluation method for crowdsourcing test report of mobile application and computer storage medium - Google Patents


Info

Publication number
CN110928764B
CN110928764B CN201910957929.6A
Authority
CN
China
Prior art keywords
index
test report
test
defect
normalization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910957929.6A
Other languages
Chinese (zh)
Other versions
CN110928764A (en)
Inventor
姚奕
刘语婵
刘佳洛
顾晓东
杨帆
陈文科
刘伟豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Army Engineering University of PLA filed Critical Army Engineering University of PLA
Priority to CN201910957929.6A priority Critical patent/CN110928764B/en
Publication of CN110928764A publication Critical patent/CN110928764A/en
Application granted granted Critical
Publication of CN110928764B publication Critical patent/CN110928764B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/3668Software testing
    • G06F11/3672Test management
    • G06F11/3692Test management for test results analysis
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an automated evaluation method for mobile application crowdsourcing test reports and a computer storage medium. The method comprises the following steps: 1) input the test report set and the workers' historical credibility, eliminate invalid test reports, and perform word segmentation and stop-word removal on the remaining test reports; 2) cluster the test reports according to the discovered defects to form several classes of defect test reports, and select the grade with the largest proportion as the defect grade of each class; 3) construct several normalization indexes and corresponding step metric functions, evaluate the test reports, and convert the evaluation into a normalization score for each test report; 4) obtain the final score of each test report from the defect grade and the normalization score. The application remedies the lack of content-quality evaluation in existing crowdsourcing test report quality evaluation methods and improves the overall performance of the crowdsourcing test platform.

Description

Automated evaluation method for crowdsourcing test report of mobile application and computer storage medium
Technical Field
The present application relates to an automated evaluation method and a computer storage medium, and more particularly, to an automated evaluation method and a computer storage medium for crowd-sourced test reports of mobile applications.
Background
Mobile application crowdsourcing testing distributes the testing tasks of mobile application software, formerly executed by in-house staff, to anonymous network users over the Internet. Because the diversity and complementarity of participants in the crowdsourcing test mode improve defect discovery efficiency, crowdsourcing testing has attracted wide attention in industry, and many commercial crowdsourcing test platforms (such as Applause, ***MTC, moocTest, Testin and the like) have been developed. The crowdsourcing work mode helps task demanders obtain a large number of free workers and solve practical problems with the workers' collective wisdom. However, some malicious workers, pursuing the maximization of their own benefit, do not work carefully during testing and submit low-quality test results; the test quality is therefore difficult to guarantee effectively and may cause serious losses to task demanders. To address this problem, many researchers have started from the test reports. Some studies attempt to reduce the cost of manual review by reducing the number of test reports reviewed: they propose crowdsourcing test report prioritization, which ranks test reports by their text and screenshot information and helps developers detect as many test reports revealing different defects as possible within limited resources and time. Other studies address fuzzy clustering of crowdsourced test reports: the test reports are divided into clusters by an automated method, and the developer only needs to review one representative test report in each cluster, greatly reducing the number of test reports reviewed.
However, the studies above neglect the impact of test report quality on manual review efficiency, and their review process still depends on manual work rather than achieving automated review. In response to this problem, crowdsourcing test report quality assessment frameworks have been proposed to estimate the quality of test reports automatically; they typically measure the desired characteristics or attributes of defect reports and requirement specifications by defining a series of quantifiable indicators. First, the crowdsourcing test reports are preprocessed using NLP techniques. The framework then defines a series of quantifiable metrics to measure the desired attributes of the test reports and determines a numerical value for each metric based on the text content of each report. Finally, the numerical value of each index is converted into a nominal value (i.e., good or bad) by a stepwise transformation function, and the nominal values of all indexes are aggregated to predict the quality of the test report.
Although existing crowdsourcing test report quality assessment frameworks can, to some extent, measure the quality of test reports from different aspects through quantifiable metrics, the content they assess is still limited to the format normalization of the test report description information and does not involve the specific defects reported. Therefore, such evaluation frameworks only evaluate the normalization of the test report description information and lack an evaluation of the quality of the test results.
Disclosure of Invention
The application aims to: the technical problem to be solved by the application is to provide an automated evaluation method for mobile application crowdsourcing test reports and a computer storage medium, which remedy the lack of content-quality evaluation in existing crowdsourcing test report quality evaluation methods, evaluate worker performance from both the grade of the discovered defect content and the normalization of the test report, accurately measure the quality with which workers complete tasks, remove workers with careless attitudes or malicious reward-seeking, motivate excellent workers to complete tasks with high quality, and improve the overall performance of the crowdsourcing test platform.
The technical scheme is as follows: the application discloses an automated evaluation method for mobile application crowdsourcing test reports, which comprises the following steps:
(1) Inputting a test report set and the historical credibility of workers, eliminating invalid test reports, and performing word segmentation and stop-word removal on the remaining test reports;
(2) Clustering the test reports processed in step (1) according to the discovered defects to form several classes of defect test reports, setting the workers' historical credibility as the grade weight of the defects for weighting, and selecting the grade with the largest proportion as the defect grade of each class of defect test reports;
(3) Constructing several normalization indexes and corresponding step metric functions, evaluating the test reports, and converting the evaluation into a normalization score for each test report;
(4) Obtaining the final score of the test report according to the defect grade in step (2) and the normalization score in step (3);
wherein steps (2) and (3) may be executed in any order.
Further, the method of rejecting invalid test reports in step (1) is as follows:
if the text length of the defect description information of a test report is less than or equal to 4, the test report is rejected; if the defect description information of the test report matches the regular expression ([A][P])|([N][O])|([N][D])..., the test report is rejected.
Further, the clustering method in the step (2) according to the found defects comprises the following steps:
(1) Calculating TF-IDF values of the test report through a TF-IDF algorithm;
(2) Taking the TF-IDF values of all test reports as the clustering data objects O_n = {x_1, x_2, ..., x_n} for clustering, where n is the number of data objects.
Further, the clustering is completed through an MMDBK algorithm, and the specific steps are as follows:
(1) From the data objects O_n = {x_1, x_2, ..., x_n}, selecting the two objects that are farthest apart;
(2) Finding out all objects with the distance from the clustering center smaller than a threshold d through neighbor searching, adding the objects into the proximity class of the center, and recalculating the center of the proximity class;
(3) Calculating DBI clustering indexes;
(4) Judging whether the DBI clustering index is at its minimum; if not, repeating steps (2) and (3), and if so, stopping the loop and classifying the remaining data objects into their nearest classes;
(5) And outputting a clustering result.
Further, the normalization index in the step (3) includes: text length, readability, action words, object words, negation words, ambiguous words, and interface elements;
the text length index and the readability index correspond to a convex expansion metric function, and the convex expansion metric function is that when the index is x 2 And x 3 The index is good when the index is x 1 And x 2 Between or at x 3 And x 4 The index is a middle rating when the index is smaller than x 1 Or greater than x 4 The index is the difference, x 1 、x 2 、x 3 、x 4 Setting parameters;
the action word index and the interface element index correspond to an increased expansion metric function, wherein the increased expansion metric function is characterized in that when the index is smaller than or equal to 1, the index is poor, when the index is larger than 1 and smaller than or equal to 2, the index is medium, and when the index is larger than 2, the index is good;
the object word index and the negative word index correspond to a convex measurement function, wherein the convex measurement function is characterized in that when the index is smaller than or equal to 1 or larger than 2, the index is poor evaluation, and when the index is larger than or equal to 1 and smaller than or equal to 2, the index is good evaluation;
the fuzzy word index corresponds to a descending type expansion metric function, wherein the descending type expansion metric function is good when the index is smaller than or equal to 1, medium evaluation is achieved when the index is larger than or equal to 1 and smaller than or equal to 2, and poor evaluation is achieved when the index is larger than 2.
Further, the conversion method of the normalization score is as follows:
all the test reports are sorted according to the numbers of good, medium and poor ratings; the normalization score when all 7 indexes are rated good is max, the normalization score when all 7 indexes are rated poor is min, and the normalization score of the test report ranked i-th from low to high is (max - min) * i/32 + min.
Further, the final score in step (4) is calculated as:
final score = 0.7 x defect grade score + 0.3 x normalization score.
The computer storage medium of the present application has a computer program stored thereon which, when executed by a computer processor, implements the method described in the present application.
The beneficial effects are that: the application provides an automated evaluation method for mobile application crowdsourcing test reports based on a clustering algorithm and normalization metrics. It evaluates crowdsourcing test reports automatically from both the report content and the report normalization: it is not limited to the format normalization factors of the test report, but also considers the specific defect grades in the report content, so that the normalization and the quality of the test results in a test report are evaluated comprehensively. This allows a crowdsourcing test platform to objectively measure, from multiple aspects, the quality with which crowdsourcing test workers complete test tasks; the whole evaluation process is computed automatically, freeing the evaluation from manual work and greatly improving the efficiency of reviewing test reports.
Drawings
FIG. 1 is an overall flow chart of the method in this embodiment;
FIG. 2 is a graph of four metric functions of a normalization index;
FIG. 3 is a schematic diagram of four extended metric functions of a normalization index.
Detailed Description
In a specific embodiment of the application, the crowdsourcing test reports are first preprocessed, and the evaluation then proceeds from two aspects, the severity of the defects and the normalization of the test reports: on the one hand, the number of discovered defect types is determined by clustering the test reports with the MMDBK algorithm, and the severity level of each type is evaluated according to the workers' historical credibility weights; on the other hand, the normalization score of each test report is calculated from the report description normalization metrics and the discrete metric functions; finally, the two are combined to compute the final score of each test report. The flow of the method of this embodiment is shown in FIG. 1. With a mobile application crowdsourcing test report set TR and the worker historical credibility GU as input, the steps are as follows:
step1, inputting a test report set and the historical credibility of workers, removing invalid test reports through filtering rules, directly evaluating the reports as 0 score, and performing word segmentation and disabling word processing on the rest valid reports.
Owing to laziness and opportunism, some workers deliberately reduce their effort in order to gain more benefit. The algorithm therefore needs to filter out these low-quality or invalid test reports to guarantee the quality of the test reports in the later report analysis. Analysis of multiple test report data sets shows that some special test reports exist in the test report sets, whose Bug description fields contain the following 2 kinds of special sentences: (1) short sentences: the Bug description is blank or contains only a few words without any readable description, and only a few test steps are described; (2) pseudo sentences: the Bug description column mainly states that the test case was executed successfully, that no defect exists and that the test passed. Test reports containing either of the above are invalid test reports: they do not contain any information about software defects. For example, a Bug description field that is empty or of length 1 means the test report is meaningless, and a Bug description field of "no Bug found" indicates that the test passed. Therefore, to increase the processing efficiency of the test report data set, these invalid test reports must be filtered out before processing.
Through analysis and induction, on the one hand, test reports whose description information is too short can be filtered out directly. On the other hand, the description information of test reports containing pseudo sentences usually consists of declarative sentences; by sampling 10% of the reports, it is found that such description information generally contains special strings such as "test passed", "executed successfully", "not found", "no Bug", "no defect found", and the like. From this, the following two filtering rules can be summarized:
(1) If the text length of the Bug description is less than or equal to 4, the test report is filtered out;
(2) If the description information of the Bug matches the regular expression ([A][P])|([N][O])|([N][D])..., the test report is filtered out, where A is a behavior word, P is a positive word, N is a negative word, O is an object word, D is a motion word and Q is a numeral word; the specific words of each category are listed in Table 1.
TABLE 1 regular statement vocabulary
After the invalid test reports are filtered out, the remaining test reports need further preprocessing. Since the test reports are written in Chinese natural language, NLP techniques are used to process them, mainly word segmentation and stop-word removal, where stop-word removal means removing meaningless words from the report. In view of the difficulty of Chinese word segmentation, this step is carried out herein with the Chinese NLP segmentation tool NLPIR in Python.
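A minimal Python sketch of this filtering and preprocessing stage might look as follows. The regular expression uses assumed Chinese equivalents of the example strings quoted above, the stop-word list is an illustrative sample, and jieba is used as a stand-in for the NLPIR segmenter.

import re
import jieba  # stand-in for the NLPIR segmenter mentioned in the text

# Illustrative placeholders: the real vocabulary comes from Table 1, and the real
# pattern follows the ([A][P])|([N][O])|([N][D])... rule described above.
PSEUDO_SENTENCE_PATTERN = re.compile(r"(测试通过)|(执行成功)|(未发现)|(无bug)", re.IGNORECASE)
STOP_WORDS = {"的", "了", "是", "在"}  # assumed small sample of a stop-word list

def is_invalid(bug_description: str) -> bool:
    # Rule 1: description too short; Rule 2: pseudo sentence matched by the pattern
    if len(bug_description.strip()) <= 4:
        return True
    return PSEUDO_SENTENCE_PATTERN.search(bug_description) is not None

def preprocess(reports):
    # Drop invalid reports (scored 0), segment the rest and remove stop words
    valid, scores = [], {}
    for rid, text in reports.items():
        if is_invalid(text):
            scores[rid] = 0.0            # invalid report scored 0 and removed
            continue
        tokens = [w for w in jieba.lcut(text) if w.strip() and w not in STOP_WORDS]
        valid.append((rid, tokens))
    return valid, scores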
Step 2: compute the term frequency and inverse document frequency indexes of the valid report set using the TF-IDF algorithm, take them as the data objects for clustering, and cluster the test report contents with the MMDBK (Max-Min and Davies-Bouldin Index based K-means) clustering algorithm, so that test reports reporting the same defect fall into the same class.
Because the test reports correspond one-to-one to the discovered defects, the method clusters the test reports directly. The TF-IDF value of each preprocessed test report is calculated by the TF-IDF algorithm; the TF-IDF value represents the term frequency and inverse document frequency. The TF-IDF values of all test reports are taken as the clustering data objects O_n = {x_1, x_2, ..., x_n}, and the test reports are clustered with the MMDBK algorithm.
The MMDBK algorithm is a clustering algorithm that improves on the shortcomings of the K-means algorithm: it improves the determination of the number of clusters K and the selection of the K cluster centers by using the Davies-Bouldin Index (DBI) clustering index and the maximum-minimum distance method to determine the optimal number of clusters and to select new cluster centers, thereby ensuring low similarity between classes. The implementation flow is as follows:
step1: from n data objects O n ={x 1 ,x 2 ,...,x n Two objects x furthest apart are selected 1 And x 2 The method comprises the steps of carrying out a first treatment on the surface of the In order to avoid the problem of fuzzy clustering boundaries caused by too close of K-means algorithm in selecting a clustering center, the method selects two objects with farthest distances to form two initial clusters so as to ensure lower similarity between different classes in subsequent calculation.
step2: all objects with a distance from the cluster center less than the threshold d are found by neighbor searching and added into the proximity class of the center, and the center of the proximity class is recalculated.
After the two initial cluster centers are found and updated, the number of clusters must be determined and the remaining K-2 cluster centers found so that the similarity between cluster centers is as low as possible; this is the key of the algorithm. Specifically: with c_1 and c_2 as the two initial cluster centers, compute for each remaining object x_j its distances D_j1 and D_j2 to c_1 and c_2, and let D_k = max{min(D_j1, D_j2)}, j = 1, 2, ..., n. If D_k > θ·D_12, where D_12 is the distance between c_1 and c_2, then x_j is taken as the 3rd cluster center, c_3 = x_j. If c_3 exists, compute D_k = max{min(D_j1, D_j2, D_j3)}, j = 1, 2, ..., n; if D_k > θ·D_12, a 4th cluster center is found. Proceed in this way until D_k > θ·D_12 no longer holds, at which point the search for cluster centers ends.
step3: calculating an updated DBI value, comparing the updated DBI value with the DBI value of the previous round (comparing the first round with the initial value), and if the updated DBI value is smaller than or equal to the value of the previous round, conforming to the circulation condition
The clustering result is evaluated using the Davies-Bouldin Index (DBI). The DBI is a non-fuzzy clustering evaluation index based mainly on two factors, the separation between classes and the cohesion within classes: dissimilarity between different classes should be high, while the similarity of data objects within a class should be high. When the distances among data objects within a class are smaller and the distances between classes are larger, the DBI value is smaller, indicating that the clustering result under that number of clusters is optimal. The inter-class distance and the intra-class distance are calculated as:
d_i,j = ||v_i - v_j||    (1)
S_i = sqrt( (1/|C_i|) * Σ_{x∈C_i} ||x - v_i||² )
where x denotes a data object in the i-th class, v_i denotes the centroid of the i-th class, |C_i| denotes the number of data objects in the i-th class, ||·|| denotes the Euclidean distance, S_i is the standard error between the data objects of the i-th class and the centroid v_i, and d_i,j denotes the Euclidean distance between the centroids of the i-th and j-th classes. The Davies-Bouldin Index is:
DBI = (1/K) * Σ_{i=1..K} max_{j≠i} (S_i + S_j) / d_i,j
where S_i denotes the intra-class distance of the i-th class, S_j the intra-class distance of the j-th class, d_i,j the distance between the i-th and j-th classes, and K the number of clusters.
A good clustering result should have small intra-class distances within each class and large inter-class distances; the smaller the numerator and the larger the denominator, the smaller the DBI value, so the optimal number of clusters can be obtained from this value.
step4: if the conditions are met, searching a new clustering center, and repeating the step (2-3);
step5: if the condition is not met, stopping circulation, and classifying the rest data objects into the nearest classes;
step6: and outputting a clustering result.
After clustering, each class can be regarded as the test report set of a certain defect of the software under test, and the set is graded according to the severity levels that the workers filled in for this defect. Test reports from workers with high historical test credibility are given priority: the historical credibility of the platform participants is normalized, the proportion coefficients of the minor, general, serious and fatal ratings within the defect's test report set are calculated as weight coefficients, and the defect grade with the largest proportion is selected as the final grade of the defect.
The magnitude of the impact is defined as the severity of the software defect, summarized in the following four levels:
(1) Minor: small defects such as typos and poor text layout, which have little influence on the functions; the software can still be used normally.
(2) General: less serious errors, such as minor functional parts not being implemented, a poor user interface and long response times.
(3) Serious: the primary function module is not implemented, the primary functions are partially lost, or the secondary functions are totally lost.
(4) Fatal: defects that can cause system crashes or data loss, the complete loss of primary functions, and the like.
The whole test report data set is clustered into N types of defects, i.e., test report sets Cla_1, Cla_2, ..., Cla_N. Each class contains test reports of 4 severity levels, and the j-th level of the i-th class contains m_i,j test reports (j = 1, 2, 3, 4 denote minor, general, serious and fatal, respectively). Cla_i,j,k denotes the k-th test report of the j-th level of the i-th defect, and U_i,j,k denotes the historical credibility (contribution degree) of the worker who submitted that report; |TR| denotes the number of all test reports, i.e., the number of workers. The contribution degrees are standard-normalized. The mean of all workers' historical credibility is:
μ = (1/|TR|) * Σ U_i,j,k
The variance of the workers' historical contribution values is:
σ² = (1/|TR|) * Σ (U_i,j,k - μ)²
After normalization, the credibility becomes:
U'_i,j,k = (U_i,j,k - μ) / σ
For each type of defect, the proportion coefficients B_i,1, B_i,2, B_i,3 and B_i,4 of the minor, general, serious and fatal levels are calculated from the normalized credibility values, where δ is the test report coefficient, typically set to 0.8. The level corresponding to B_i = max(B_i,1, B_i,2, B_i,3, B_i,4) is taken as the grade of the i-th defect. Meanwhile, a score is set for each grade: fatal, serious, general and minor correspond to 10, 7.5, 5 and 2.5, respectively.
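A possible Python sketch of this grading step is given below. Since the proportion-coefficient formula itself is not shown in this text, the per-report weight used here (δ plus the z-normalized credibility, clipped at zero) is an assumption; the grade-to-score mapping 10/7.5/5/2.5 follows the description above.

import numpy as np

LEVEL_SCORES = {0: 2.5, 1: 5.0, 2: 7.5, 3: 10.0}   # minor, general, serious, fatal

def grade_defect_class(levels, credibilities, all_credibilities, delta=0.8):
    # levels: worker-reported severity (0..3) of each report in one defect class
    # credibilities: historical credibility of each report's submitter
    # all_credibilities: credibility of every worker, used for normalization
    mu = np.mean(all_credibilities)
    sigma = np.std(all_credibilities) or 1.0
    weights = delta + (np.asarray(credibilities) - mu) / sigma   # assumed weighting with delta

    b = np.zeros(4)                                  # proportion coefficients B_{i,1..4}
    for lvl, w in zip(levels, weights):
        b[lvl] += max(w, 0.0)                        # keep coefficients non-negative
    grade = int(b.argmax())                          # level with the largest proportion
    return grade, LEVEL_SCORES[grade]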
Step 3: for each class of test reports, set the workers' historical credibility as the test report grade weight, determine the proportion of each grade in combination with the number of defects of each class, select the grade with the largest proportion as the final grade of the defect, and map that grade to the corresponding defect grade score.
Step 4: starting from the normalization of the report description, define 7 quantifiable normalization indexes and corresponding step metric functions, convert the numerical value of each index into a quality level through its metric function, and obtain the corresponding normalization score by linear interpolation according to the numbers of "good", "medium" and "poor" ratings among the 7 indexes.
The normalization of a test report reflects the ability and attitude of the crowdsourcing test worker in completing the task and is one of the factors for evaluating a worker's test report. To evaluate the normalization of test reports more accurately, 7 quantifiable indexes are constructed to measure the quality of a test report, mainly evaluating the defect description information or the test steps:
(1) Text length: the text length refers here to the number of Chinese characters contained in the defect description information in the test report, and the test report with the text length kept at a proper value is better in quality.
(2) Readability: measures the reading difficulty of the text; the measurement formula is Y = 14.9596·X_1 + 39.07746·X_2 - 2.48·X_3, where X_1 denotes the proportion of difficult words, X_2 the number of sentences, and X_3 the average number of strokes per Chinese character.
(3) Action word: in describing defects, a sequence of actions is often described in the test steps, which are critical to the interface or interface event of the test worker triggering the software. It is therefore necessary to pay attention to action words in the test report such as "open", "click", "exit", and the like.
(4) Object words: when test workers find a flaw in the software, they describe this behavior in terms of something that can represent a systematic error, such as "problem", "flaw", "Bug", etc.
(5) Negative words: when test workers find a defect in the software, they will use negative words to describe the lack of system functionality, such as "lack", "failure", etc.
(6) Ambiguous words: during testing, when a tester encounters a vague or hard-to-define defect, he may prefer to describe it with ambiguous words, which makes the test report difficult to understand. Ambiguous words include "near", "few", "possible", "general", etc.
(7) Interface element: the mobile application software interface is composed of a plurality of interaction components, clicking, inputting, sliding and other operations need to be carried out on the corresponding components during software testing, and interface elements such as 'buttons', 'sliding bars' and the like are necessarily contained when describing action sequences in the steps of describing testing.
Considering that the length of the test steps depends on the complexity of the software, a fixed "proper" text length cannot be used to measure them; therefore, on the preprocessed, word-segmented text, the text length index quantifies only the defect description information, while the remaining 6 indexes quantify the defect description information and the test steps together. Finally, a 7-dimensional index vector is generated for each test report, i.e., each test report can be represented by a 7-dimensional vector. The evaluated quality values are denoted "good", "medium" and "poor", and metric functions are constructed to convert the continuous values into these discrete values.
The metric functions are divided into four types: increasing, decreasing, convex and concave, as shown in FIG. 2:
(1) Increasing metric function: when the index is smaller than x_1, the index is rated poor; when the index is greater than x_1, it is rated good.
(2) Decreasing metric function: when the index is smaller than x_1, the index is rated good; when the index is greater than x_1, it is rated poor.
(3) Convex metric function: when the index is between x_1 and x_2, the index is rated good; when it is smaller than x_1 or greater than x_2, it is rated poor.
(4) Concave metric function: when the index is between x_1 and x_2, the index is rated poor; when it is smaller than x_1 or greater than x_2, it is rated good. However, the above metric functions can only produce the two discrete values good and poor, so they are extended by adding the boundary parameters x_2, x_3 and x_4; the extended metric functions can produce the three discrete values good, medium and poor, as illustrated in FIG. 3.
Table 2 lists the metric function type and parameter settings for each of the 7 indexes. For the increasing and decreasing metric functions, the parameter intervals are written in the form 0-a-b-∞, i.e., three intervals; for the convex and concave metric functions, the parameter intervals are written in the form 0-a-b-c-d-∞, i.e., five intervals.
Table 2 metric function of evaluation index
In test reports, the most suitable text length is 15 to 30 characters; text that is too long or too short affects the quality of the test report, so this index is evaluated with a convex extended metric function. The same holds for readability, and the specific parameters are obtained through experimental tuning. Object words and negative words are represented by explicit counts: if the text contains no object words or negative words, or more than 2 of them, the test report is considered to be of low quality; if it contains 1 or 2, it is considered to be of high quality, following the prototype convex metric function. Action words and interface elements are measured with an increasing extended metric function, i.e., the more such words the text contains, the better the quality of the test report on that index. Only the ambiguous word index uses a decreasing extended metric function: the more ambiguous words, the worse the test report quality.
With the parameters of the metric functions set, each index of a test report yields a rating of "good", "medium" or "poor". To aggregate the ratings of the individual indexes, the three ratings are mapped to score levels: an invalid test report is set to 0 points; the best test report, with 7 "good" ratings, receives the full 10 points; the worst valid test report, with 7 "poor" ratings, receives the bottom score of 1; and the intermediate results are evenly distributed, i.e., the 7 index ratings are converted into a normalization score by linear interpolation. 5 of the indexes can be rated "good", "medium" or "poor" and 2 can only be rated "good" or "poor", so combining the total numbers of "good", "medium" and "poor" ratings over the 7 indexes gives 33 possible quality evaluation results in total. Linear interpolation assigns a score to each of the 33 results: the highest result (7 "good" ratings) scores max, here 10; the lowest (7 "poor" ratings) scores min, here 1; and the result ranked i-th from low to high among all quality evaluation results scores (max - min) * i/32 + min. The resulting normalization quality scores of the test reports are shown in Table 3.
Table 3 normalization score for test report
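To make the conversion concrete, the following Python sketch encodes the extended step metric functions with the parameter pattern described above and the linear interpolation over the 33 rating combinations; the boundary values passed in as params and the ranking key used to order the combinations are assumptions, since the contents of Tables 2 and 3 are not shown in this text.

from itertools import product

def rate(value, func_type, params):
    # Extended step metric functions; params are the boundaries from Table 2 (assumed)
    if func_type == "increasing":                      # action words, interface elements
        return "poor" if value <= params[0] else ("medium" if value <= params[1] else "good")
    if func_type == "decreasing":                      # ambiguous words
        return "good" if value <= params[0] else ("medium" if value <= params[1] else "poor")
    if func_type == "convex":                          # object words, negative words
        lo, hi = params
        return "good" if lo <= value <= hi else "poor"
    if func_type == "convex_extended":                 # text length, readability
        x1, x2, x3, x4 = params
        if x2 <= value <= x3:
            return "good"
        if x1 <= value < x2 or x3 < value <= x4:
            return "medium"
        return "poor"
    raise ValueError(func_type)

# All 33 feasible (good, medium, poor) count combinations over the 7 indexes
# (at most 5 indexes can be rated "medium"); ranked here by an assumed quality key.
COMBOS = sorted({(g, m, 7 - g - m) for g, m in product(range(8), range(6)) if g + m <= 7},
                key=lambda c: (c[0], c[1]))

def normalization_score(counts, max_score=10.0, min_score=1.0):
    # counts = (num_good, num_medium, num_poor); linear interpolation over the 33 ranks
    i = COMBOS.index(counts)
    return (max_score - min_score) * i / 32 + min_score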
Step 5: obtain the final score of the test report as a weighted sum of the defect grade score and the normalization score, with a weight of 0.7 for the defect grade score and 0.3 for the normalization score.
The pseudo code of the program implementation of the method is as follows:
Input: test report set TR, worker historical credibility GU
CTRAEA(TR, GU)
1  for i in range(n)                 // preprocessing stage
2      if TR_i meets a filtering rule
3          mTR_i = 0, delete TR_i    // the invalid report is scored 0 and removed
4  count the number of invalid reports   // this number evaluates the accuracy of the filtering rules
5  newTR = split(TR)                 // word segmentation and stop-word removal for all valid reports
6  CN = Cluster(newTR)               // cluster and evaluate defect grades
7  for i in range(N)                 // traverse each type of defect
8      for j in range(4)             // traverse each level of this type of defect
9          for k in range(m_i,j)     // traverse all test reports of this level
10             accumulate the credibility-weighted proportion coefficient B_i,j
11         B_i = max(B_i,1, B_i,2, B_i,3, B_i,4)   // determine the defect level
12         DG_i = ratio(B_i)         // determine the defect grade score
13 for i in range(m)                 // test report normalization metrics
14     ZB_i = newTERQAF(newTR_i)
15     QG_i = search(ZB_i)
16     Rw_i = a*DG_i + (1-a)*QG_i    // weighted sum of defect grade score and normalization score, a is typically 0.7
17 output mTR_i and Rw_i
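Read together, the sketches above can be chained into a small driver that mirrors this pseudo code. The composition below is an assumption: the functions preprocess, mmdbk_cluster, grade_defect_class and normalization_score come from the earlier sketches rather than from the patent, and the weight a = 0.7 follows step 5.

def evaluate_reports(reports, levels, credibilities, index_counts, a=0.7):
    # reports: {report_id: bug_description}; levels/credibilities: dicts per report id
    # index_counts: {report_id: (num_good, num_medium, num_poor)} over the 7 indexes
    valid, scores = preprocess(reports)                          # step 1
    labels, _ = mmdbk_cluster([tokens for _, tokens in valid])   # step 2
    all_cred = list(credibilities.values())
    for cluster_id in set(labels):
        ids = [rid for (rid, _), lab in zip(valid, labels) if lab == cluster_id]
        _, dg = grade_defect_class([levels[r] for r in ids],
                                   [credibilities[r] for r in ids], all_cred)
        for rid in ids:                                          # steps 3-5
            qg = normalization_score(index_counts[rid])
            scores[rid] = a * dg + (1 - a) * qg
    return scores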
The implementation effect of the application is verified as follows. The experimental data set was obtained from the Kibug crowdsourcing platform, which was established in 2012 and is a platform for distributing, collecting and analyzing crowdsourcing tasks. Test reports from 4 test tasks are collected here, for the applications Drawing Music, Strolling, Internet Clouds and Podcasts.
The embodiment mainly extracts the columns with the attributes "device version", "network", "hierarchy", "Bug description" and "test step" as study fields. When the above 4 crowdsourcing test tasks finished, most defects had been detected, each submitted test report had been audited by a manager and labeled "valid" or "invalid", and the number of defects of each mobile application was recorded by the annotators; the labeling results are summarized in Table 4.
TABLE 4 test report labeling results
1380 test reports were collected in total: 291 for Drawing Music, 408 for Strolling, 238 for Internet Clouds and 443 for Podcasts. The numbers of test reports labeled invalid in the four test report sets are 61, 193, 149 and 238, respectively.
In the preprocessing stage of this embodiment, all test reports are used as the input of the filter, and valid and invalid test reports are screened. The labeled invalid test reports are known to number 61, 193, 149 and 238; after filtering with the two rules, 61, 189, 147 and 232 invalid test reports are correctly filtered out, giving a filtering accuracy of 97.48%-100%, which shows that the filtering rules are effective. However, a few reports are filtered incorrectly: manual inspection finds that some valid test report sentences contain the word "no", match the regular expression, and are therefore filtered out.
After preprocessing is completed, the test reports are clustered and the level of each type of defect is determined by the MMDBK algorithm; the accuracy of the defect evaluation compared with the labeled results is shown in Table 5.
Table 5 defect level assessment results
In Table 5, the accuracy of the defect grade evaluation results is about 90%, with the highest reaching 93.65%. The results show that, after clustering, comprehensively evaluating the submitters' ability together with the defect grade parameters yields high accuracy.
For the text length index of the quality assessment, the specific values of the parameters x_1, x_2, x_3 and x_4 are determined by the controlled-variable method: the values of three parameters are fixed while the value of the remaining one is gradually increased, and the precision of the prediction results is compared using the relative-error evaluation index. The optimal values of the 4 text length parameters obtained in this way are 9, 15, 23 and 32, and the optimal readability parameters are -5, -1, 6 and 12. Finally, the obtained parameters are used for scoring to obtain the comprehensive score of the importance and normalization of each test report. The relative error of each test report score is calculated and averaged as the error of this method for the software evaluation; the relative errors are shown in Table 6 below.
TABLE 6 final score accuracy indicator results
The experimental data show that the average relative error between the scores given by the method of this embodiment and the labeled scores is 9.24%, and the average relative error for each of the 4 applications does not exceed 10%, which demonstrates the accuracy and efficiency of the method. The method can be applied to the test report scoring mechanism of a crowdsourcing test platform, where the quality score of a test report is evaluated automatically, so that the crowdsourcing test platform can objectively measure, from multiple aspects, the quality with which crowdsourcing test workers complete test tasks, reducing the platform's expert cost and increasing commercial benefit. For example, when evaluating the testing of mobile application software, the method can effectively improve evaluation efficiency, reduce evaluation cost, give an objective evaluation from both the report content and the report normalization, and improve the accuracy and reliability of the evaluation results.
Embodiments of the application, if implemented in the form of software functional modules and sold or used as stand-alone products, may also be stored on a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied, in essence or in the part contributing to the prior art, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present application. The storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk. Thus, the present embodiments are not limited to any specific combination of hardware and software.
Accordingly, embodiments of the present application also provide a computer storage medium having a computer program stored thereon. The foregoing mobile application crowd-sourced test report automated assessment method may be implemented when the computer program is executed by a processor. The computer storage medium is, for example, a computer-readable storage medium.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims (6)

1. An automated evaluation method for crowdsourcing test reports of mobile applications is characterized by comprising the following steps:
(1) Inputting a test report set and the historical credibility of workers, eliminating invalid test reports, and performing word segmentation and stop-word removal on the remaining test reports;
(2) Clustering the test reports processed in step (1) according to the discovered defects to form several classes of defect test reports, setting the workers' historical credibility as the grade weight of the defects for weighting, and selecting the grade with the largest proportion as the defect grade of each class of defect test reports;
the method for clustering according to the found defects comprises the following steps:
(2.1) calculating the TF-IDF value of the test report by a TF-IDF algorithm;
(2.2) taking the TF-IDF values of all test reports as the clustering data objects O_n = {x_1, x_2, ..., x_n} for clustering, where n is the number of data objects;
the clustering is completed through an MMDBK algorithm, and the specific steps are as follows:
(2.2.1) from the data objects O_n = {x_1, x_2, ..., x_n}, selecting the two objects that are farthest apart;
(2.2.2) finding out all objects with a distance from the cluster center less than a threshold d by neighbor search and adding the objects to the proximity class of the center, and recalculating the center of the proximity class;
(2.2.3) calculating a DBI cluster index;
(2.2.4) judging whether the DBI clustering index is at its minimum; if not, repeating steps (2.2.2) and (2.2.3), and if so, stopping the loop and classifying the remaining data objects into their nearest classes;
(2.2.5) outputting a clustering result;
(3) Constructing a plurality of normalization indexes and corresponding step type measurement functions, evaluating the test report, and converting the evaluation into a normalization score of the test report;
(4) Obtaining a final score of the test report according to the defect grade in the step (2) and the normalization score in the step (3);
wherein steps (2) and (3) may be executed in any order.
2. The automated mobile application crowdsourcing test report evaluation method of claim 1, wherein the method of rejecting invalid test reports in step (1) comprises:
if the text length of the defect description information of a test report is less than or equal to 4, the test report is rejected; if the defect description information of the test report matches the regular expression ([A][P])|([N][O])|([N][D])..., the test report is rejected.
3. The automated mobile application crowdsourcing test report assessment method of claim 1, wherein the normalization index in step (3) includes: text length, readability, action words, object words, negation words, ambiguous words, and interface elements;
the text length index and the readability index correspond to a convex extended metric function: when the index is between x_2 and x_3, the index is rated good; when the index is between x_1 and x_2 or between x_3 and x_4, it is rated medium; when the index is smaller than x_1 or greater than x_4, it is rated poor, where x_1, x_2, x_3 and x_4 are set parameters;
the action word index and the interface element index correspond to an increasing extended metric function: when the index is smaller than or equal to 1, the index is rated poor; when the index is greater than 1 and smaller than or equal to 2, it is rated medium; when the index is greater than 2, it is rated good;
the object word index and the negative word index correspond to a convex metric function: when the index is smaller than 1 or greater than 2, the index is rated poor; when the index is between 1 and 2 (inclusive), it is rated good;
the ambiguous word index corresponds to a decreasing extended metric function: when the index is smaller than or equal to 1, the index is rated good; when the index is greater than 1 and smaller than or equal to 2, it is rated medium; when the index is greater than 2, it is rated poor.
4. The automated mobile application crowd-sourced test report assessment method of claim 3, wherein the method of converting the normalization score is:
all the test reports are sorted according to the numbers of good, medium and poor ratings; the normalization score when all 7 indexes are rated good is max, the normalization score when all 7 indexes are rated poor is min, and the normalization score of the test report ranked i-th from low to high is (max - min) * i/32 + min.
5. The automated mobile application crowd-sourced test report assessment method of claim 1, wherein the final score in step (4) is calculated as:
final score = 0.7 x defect grade score + 0.3 x normalization score.
6. A computer storage medium having a computer program stored thereon, characterized by: the computer program implementing the method of any of claims 1 to 5 when executed by a computer processor.
CN201910957929.6A 2019-10-10 2019-10-10 Automated evaluation method for crowdsourcing test report of mobile application and computer storage medium Active CN110928764B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910957929.6A CN110928764B (en) 2019-10-10 2019-10-10 Automated evaluation method for crowdsourcing test report of mobile application and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910957929.6A CN110928764B (en) 2019-10-10 2019-10-10 Automated evaluation method for crowdsourcing test report of mobile application and computer storage medium

Publications (2)

Publication Number Publication Date
CN110928764A CN110928764A (en) 2020-03-27
CN110928764B true CN110928764B (en) 2023-08-11

Family

ID=69848814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910957929.6A Active CN110928764B (en) 2019-10-10 2019-10-10 Automated evaluation method for crowdsourcing test report of mobile application and computer storage medium

Country Status (1)

Country Link
CN (1) CN110928764B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2743898C1 (en) 2018-11-16 2021-03-01 Общество С Ограниченной Ответственностью "Яндекс" Method for performing tasks
CN109784637A (en) * 2018-12-13 2019-05-21 华为终端有限公司 Method and apparatus applied to the analysis of processing platform data
RU2744032C2 (en) 2019-04-15 2021-03-02 Общество С Ограниченной Ответственностью "Яндекс" Method and system for determining result of task execution in crowdsourced environment
RU2744038C2 (en) 2019-05-27 2021-03-02 Общество С Ограниченной Ответственностью «Яндекс» Method and a system for determining the result of a task in the crowdsourcing environment
RU2019128272A (en) 2019-09-09 2021-03-09 Общество С Ограниченной Ответственностью «Яндекс» Method and System for Determining User Performance in a Computer Crowdsourced Environment
RU2019135532A (en) 2019-11-05 2021-05-05 Общество С Ограниченной Ответственностью «Яндекс» Method and system for selecting a label from a plurality of labels for a task in a crowdsourced environment
RU2020107002A (en) 2020-02-14 2021-08-16 Общество С Ограниченной Ответственностью «Яндекс» METHOD AND SYSTEM FOR RECEIVING A LABEL FOR A DIGITAL PROBLEM PERFORMED IN A CROWDSORING ENVIRONMENT
CN113743096A (en) * 2020-05-27 2021-12-03 南京大学 Crowdsourcing test report similarity detection method based on natural language processing
CN111815167A (en) * 2020-07-09 2020-10-23 杭州师范大学 Automatic crowdsourcing test performance assessment method and device
CN112527611A (en) * 2020-09-24 2021-03-19 上海趣蕴网络科技有限公司 Product health degree assessment method and system
CN112416780B (en) * 2020-11-25 2022-03-25 南京大学 Crowdsourcing test report processing and classifying method
CN112434518B (en) * 2020-11-30 2023-08-15 北京师范大学 Text report scoring method and system
CN113780366B (en) * 2021-08-19 2024-02-13 杭州电子科技大学 Crowd-sourced test report clustering method based on AP neighbor propagation algorithm

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804319A (en) * 2018-05-29 2018-11-13 西北工业大学 A kind of recommendation method for improving Top-k crowdsourcing test platform tasks
JP2018194810A (en) * 2017-05-15 2018-12-06 ネイバー コーポレーションNAVER Corporation Device controlling method and electronic apparatus
CN109670727A (en) * 2018-12-30 2019-04-23 湖南网数科技有限公司 A kind of participle mark quality evaluation system and appraisal procedure based on crowdsourcing

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018194810A (en) * 2017-05-15 2018-12-06 ネイバー コーポレーションNAVER Corporation Device controlling method and electronic apparatus
CN108804319A (en) * 2018-05-29 2018-11-13 西北工业大学 A kind of recommendation method for improving Top-k crowdsourcing test platform tasks
CN109670727A (en) * 2018-12-30 2019-04-23 湖南网数科技有限公司 A kind of participle mark quality evaluation system and appraisal procedure based on crowdsourcing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈信 (Chen Xin). Mining and Evaluation of Crowdsourced Test Reports (众包测试报告的挖掘与评估). China Doctoral Dissertations Full-text Database, Information Science and Technology Series, 2019, (02), Chapters 3-5. *

Also Published As

Publication number Publication date
CN110928764A (en) 2020-03-27

Similar Documents

Publication Publication Date Title
CN110928764B (en) Automated evaluation method for crowdsourcing test report of mobile application and computer storage medium
KR102026304B1 (en) Esg based enterprise assessment device and operating method thereof
Shepperd Software project economics: a roadmap
CN111738589B (en) Big data item workload assessment method, device and equipment based on content recommendation
CN111553127A (en) Multi-label text data feature selection method and device
CN111343147B (en) Network attack detection device and method based on deep learning
CN106529580A (en) EDSVM-based software defect data association classification method
KR20190110084A (en) Esg based enterprise assessment device and operating method thereof
CN109345050A (en) A kind of quantization transaction prediction technique, device and equipment
CN111446002A (en) Novel coronavirus patient state of illness classification system based on artificial intelligence
CN114139634A (en) Multi-label feature selection method based on paired label weights
CN111309577B (en) Spark-oriented batch application execution time prediction model construction method
Mohan et al. Visheshagya: Time based expertise model for bug report assignment
Tiruneh et al. Feature selection for construction organizational competencies impacting performance
CN113837578A (en) Gridding supervision and management evaluation method for power supervision enterprise
CN116610592A (en) Customizable software test evaluation method and system based on natural language processing technology
CN116521871A (en) File detection method and device, processor and electronic equipment
CN113792141B (en) Feature selection method based on covariance measurement factor
CN107291722B (en) Descriptor classification method and device
CN114936204A (en) Feature screening method and device, storage medium and electronic equipment
CN113901203A (en) Text classification method and device, electronic equipment and storage medium
CN112732549A (en) Test program classification method based on cluster analysis
Syafiandini et al. Classification of Indonesian Government Budget Appropriations or Outlays for Research and Development (GBAORD) using decision tree and naive bayes
CN112465009B (en) Method for positioning software crash fault position
CN116932487B (en) Quantized data analysis method and system based on data paragraph division

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant