CN114049475A - Image processing method and system based on automatic test framework of AI mobile terminal - Google Patents

Image processing method and system based on automatic test framework of AI mobile terminal

Info

Publication number
CN114049475A
CN114049475A
Authority
CN
China
Prior art keywords
image
classification
convolution
feature
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111001631.1A
Other languages
Chinese (zh)
Inventor
朱愚
沈余银
宋升�
黄信云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Chinamcloud Technology Co ltd
Original Assignee
Chengdu Chinamcloud Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Chinamcloud Technology Co ltd filed Critical Chengdu Chinamcloud Technology Co ltd
Priority to CN202111001631.1A priority Critical patent/CN114049475A/en
Publication of CN114049475A publication Critical patent/CN114049475A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/3668Software testing
    • G06F11/3672Test management
    • G06F11/3684Test management for test design, e.g. generating new test cases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an image processing method based on an AI mobile-terminal automated test framework, which comprises the following steps: dividing a page into several parts, extracting and processing the sub-elements and features of each part, cropping away from the page the parts whose similarity exceeds a threshold, generating a layout diagram, and fusing multiple layout diagrams; splitting the standard convolution of the MobileNetV2 classification model into a depthwise convolution and a pointwise convolution, then sequentially performing model structure calculation, memory-efficient setup, and ImageNet classification; inputting the grayscale image, the original image, and the contour map of an image simultaneously to predict classification results, the multiple outputs complementing one another; and inputting a screenshot of the picture and a data-image layout file to obtain an LSTM module, which is compiled into an automated test script. The invention greatly reduces the cost of writing and maintaining test cases, improves the cross-platform capability of the framework, and makes test cases more user-friendly.

Description

Image processing method and system based on automatic test framework of AI mobile terminal
Technical Field
The invention relates to the technical field of image processing, in particular to an image processing method and system based on an automatic test framework of an AI mobile terminal.
Background
With the continued growth of the Android and iOS platforms, the mobile operating-system market is dominated by Android and iOS, with Android holding a share of more than 80%. The many open-source automated testing frameworks and tools on the market each have a different emphasis, and all suffer to some degree from problems such as poor cross-platform capability, poor cross-application capability, stability that depends heavily on resource IDs, high cost of capturing controls, and frequent failures when dumping the system view tree.
The problems faced by mainstream open-source automated test frameworks and tools include the following. UIAutomator, for example, supports only Android 4.1 (API level 16) and above, does not support script recording, primarily supports Java, cannot obtain the current Activity or Instrumentation, and does not currently support WebViews; because the library supports only Java, it is difficult to mix with Cucumber, which uses Ruby, so to support a BDD framework it is recommended to use a Java BDD framework such as JBehave. Appium faces stability problems. Robotium cannot perform cross-platform or cross-application app testing. Espresso, compared with Robotium and UIAutomator, is smaller and simpler, has a more precise API, and makes test code easy to write and quick to run, but because it is based on Instrumentation it cannot test across apps.
Therefore, traditional frameworks suffer from problems such as strong dependence on the system view tree, tedious binding of resource IDs and view types, and high maintenance cost when IDs are obfuscated.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, provides an image processing method and system based on an automatic test framework of an AI mobile terminal, and solves the problems in the prior art.
The purpose of the invention is realized by the following technical scheme: an image processing method based on an AI mobile-terminal automated test framework, the method comprising the following steps:
A sub-element-similarity page-cutting step: dividing a page into several parts, extracting and processing the sub-elements and features of each part, cropping away from the page the parts whose similarity exceeds a threshold, generating a layout diagram, and fusing multiple layout diagrams;
A classification-model setup step: splitting the standard convolution of the MobileNetV2 classification model into a depthwise convolution and a pointwise convolution, thereby reducing the dimensionality of the activation space, and sequentially performing model structure calculation, memory-efficient setup, and ImageNet classification;
A multi-result combination step: inputting the grayscale image, the original image, and the contour map of an image simultaneously to predict classification results, the multiple outputs complementing one another;
A vector synthesis step: inputting a screenshot of a picture and a description file of the data-image layout, obtaining an LSTM module formed by connecting multiple feature vectors through a neural network, performing feature classification via softmax to predict the current item, and compiling the result into an automated test script.
The classification-model setup step specifically comprises:
selecting as the classification model a MobileNetV2 whose first layer is a standard convolution layer followed by 10 residual bottleneck layers; splitting the first-layer standard convolution of the MobileNetV2 classification model into a depthwise convolution layer and a pointwise convolution layer by modifying the classification selector, the depthwise convolution layer applying a different convolution kernel to each input channel, thereby reducing the dimensionality of the activation space;
balancing the computation and accuracy of the MobileNetV2 classification model with a width factor until the manifold of interest spans the entire space;
using ReLU6 as the nonlinear activation function and a 3×3 kernel as the standard convolution kernel size, and adding a dropout layer and a BN layer during training;
establishing a directed acyclic computation graph G with TensorFlow or Caffe, in which the edges of the graph represent specific operations and the nodes represent the computation of intermediate tensors;
setting both the decay rate and the momentum to 0.9, using batch normalization after every layer, setting the standard weight decay to 0.00004, the initial learning rate to 0.045, and the learning-rate decay to 0.98, with 16 GPUs working asynchronously and a batch size of 96.
The multi-result combination step specifically comprises:
converting the three-channel color image into a single-channel image to realize grayscale conversion;
splitting the color image into three single-channel images (R, G, B), modifying the three channels, and recombining the modified channels into a color image;
counting the frequency of each gray level in the histogram of the finally obtained color image, computing the cumulative normalized histogram, and recalculating the value of each pixel.
The vector synthesis step specifically includes:
inputting a screenshot of a picture and a data-image layout file containing the image's key view-node information;
for the GUI image input, generating a CNN vector through a convolutional neural network, and cutting the layout file into a sequence of items;
generating a feature vector for each item in the sequence through the ts module, connecting it with the previously obtained feature vectors to form an LSTM module, performing feature classification via softmax, and predicting the current item, which is compiled into an automated test script.
Generating a feature vector for each item in the sequence through the ts module comprises:
extracting SIFT features of the pictures through the BOVW algorithm, the number of SIFT features extracted from a single picture not being fixed;
traversing each SIFT feature image and extracting n D-dimensional feature vectors to obtain a local feature set F; clustering the feature set with a clustering algorithm to learn a visual dictionary of the chosen dimensionality, and using the visual dictionary to obtain a global feature map of the image;
describing the global feature map with the gradient vector of the likelihood function through the Fisher vector algorithm.
Describing the global feature map with the gradient vector of the likelihood function through the Fisher vector algorithm comprises the following steps:
A1, selecting the size of K in the GMM, solving the GMM with all features in the training picture set to obtain its parameters, and taking an image to be encoded and computing its feature set;
A2, using the GMM's prior parameters and the image's feature set, obtaining the FV according to step A1.
A system for the image processing method based on the AI mobile-terminal automated test framework comprises a sub-element-similarity page-cutting unit, a classification-model setup unit, a multi-result combination unit, and a vector synthesis unit;
the sub-element-similarity page-cutting unit is used for dividing a page into several parts, extracting and processing the sub-elements and features of each part, cropping away from the page the parts whose similarity exceeds a threshold, generating a layout diagram, and fusing multiple layout diagrams;
the classification-model setup unit is used for splitting the standard convolution of the MobileNetV2 classification model into a depthwise convolution and a pointwise convolution, thereby reducing the dimensionality of the activation space, and sequentially performing model structure calculation, memory-efficient setup, and ImageNet classification;
the multi-result combination unit is used for inputting the grayscale image, the original image, and the contour map of an image simultaneously, predicting classification results, and letting the multiple outputs complement one another;
the vector synthesis unit is used for inputting a screenshot of a picture and a description file of the data-image layout, obtaining an LSTM module formed from a set of feature vectors through a neural network, performing feature classification via softmax to predict the current item, and compiling the result into an automated test script.
The classification-model setup unit comprises a classification-model selection subunit, a model-structure calculation subunit, a memory-efficient setup subunit, and a classification subunit;
the classification-model selection subunit is used for selecting as the classification model a MobileNetV2 whose first layer is a standard convolution layer followed by 10 residual bottleneck layers, splitting the first-layer standard convolution of the MobileNetV2 classification model into a depthwise convolution layer and a pointwise convolution layer by modifying the classification selector, the depthwise layer applying a different convolution kernel to each input channel, thereby reducing the dimensionality of the activation space; and for balancing the computation and accuracy of the MobileNetV2 classification model with a width factor until the manifold of interest spans the entire space;
the model-structure calculation subunit is used for using ReLU6 as the nonlinear activation function and a 3×3 kernel as the standard convolution kernel size, and for adding a dropout layer and a BN layer during training;
the memory-efficient setup subunit is used for establishing a directed acyclic computation graph G with TensorFlow or Caffe, in which the edges of the graph represent specific operations and the nodes represent the computation of intermediate tensors;
the classification subunit is used for setting both the decay rate and the momentum to 0.9, using batch normalization after every layer, setting the standard weight decay to 0.00004, the initial learning rate to 0.045, and the learning-rate decay to 0.98, with 16 GPUs working asynchronously and a batch size of 96.
The invention has the following advantages: the image processing method and system based on an AI mobile-terminal automated test framework are easy to develop and maintain, highly stable, efficient to execute, cross-platform and cross-application, and support a WYSIWYG (what you see is what you get) mobile automated test framework for hybrid apps. The cost of writing and maintaining test cases is greatly reduced, the cross-platform capability of the framework is improved, and test cases become more user-friendly.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is an exemplary diagram of similarity clustering;
FIG. 3 is a diagram illustrating the effect of similarity clustering.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the detailed description of the embodiments of the present application provided below in connection with the appended drawings is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application. The invention is further described below with reference to the accompanying drawings.
As shown in FIG. 1, the present invention automatically cuts a test screenshot into multiple parts with an automatic image-cutting technique, marks the positions of text regions with OCR recognition, identifies the corresponding element positions with image recognition, generates a layout diagram, repeats this to produce several diagrams, and fuses them; the new list is then analyzed in detail, mainly for the accuracy of the layout after image fusion.
Firstly, the overall framework of the invention adopts the MobileNetV2 model, whose execution efficiency is high while still meeting the accuracy required by the business. In the training process for image cutting, the first issue is that training material is not evenly distributed; the invention therefore expands the data source of the whole UI mobile test framework by generating scripts and loading page information from more kinds of apps, achieving relatively balanced training material through screening. Secondly, to improve accuracy, the invention switches from the Top-Layer mode of transfer training to the Fine-tune mode, and raises overall accuracy by combining multiple images and multiple models.
Then the positions of text regions are marked with OCR recognition, and graying, binarization, and image noise reduction are performed in sequence;
wherein, graying: and (3) converting the picture into grey scale in a reverse running mode by adopting an RGB model, and realizing the grey scale mode of the picture according to an algorithm formula of Gray of 0.299R +0.587G +0.114B in the RGB model.
Binarization: most images on an app interface are color images, which carry a huge amount of information. An image can be simply divided into foreground and background; to let the computer recognize characters faster and better, the color image is first processed so that only its foreground and background information remain. A binarization threshold is found with the histogram method (also called the bimodal method), the foreground is defined as black and the background as white, and the binary image of the picture is obtained.
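The bimodal method can be sketched as follows; this is a simplified illustration (the peak-separation heuristic and function names are my own, not from the patent): take the two dominant histogram peaks as foreground and background modes and threshold at the valley between them.

```python
def bimodal_threshold(gray):
    """Pick a threshold at the valley between the two main histogram peaks."""
    hist = [0] * 256
    for v in gray:
        hist[v] += 1
    # The two highest, well-separated peaks stand in for foreground/background.
    p1 = max(range(256), key=lambda i: hist[i])
    p2 = max((i for i in range(256) if abs(i - p1) > 10), key=lambda i: hist[i])
    lo, hi = sorted((p1, p2))
    # The lowest bin between the peaks is taken as the binarization threshold.
    return min(range(lo, hi + 1), key=lambda i: hist[i])

def binarize(gray, t):
    # Foreground (dark text) -> 0 (black), background -> 255 (white).
    return [0 if v <= t else 255 for v in gray]
```

Real implementations usually smooth the histogram first (or use Otsu's method) so that minor bumps are not mistaken for peaks.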
Image noise reduction: to reduce noise in the digital image, when the binarized picture shows many small black dots, which are unnecessary information and would strongly affect the subsequent contour-cutting recognition, the picture is further denoised. In the w × h bitmap of the image data structure, all connected regions (pixels equal to 1, appearing black and connected) are found; an average value is computed over all connected regions, and any region far below the average is considered a noise point and replaced with 0, finally realizing the noise reduction of the image.
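A minimal sketch of that region-based denoising, assuming "far below the average" means a connected region whose size is under half the mean region size (the 0.5 factor and function name are my assumptions, not from the patent):

```python
from collections import deque

def remove_noise(bitmap):
    """Zero out small connected black regions (1s) in a w*h binary bitmap."""
    h, w = len(bitmap), len(bitmap[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for y in range(h):
        for x in range(w):
            if bitmap[y][x] == 1 and not seen[y][x]:
                # BFS flood fill to collect one 4-connected region.
                region, queue = [], deque([(y, x)])
                seen[y][x] = True
                while queue:
                    cy, cx = queue.popleft()
                    region.append((cy, cx))
                    for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                        if 0 <= ny < h and 0 <= nx < w and bitmap[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                regions.append(region)
    if regions:
        avg = sum(len(r) for r in regions) / len(regions)
        for region in regions:
            if len(region) < 0.5 * avg:  # far below average size -> noise point
                for y, x in region:
                    bitmap[y][x] = 0
    return bitmap
```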
Finally, a layout diagram is generated from the element positions corresponding to each of the steps above, and multiple layout diagrams are fused to ensure the accuracy of the fused layout, with detailed analysis performed on the new list. If the view contains a list or grid view, it is assembled into a list and a layout file is finally generated; an automated test script is then generated from the layout file, so that the mobile terminal can recognize the UI interface automatically and produce automated test code.
The method specifically comprises the following steps: 1. Cutting the page by sub-element similarity;
The page is divided into several parts, the sub-elements of each part are extracted, and the features of each element are processed; after processing, the similarity of the blocks is compared, and only blocks whose similarity exceeds 99% are cropped away from the page.
Similarity clustering: for two isolated pixels in an image, their colors correspond to different points, so the similarity of the two points is measured by color distance. The invention uses the RGB distance and a perceptually uniform color tool to obtain the color-space ratio between bright points, so cluster analysis can be performed on the grayscale image using luminance values, the difference between two pixels being measured by the difference of their luminance values. Similarity clustering improves the accuracy of cutting the layout diagram, and similarity analysis over several captured diagrams improves the cutting accuracy for buttons, elements, and the like of the test layout.
When two regions, a square and a diamond, have formed, whether the two regions merge is inferred from the line connecting the square and the diamond, as shown in FIG. 2. A threshold is again set: when the difference (i.e., dissimilarity) between two pixels is smaller than this value, the two are merged into one. The merging iterates until the regions are combined. In this way, clustering uses the information of different regions, and by changing the similarity metric, the clustering can be steered to produce the desired specific shape.
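The threshold-merge idea can be illustrated in one dimension; this sketch (names and the 1-D simplification are mine) merges adjacent pixels into one region while their brightness difference stays below the threshold:

```python
def cluster_by_dissimilarity(luminance, threshold):
    """Label a row of pixels: neighbours merge while their brightness
    difference (dissimilarity) is below the threshold."""
    if not luminance:
        return []
    labels, current = [0], 0
    for prev, cur in zip(luminance, luminance[1:]):
        if abs(cur - prev) >= threshold:
            current += 1  # difference too large: start a new region
        labels.append(current)
    return labels
```

The 2-D case merges adjacent regions instead of adjacent pixels, but the threshold test is the same.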
As shown in FIG. 3, a screenshot is cut into several blocks: tab, navigation, status bar, and so on. Deep-learning image classification is then used to classify and identify each block; after identification, the sub-elements in the corresponding block are extracted, their contents are extracted with AI techniques and filled into the sub-elements' attributes, finally yielding the structure of a secondary view tree on which the corresponding click operations can be performed.
2. Selecting a classification model;
Among the various classification models, some offer high performance but low accuracy and others low performance but high accuracy. According to the business needs of a practical app, MobileNetV2, which strikes a relative balance, is selected: compared with other classification models its execution efficiency is higher and its accuracy lower, yet it fully meets the current app's business requirements. For image recognition, an image is identified in roughly 200 ms to 300 ms, and accuracy can still exceed 99.2%.
Meanwhile, the standard convolution is split into two operations by changing the class selector: a depthwise convolution and a pointwise convolution. The depthwise convolution differs from the standard convolution in that the standard convolution applies each kernel across all input channels, whereas the depthwise convolution uses a different kernel for each input channel. Reducing a layer's dimensionality reduces the dimensionality of the activation space. MobileNetV2 trades off computational complexity against accuracy through a width factor until the manifold of interest (the data content we care about) spans the entire space. When the activation function is ReLU, the neural network acts as a linear classifier on the non-zero part of the output domain.
The width factor controls the dimensionality of the activation space until the manifold of interest spans the entire space. This relative compromise between computation and accuracy avoids spending a large amount of computation in pursuit of excessive accuracy while still ensuring that the accuracy meets the business requirement.
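The saving from splitting a standard convolution into depthwise plus pointwise parts can be made concrete with a parameter count; this is a generic sketch of the technique (function names mine), with the width factor loosely modeled as a scale on the channel counts:

```python
def standard_conv_params(k, c_in, c_out):
    # Every output channel convolves all input channels with its own k*k kernel.
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out, width=1.0):
    # Depthwise: one k*k kernel per input channel; pointwise: 1x1 across channels.
    # `width` is the width factor that shrinks channel counts to trade accuracy for cost.
    c_in, c_out = int(c_in * width), int(c_out * width)
    return k * k * c_in + c_in * c_out
```

For a 3×3 convolution from 32 to 64 channels, the separable form needs 2336 parameters against 18432 for the standard form, roughly an 8× reduction.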
Model structure calculation
The basic unit of MobileNetV2 is the bottleneck residual block; the first-layer convolution of MobileNetV2 is a standard convolution with 32 kernels, followed by 19 residual bottleneck layers. ReLU6 is used as the nonlinear activation function because it is more robust in low-precision computation; a 3×3 kernel is used as the standard convolution kernel size, and dropout and BN are added during training.
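ReLU6 itself is a one-liner; the clipping at 6 is what makes it robust in low-precision arithmetic, since activations are bounded:

```python
def relu6(x):
    # Linear on [0, 6], clipped outside; the hard upper bound keeps
    # activations representable in low-precision formats.
    return min(max(0.0, x), 6.0)
```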
Memory efficient reasoning
For mobile applications, it is very important that the inverted residual bottleneck layer allows a particularly memory-efficient execution. The standard approach to efficient inference is to build a directed acyclic computation graph G with TensorFlow or Caffe, in which the edges represent specific operations and the nodes represent the computation of intermediate tensors; the computation is performed sequentially so as to minimize the number of tensors that must be stored in memory. Typically, all reasonable computation orders are searched and the smallest one is picked.
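The "search all reasonable calculation orders" step can be sketched as a brute-force search over topological orders of a tiny graph; this is a toy illustration (graph representation and names are my own, and real frameworks use far cheaper heuristics):

```python
from itertools import permutations

def peak_memory(order, consumers, size):
    # A tensor stays live from the step that produces it until all of its
    # consumers have executed; tensors with no consumers stay live to the end.
    executed, peak = set(), 0
    for node in order:
        executed.add(node)
        live = sum(size[t] for t in executed
                   if not consumers.get(t) or any(c not in executed for c in consumers[t]))
        peak = max(peak, live)
    return peak

def best_order(nodes, consumers, size):
    # Enumerate all topological orders and keep the one with the lowest peak memory.
    def is_topological(order):
        pos = {n: i for i, n in enumerate(order)}
        return all(pos[t] < pos[c] for t, cs in consumers.items() for c in cs)
    return min((o for o in permutations(nodes) if is_topological(o)),
               key=lambda o: peak_memory(o, consumers, size))
```

On a diamond graph a→{b,c}→d where a's output is large, scheduling b before c frees a earlier and lowers the peak.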
ImageNet classification
Training details: we use the TensorFlow framework and the standard RMSProp optimization method, setting both the decay rate and the momentum to 0.9; batch normalization follows every layer; the standard weight decay is set to 0.00004; the initial learning rate of 0.045 is the same as in V1, with a learning-rate decay of 0.98; 16 GPUs work asynchronously with a batch size of 96.
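The RMSProp update and the exponential learning-rate schedule implied by those numbers can be sketched for a single scalar weight (a simplified illustration without the momentum term; names mine):

```python
def rmsprop_step(w, grad, cache, lr, decay=0.9, eps=1e-8):
    # Keep a decaying average of squared gradients; scale the step by its root.
    cache = decay * cache + (1.0 - decay) * grad * grad
    return w - lr * grad / (cache ** 0.5 + eps), cache

def learning_rate(epoch, initial=0.045, decay=0.98):
    # Exponential decay per epoch, matching the 0.045 / 0.98 values above.
    return initial * decay ** epoch
```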
3. Combining multiple results;
With this framework (the TensorFlow framework), the original image from the app side can be input and an output result obtained from the framework's computation. To make the result more accurate and improve its precision, three images can be generated — the grayscale image, the original image, and the contour map — and input simultaneously as separate classification predictions; the outputs of the multiple results complement one another, clearly improving the overall effect of the device.
B1, converting the three-channel color image into a single-channel image to realize image gray scale conversion;
The specific formulas are as follows: from B2 to B1, GRAY = B × 0.114 + G × 0.587 + R × 0.299; from B1 to B2, R = G = B = GRAY.
B2, splitting the color image into three single-channel images (R, G, B), modifying the RGB values of the three channels, and recombining the modified channels into a color image;
B3, counting the frequency of each gray level in the histogram of the finally obtained color image, computing the cumulative normalized histogram, and recalculating the value of each pixel.
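Step B3 is classic histogram equalization; a minimal plain-Python sketch (function name mine) of "cumulative normalized histogram, then remap each pixel":

```python
def equalize(gray):
    """Histogram-equalize a list of 8-bit gray values."""
    n = len(gray)
    hist = [0] * 256
    for v in gray:
        hist[v] += 1          # frequency of each gray level
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running / n)  # cumulative normalized histogram
    # Recalculate each pixel through the CDF, stretching contrast to [0, 255].
    return [round(255 * cdf[v]) for v in gray]
```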
4. Vector synthesis;
There are two core inputs: a screenshot of the picture, and a data-image layout file containing the view-node information of the image's keys. For the GUI image input, a CNN vector is generated through a convolutional neural network; the layout file is then cut into a sequence of items, and each item of the sequence also generates a feature vector through the ts module (the name of the service module that turns the sequence into feature vectors). Each feature vector is aggregated with the preceding feature vectors to form an LSTM module, feature classification is performed via softmax, the current item is predicted, and finally an automated test script is compiled.
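The softmax classification step at the end of that pipeline is standard; a self-contained sketch with the usual max-subtraction for numerical stability:

```python
import math

def softmax(z):
    """Turn a vector of scores into a probability distribution."""
    m = max(z)                       # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]
```

The predicted "current item" is simply the index of the largest probability.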
Specifically, local features (SIFT, SURF) of the image are extracted. Since the number of local features is generally large, they need to be aggregated into a single vector to facilitate the subsequent indexing and retrieval. This is implemented with three algorithms: BOVW, VLAD, and FV.
Further, BOVW represents an image with a set of features composed of keypoints and descriptors. Keypoints are the "salient" points of an image, which remain the same whether the image is rotated or scaled.
The extracted local features must be highly discriminative; SIFT features are generally used to meet the requirements of rotation invariance and scale invariance, and the number of SIFT features extracted from a single picture is not fixed.
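The BOVW aggregation itself, once a visual dictionary (set of cluster centers) exists, reduces to nearest-word assignment plus a histogram; a sketch with toy 2-D descriptors standing in for SIFT vectors (names mine):

```python
def nearest_word(desc, words):
    # Index of the dictionary word (cluster center) closest to the descriptor.
    return min(range(len(words)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(desc, words[i])))

def bovw_histogram(descriptors, words):
    """Variable-length descriptor set -> fixed-length histogram over visual words."""
    hist = [0] * len(words)
    for d in descriptors:
        hist[nearest_word(d, words)] += 1
    return hist
```

This is what makes the non-fixed number of SIFT features per picture usable: every image maps to a vector of the same length as the dictionary.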
The flow of the VLAD includes:
and traversing each image (the extracted screenshot in the app client is used for cutting the image set), extracting n D-dimensional feature vectors (the SIFT feature values are used for distinguishing each image differently), and finally obtaining a local feature set F.
The visual dictionary with the dimensionality is obtained by clustering the learned visual dictionary by using a feature set F obtained by a clustering algorithm (K-Means and the like).
A corresponding global feature map is then formed through the visual dictionary. All feature residuals of each cluster center are accumulated, finally yielding K global features. These K global features express the distribution of the local features within each cluster: by erasing the feature distribution differences of the image itself, only the differences between the local features and their cluster centers are kept.
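The VLAD flow above can be sketched in a few lines; the descriptor dimensions and cluster count are illustrative assumptions, and in the described method the features would be SIFT descriptors and the centers would come from K-Means.

```python
import numpy as np

def vlad_encode(features, centers):
    """VLAD encoding: accumulate the residuals of each local feature
    against its nearest cluster center, then L2-normalise.

    features : (n, d) local descriptors of one image
    centers  : (K, d) visual dictionary (e.g. from K-Means)
    returns  : (K*d,) global feature vector
    """
    K, d = centers.shape
    # assign every local feature to its nearest visual word
    dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    v = np.zeros((K, d))
    for k in range(K):
        if np.any(assign == k):
            # accumulate residuals to center k
            v[k] = (features[assign == k] - centers[k]).sum(axis=0)
    v = v.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

rng = np.random.default_rng(1)
feats = rng.normal(size=(50, 8))    # e.g. 50 SIFT-like descriptors
centers = rng.normal(size=(4, 8))   # K = 4 visual words
g = vlad_encode(feats, centers)
```

Note that the result has fixed length K*d regardless of how many local features the picture produced, which is exactly why the aggregation step is needed.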
The specific process of FV (Fisher Vector) includes:
An image is essentially represented by the gradient vector of a likelihood function. The physical meaning of this gradient vector is the direction in which the model parameters should change to better fit the data, i.e. parameter tuning during data fitting. The core steps are: (1) choose the number of components m of the GMM (Gaussian mixture model); (2) solve the GMM (for example by the EM algorithm) using all features (or a subset of them) of a training picture set (the image set obtained by cutting the screenshots extracted from the app client) to obtain its parameters; (3) take an image to be encoded and compute its feature set; (4) obtain the FV of the image from the learned GMM parameters and the image's feature set. Encoding with the Fisher vector raises the dimensionality of the image features, so that the image can be described better.
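Step (4) can be sketched as follows for the common simplification that keeps only the gradient with respect to the GMM means (diagonal covariances). The GMM parameters here are random stand-ins; in the described method they would come from fitting the GMM on the training picture set in step (2).

```python
import numpy as np

def fisher_vector_means(X, weights, means, sigmas):
    """Fisher vector restricted to the gradient w.r.t. the GMM means.

    X       : (n, d) local features of the image to encode
    weights : (m,)   GMM mixture weights
    means   : (m, d) GMM means
    sigmas  : (m, d) GMM standard deviations (diagonal covariance)
    returns : (m*d,) Fisher vector
    """
    n, d = X.shape
    m = weights.shape[0]
    # log posterior of each feature under each component, up to a
    # constant that cancels in the normalisation below
    log_p = np.stack([
        -0.5 * (((X - means[k]) / sigmas[k]) ** 2).sum(axis=1)
        - np.log(sigmas[k]).sum() + np.log(weights[k])
        for k in range(m)
    ], axis=1)                                 # (n, m)
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)  # soft assignments
    # accumulate the normalised mean-gradient per component
    fv = np.concatenate([
        (gamma[:, [k]] * (X - means[k]) / sigmas[k]).sum(axis=0)
        / (n * np.sqrt(weights[k]))
        for k in range(m)
    ])
    return fv

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))          # feature set of the image to encode
w = np.array([0.5, 0.5])              # m = 2 mixture components
mu = rng.normal(size=(2, 4))
sd = np.ones((2, 4))
fv = fisher_vector_means(X, w, mu, sd)
```

The output has length m*d, illustrating the text's point that FV encoding raises the dimensionality of the image representation.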
Finally, a series of local features of the picture is obtained through feature encodings built on feature descriptors, using algorithms such as BOVW, VLAD, and FV; the resulting feature vectors are aggregated with the preceding feature vectors to form the input of the LSTM module, which is then compiled into an automated test script.
The foregoing is illustrative of the preferred embodiments of this invention; it is to be understood that the invention is not limited to the precise form disclosed herein, and that various other combinations, modifications, and environments may be resorted to within the scope of the inventive concept described herein, whether taught above or apparent to those skilled in the relevant art. Modifications and variations effected by those skilled in the art without departing from the spirit and scope of the invention fall within the protection of the appended claims.

Claims (8)

1. An image processing method based on an automated test framework for an AI mobile terminal, characterized in that the method comprises the following steps:
a sub-element-similarity page cutting step: dividing a page into a plurality of parts, extracting and processing the sub-elements and features of each part, removing parts whose similarity exceeds a threshold by cutting, generating a layout drawing, and fusing a plurality of layout drawings;
a classification model setting step: splitting the standard convolution of a MobileNetV2 classification model into a depthwise convolution and a pointwise convolution, thereby reducing the dimensionality of the activation space, and sequentially performing model structure calculation, memory-efficient setting, and ImageNet classification;
a multi-result combination step: simultaneously inputting the grayscale image, the original image, and the contour map of an image to predict the classification result, and complementing the plurality of output results with one another;
a vector synthesis step: inputting a screenshot of a picture and a data image layout file, obtaining an LSTM module formed by aggregating a plurality of feature vectors through a neural network, performing feature classification through softmax to predict the current item, and compiling the result into an automated test script.
2. The image processing method based on the AI mobile terminal automated test framework as recited in claim 1, wherein the classification model setting step specifically comprises:
selecting as the classification model a MobileNetV2 whose first layer is a standard convolution layer followed by 10 residual bottleneck layers; splitting the first-layer standard convolution of the MobileNetV2 classification model into a depthwise convolution layer and a pointwise convolution layer by modifying the classification selector, the depthwise convolution layer applying a different convolution kernel to each input channel, thereby reducing the dimensionality of the activation space;
balancing the computation cost and the accuracy of the MobileNetV2 classification model with a width factor until the manifold of interest spans the whole space;
using ReLU6 as the nonlinear activation function, using a 3x3 kernel as the size of the standard convolution kernel, and adding a dropout layer and a BN layer during training;
establishing a directed acyclic computation graph G through TensorFlow or Caffe, wherein the edges of the computation graph represent specific operations and the nodes represent the computation of intermediate tensors;
setting both the decay rate and the momentum to 0.9, using batch normalization after each layer, setting the standard weight decay to 0.00004, the initial learning rate to 0.045, and the learning rate decay to 0.98, with 16 GPUs working asynchronously and a batch size of 96.
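The depthwise/pointwise split and the ReLU6 activation recited in claim 2 can be illustrated as follows. This is a sketch of the general MobileNet-style factorization, not the claimed implementation; the channel sizes are arbitrary example values.

```python
import numpy as np

def relu6(x):
    # ReLU6 nonlinearity used by MobileNetV2: min(max(x, 0), 6)
    return np.minimum(np.maximum(x, 0.0), 6.0)

def conv_params(c_in, c_out, k=3):
    """Parameter counts (ignoring bias) of a standard k*k convolution
    versus its depthwise + pointwise split, showing why the split is
    cheaper."""
    standard = k * k * c_in * c_out
    depthwise = k * k * c_in       # one k*k kernel per input channel
    pointwise = c_in * c_out       # 1x1 convolution mixing channels
    return standard, depthwise + pointwise

# example channel sizes (not taken from the patent)
std, split = conv_params(32, 64)
acts = relu6(np.array([-1.0, 3.0, 9.0]))
```

For 32 input and 64 output channels with a 3x3 kernel, the split uses 2336 parameters instead of 18432, roughly an 8x reduction, which is the efficiency motivation behind the depthwise/pointwise factorization.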
3. The image processing method based on the AI mobile terminal automated test framework as recited in claim 1, wherein the multi-result combination step specifically comprises:
converting the three-channel color image into a single-channel image to realize image grayscale conversion;
splitting the color image into three single-channel images R, G, and B, modifying the three channels, and recombining the modified channels into a color image;
counting the occurrence frequency of each gray level in the histogram of the finally obtained color image, calculating the cumulative normalized histogram, and recalculating the pixel values of the pixels.
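The grayscale conversion and histogram steps of claim 3 amount to standard histogram equalisation, which can be sketched as below; the luminance weights and image size are conventional illustrative choices, not values from the patent.

```python
import numpy as np

def to_gray(rgb):
    """Convert an H*W*3 uint8 color image to a single-channel
    grayscale image with the usual luminance weights."""
    return (rgb @ np.array([0.299, 0.587, 0.114])).astype(np.uint8)

def equalize(gray):
    """Histogram equalisation: count the occurrences of each gray
    level, build the cumulative normalised histogram, and remap
    every pixel value through the resulting lookup table."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum() / gray.size        # cumulative normalised histogram
    lut = np.round(255 * cdf).astype(np.uint8)
    return lut[gray]                       # recalculate pixel values

rng = np.random.default_rng(3)
img = rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8)  # toy image
eq = equalize(to_gray(img))
```

Equalisation stretches the gray-level distribution toward uniform, which tends to sharpen contours and is why the method feeds the grayscale and contour variants alongside the original image.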
4. The image processing method based on the AI mobile terminal automated test framework as recited in claim 1, wherein the vector synthesis step specifically comprises:
inputting a screenshot of a picture and a data image layout file containing the view node information of the image keys;
for the GUI image input, generating a CNN feature vector through the key neural network, and cutting the layout file into a sequence of items;
generating a feature vector for each item in the sequence through the ts module, aggregating it with the previously obtained feature vectors to form the input of an LSTM module, performing feature classification through softmax, predicting the current item, and compiling the result into an automated test script.
5. The image processing method based on the AI mobile terminal automated test framework as recited in claim 4, wherein generating a feature vector for each item in the sequence through the ts module comprises:
extracting the SIFT features of the pictures through the BOVW algorithm, wherein the number of SIFT features extracted from a single picture is not fixed;
traversing each SIFT feature image, extracting n D-dimensional feature vectors to obtain a local feature set F, clustering the feature set with a clustering algorithm to learn a visual dictionary, and obtaining the global feature map of the image with the visual dictionary;
describing the global feature map by the gradient vector of the likelihood function through the Fisher vector algorithm.
6. The image processing method based on the AI mobile terminal automated test framework as recited in claim 5, wherein describing the global feature map by the gradient vector of the likelihood function through the Fisher vector algorithm comprises:
A1, selecting the number of components K of the GMM, solving the GMM with all the features of the training picture set to obtain its parameters, and taking an image to be encoded to obtain its feature set;
A2, obtaining the FV according to step A1 from the learned parameters of the GMM and the feature set of the image.
7. A system for the image processing method based on the AI mobile terminal automated test framework as set forth in any one of claims 1-5, characterized in that the system comprises a sub-element-similarity page cutting unit, a classification model setting unit, a multi-result combination unit, and a vector synthesis unit;
the sub-element-similarity page cutting unit is used for dividing a page into a plurality of parts, extracting and processing the sub-elements and features of each part, removing parts whose similarity exceeds a threshold by cutting, generating a layout drawing, and fusing a plurality of layout drawings;
the classification model setting unit is used for splitting the standard convolution of the MobileNetV2 classification model into a depthwise convolution and a pointwise convolution, thereby reducing the dimensionality of the activation space, and for sequentially performing model structure calculation, memory-efficient setting, and ImageNet classification;
the multi-result combination unit is used for simultaneously inputting the grayscale image, the original image, and the contour map of an image, predicting the classification results, and complementing the plurality of output results with one another;
the vector synthesis unit is used for inputting a screenshot of a picture and a data image layout file, obtaining an LSTM module formed by aggregating a plurality of feature vectors through a neural network, performing feature classification through softmax to predict the current item, and compiling the result into an automated test script.
8. The system of claim 7, wherein the classification model setting unit comprises a classification model selection subunit, a model structure calculation subunit, a memory-efficient setting subunit, and a classification subunit;
the classification model selection subunit is used for selecting as the classification model a MobileNetV2 whose first layer is a standard convolution layer followed by 10 residual bottleneck layers, splitting the first-layer standard convolution of the MobileNetV2 classification model into a depthwise convolution layer and a pointwise convolution layer by modifying the classification selector, the depthwise convolution layer applying a different convolution kernel to each input channel, thereby reducing the dimensionality of the activation space, and balancing the computation cost and the accuracy of the MobileNetV2 classification model with a width factor until the manifold of interest spans the whole space;
the model structure calculation subunit is used for using ReLU6 as the nonlinear activation function, using a 3x3 kernel as the size of the standard convolution kernel, and adding a dropout layer and a BN layer during training;
the memory-efficient setting subunit is used for establishing a directed acyclic computation graph G through TensorFlow or Caffe, wherein the edges of the computation graph represent specific operations and the nodes represent the computation of intermediate tensors;
the classification subunit is used for setting both the decay rate and the momentum to 0.9, using batch normalization after each layer, setting the standard weight decay to 0.00004, the initial learning rate to 0.045, and the learning rate decay to 0.98, with 16 GPUs working asynchronously and a batch size of 96.
CN202111001631.1A 2021-08-30 2021-08-30 Image processing method and system based on automatic test framework of AI mobile terminal Pending CN114049475A (en)


Publications (1)

Publication Number Publication Date
CN114049475A 2022-02-15



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination