CN113610862B - Screen content image quality assessment method

Screen content image quality assessment method

Info

Publication number: CN113610862B
Authority: CN (China)
Prior art keywords: image, region, text, screen content, value
Legal status: Active (granted)
Application number: CN202110831904.9A
Other languages: Chinese (zh)
Other versions: CN113610862A (en)
Inventor
王同罕
廖静
何月顺
周书民
徐洪珍
李祥
何剑锋
贾惠珍
李广
Current Assignee: East China Institute of Technology
Original Assignee: East China Institute of Technology
Priority date / filing date: 2021-07-22
Application filed by East China Institute of Technology
Priority to CN202110831904.9A
Publication of CN113610862A (application): 2021-11-05
Application granted; publication of CN113610862B: 2023-08-01

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/136 Segmentation; Edge detection involving thresholding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30168 Image quality inspection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30 Computing systems specially adapted for manufacturing


Abstract

The invention relates to the technical field of image processing and discloses a screen content image quality assessment method comprising the following steps: segmenting the screen content image into a text region and an image region; extracting texture features and image structural features from the image region; extracting sharpness and text structural features from the text region; inputting the texture features, image structural features, sharpness, text structural features, and subjective quality scores of screen content images into LIBSVM software for training to obtain a quality assessment model; and inputting a screen content image to be assessed, which, after processing, is fed into the quality assessment model to obtain its quality score. The invention can perceive image quality, can dynamically monitor and adjust an image processing system to output high-quality images according to the image's quality score, and provides a more effective basis for parameter optimization of real-time client communication systems.

Description

Screen content image quality assessment method
Technical Field
The invention relates to the technical field of image processing, in particular to a screen content image quality assessment method.
Background
At present, with the development of the internet and multimedia technologies, real-time image communication systems and screen-sharing technologies have matured, so the internet is filled with large numbers of screen content images, and how to evaluate their quality has become a troublesome problem. Quality assessment of screen content images plays a large role in image communication and real-time multi-client communication systems: an assessment algorithm can measure the quality of the current image so that the parameters of the image transmission system can be optimized to improve performance. To agree better with human visual perception, screen content image assessment methods extract quality-related features from the image. Feature extraction methods fall into two main classes: traditional hand-crafted extraction, in which several features (such as natural scene statistics and edge structure features) are computed from the image according to prior knowledge, and deep-learning-based methods, which obtain effective quality features automatically through training and derive quality scores from them. The feature extraction method determines, to some extent, the efficiency and time complexity of the algorithm. Assessment methods are also divided into three types according to whether an original reference image is available: full-reference, reduced-reference, and no-reference quality assessment methods; the type of method constrains the application scenarios. A screen content image comprises text and image regions, but conventional assessment methods mostly start from the image as a whole and do not consider the large differences in how different regions are perceived by the human eye.
Several methods exist. The first is a stacked autoencoder (SAE, Stacked AutoEncoders) method based on graphic and text regions. It uses a fast document layout analysis algorithm based on a convolutional neural network (CNN, Convolutional Neural Networks) to divide the content of the image into blocks, inputs them to a 1-D CNN model that classifies them as text, table, or picture, and then extracts quality-aware features from the text regions and image regions separately; two different SAEs are trained by an unsupervised method to extract the quality-aware features of these two kinds of regions, the features and their corresponding subjective scores are input into two regressors, each regressor produces one predicted score, and a weighted model finally combines the two predicted scores into the final perceived quality score of the test SCI. Using a CNN model to classify the image content greatly increases the complexity of the algorithm, and the human eye is mainly interested in text regions and image regions; classifying the content into text, table, and picture classes only to merge them back into text and image regions makes the classification step somewhat superfluous.
Second, the CNN-SQE method improves quality prediction performance by a fuzzy classification of the screen content image into plain-text, computer-graphics/cartoon, and natural-image regions and by estimating quality on each type of region separately. It operates in three stages: (1) image segmentation; (2) quality assessment of each segmented region; (3) quality combination. The method distinguishes computer-graphics/cartoon, plain-text, and natural-image regions, but all content of a screen content image is digitally generated by a computer, so such fine-grained classification is unnecessary.
Third, a method based on structural features and uncertainty weighting (SFUW, Structure Features And Uncertainty Weighting) first divides the screen content image (SCI, Screen Content Image) into text and image regions, then extracts gradient information of the text region as structural features together with luminance features and obtains the visual quality of the text region by computing the structural similarity of image blocks; the visual quality of the text and image regions is then fused by an uncertainty weighting method based on perceptual theory to obtain the final quality score. This method requires the original reference image, which is difficult to obtain in practice, so the algorithm is limited in application; how the weights should be set must also be considered.
In summary, because information storage and transmission are constrained by the transmission equipment and by external electronic interference, transmitted images are contaminated to some degree, so in most cases an undamaged original reference image cannot be obtained, and full-reference methods are therefore limited in application. Meanwhile, a screen content image consists mainly of text and image regions, and dividing it into too many region types tends to increase the complexity of the algorithm and reduce computational efficiency, so the choice of image segmentation method is particularly important.
Disclosure of Invention
To remedy the above drawbacks of the prior art, an object of the present invention is to provide a screen content image quality assessment method. The method first segments the screen content image into text and image regions, extracts feature vectors that characterize image quality according to the characteristics of each region, and finally trains a predictive model from quality-aware features to SCI visual quality using support vector regression (SVR, Support Vector Regression) with a radial basis function (RBF, Radial Basis Function) kernel. This efficient and convenient assessment algorithm can perceive image quality, can dynamically monitor and adjust an image processing system to output high-quality images according to the predicted quality score, and provides a more effective basis for parameter optimization of real-time client communication systems.
In order to achieve the above purpose, the present invention provides the following technical solutions: a screen content image quality assessment method comprising the steps of:
(1) Constructing a screen content image database;
(2) Executing a text segmentation function on the screen content image, and segmenting the screen content image into a text region and an image region;
(3) Executing an image area quality evaluation function on the image area, and extracting texture features and image structure features of the image area;
(4) Executing a text region quality assessment function on the text region, and extracting the sharpness and text structural features of the text region;
(5) Randomly selecting screen content images from the screen content image database one thousand times, and inputting the texture features, image structural features, sharpness, text structural features, and subjective quality scores of the selected screen content images into LIBSVM software for training to obtain a quality assessment model;
(6) Inputting a screen content image to be assessed, processing it through steps (2), (3), and (4), and inputting its texture features, image structural features, sharpness, and text structural features into the quality assessment model to obtain the quality score.
Further, step (2) specifically comprises the following steps: first, a first threshold is dynamically set and all maximally stable extremal regions are found by

$$v(i)=\frac{|Q_{i+\Delta}\setminus Q_{i-\Delta}|}{|Q_i|},$$

where $Q_i$ denotes a connected region when the first threshold is $i$, $\Delta$ denotes a small change of the first threshold, and $v(i)$ is the rate of change of region $Q_i$ when the first threshold is $i$; if $v(i)$ is smaller than the given first threshold, region $Q_i$ is taken to be a maximally stable extremal region. Secondly, the following are set: a second threshold on the eccentricity of the ellipse having the same standard second-order central moments as the region, a third threshold on the Euler number, fourth and fifth thresholds on the ratio of the number of pixels in the region to the total pixels of its bounding box, and a sixth threshold on the proportion of the region's pixels within its convex hull. These quantities are computed for each maximally stable extremal region, and a first text region is determined when the computed eccentricity is greater than the second threshold, the Euler number is smaller than the third threshold, the ratio of region pixels to bounding-box pixels is lower than the fourth threshold or greater than the fifth threshold, and the convex-hull pixel proportion is below the sixth threshold. Then a seventh threshold on the stroke-width variation rate is set and the stroke-width variation rate of each first text region is computed; when the variation rate is greater than the seventh threshold, a second text region is confirmed. Finally, the second text regions of all maximally stable extremal regions are extracted and merged as the text region, and the remaining regions of the screen content image are merged as the image region.
Further, the step (3) specifically includes the following steps:
s1: texture features of an image region are extracted by, firstThe Scharr operator calculates a gradient map g (i, j) of the image region and normalizes the gradient map:wherein [ among others ]]To get the whole operation, g max The maximum value of the original gradient value is L, and the normalized maximum gray level number is L; then, the gray scale f (i, j) of the image area is normalized: />f max Is the gray maximum value in the original gray map; then, a gray-gradient co-occurrence matrix M is constructed, the horizontal increment is a gradient value, the vertical increment is a gray value, and the origin is positioned at the sitting vertex of the matrix. M is defined as M (i, j) = # g (M, N) = i, f (M, N) = j, m=0, 1, 2..m-1, n=0, 1, 2..n-1 }, where M x N is the size of the gradient and gray map, # { } is expressed as the number of elements in the set, and finally extracting the statistical features of the gray-gradient co-occurrence matrix comprises: gradient entropy->Gray entropyEnergy->Gray scale mean valueGradient mean->Standard deviation of gradientGray standard deviation->As texture features of image areas, wherein the total number of occurrences of (i, j) is normalizedThe probability of occurrence P (i, j);
s2: extracting image structural features of an image area, firstly, partitioning the image area into n multiplied by n partial image blocks with equal size, performing partial two-dimensional discrete cosine transform on each image block to obtain DCT coefficients, then fitting the DCT coefficients by using a generalized Gaussian distribution model, obtaining shape parameters gamma of the image block after fitting, taking the average value of the gamma values of the lowest 10% as a first structural feature, taking the average value of all gamma values as a second structural feature, and then calculating frequency change coefficientsWhere σ|X| is the variance of the block, μ|X| is the mean of the block, taken +.>As a third structural feature, the maximum 10% of the mean value of>As a fourth structural feature, then, to acquire direction information from the partial image block, the DCT coefficient block is divided into low, medium, and high 3 frequency bands, and then the average energy in each frequency band is calculated: />Wherein n is a positive integer, sigma n For the variance of band n, the ratio of sub-band energies is calculated: />R is taken n Taking the highest 10% of the mean value of R as the fifth structural feature n As a sixth structural feature, finally, in order to extract the direction information, the DCT coefficients are divided into 3 parts in 3 directions according to the vertical direction of the radial frequency variation, and then the frequency variation coefficient +_ in 3 parts is calculated>Calculate->The variance of (2), the mean of the highest 10% of the differences was taken as the seventh structural feature, and +.>As an eighth structural feature.
Further, the step (4) specifically includes the following steps:
s1: extracting the definition of a text region, firstly, filtering in x and y directions, normalizing the filtered image compared with the maximum value in the filtered image, and when the normalized pixel point value is greater than a preset threshold value (such as 0.0001), taking the pixel point as a possible edge pixel, and then calculating the difference delta DoM of median filtering image difference in the horizontal direction and the vertical direction respectively, wherein the horizontal direction is as follows: ΔDoM x (i,j)=[I M (i+2,j)-I M (i,j)]-[I M (i,j)-I M (i-2,j)]Vertical direction: ΔDoM y (i,j)=[I M (i,j+2)-I M (i,j)]-[I M (i,j)-I M (i,j-2)]Wherein I M (i, j) is the gray value of the median filtered image at pixel (i, j), using the difference of deviation 2, the sharpness in the x-direction at pixel (i, j) is defined as:the same applies to the definition calculation in the y-direction, wherein Σ i-w≤k≤i+w |ΔDoM x (k, j) indicates summing ΔDoM over a window of size 2w+1, normalizing contrast at edges, Σ i-w≤k≤i+w I (k, j) -I (k-1, j) is the contrast at a window size of 2w+1, when S x If (i, j) is greater than the preset threshold, the pixel point at (i, j) is clear, and finally, the definition of the image of the region is defined as:wherein: /># sharpPixels is the number of clear pixels and# edgePixels is the number of edge pixels;
s2: extracting text structural features of a text region, firstly, calculating a gradient map GM of the text region, and calculating gradients at image pixels (i, j) as follows:wherein the method comprises the steps of h represents the gradient operator and,representing a convolution operation. A local binary pattern LBP of rotation invariance is then calculated on the gradient map GM,where delta represents a unified metric, U represents the number of neighboring pixels, S represents the radius value of the field, ρ is defined as a threshold function,G k ,G C the GM values, expressed as center coordinates and their neighborhoods, are then computed, and it is observed that GMLBP may contain u+2 different modes, which may be combined into one bin of the histogram, set U to 8, so that the histogram has 10 bins in total, and are computed separately at three scales, the original image, the downsampled image with a downsampling factor of 2, and the downsampled image with a downsampling factor of 4, so that 30 text structural features are extracted in total.
The beneficial effects of the invention are as follows: a segmentation method based on digital text divides the screen content image into a text region and an image region; for the text region, the sharpness of the region and the text structural features of the gradient domain are extracted as its feature vector; for the image region, statistical features extracted from the gray-gradient co-occurrence matrix serve as texture features and structural features extracted in the DCT domain characterize the region's feature vector; and a regression model is trained by the SVM method to obtain more accurate quality scores. The method can assess screen content images on the internet quickly and efficiently and, according to the assessed quality, provides a more reliable basis for subsequent directions such as image quality optimization, denoising, and fusion.
Drawings
Fig. 1 is a screen content image evaluation flow of a screen content image quality evaluation method of the present invention.
Fig. 2 is a text segmentation flow of a screen content image quality evaluation method according to the present invention.
Fig. 3 is a block frequency division diagram of DCT coefficients in a screen content image quality evaluation method according to the present invention.
Fig. 4 is a graph showing the division of DCT coefficients according to radial frequency variation in a screen content image quality evaluation method according to the present invention.
Detailed Description
In order that the above-recited objects, features, and advantages of the present invention may be more clearly understood, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments. In the following description, numerous specific details are set forth to facilitate a thorough understanding of the present invention; however, the invention may be practiced in other ways than those described herein, and it is therefore not limited to the specific embodiments disclosed below.
As shown in fig. 1-2, a screen content image quality evaluation method includes the steps of:
(1) Constructing a screen content image database, installing MATLAB R2016a, and creating a MATLAB (.m) test file;
(2) Performing a text segmentation function on the screen content image. To segment text regions effectively, a text localization method for born-digital (BD) text is used: this method performs text extraction on born-digital text images, and the text in screen content images is mostly born-digital text, which differs substantially from document text and scene text, so using this method makes text-region localization more targeted and more accurate. First, a first threshold is dynamically set and all maximally stable extremal regions (MSER, Maximally Stable Extremal Regions) are found by

$$v(i)=\frac{|Q_{i+\Delta}\setminus Q_{i-\Delta}|}{|Q_i|},$$

where $Q_i$ denotes a connected region when the first threshold is $i$, $\Delta$ denotes a small change of the first threshold, and $v(i)$ is the rate of change of region $Q_i$ when the first threshold is $i$; if $v(i)$ is smaller than the given first threshold, region $Q_i$ is taken to be a maximally stable extremal region. Secondly, the second threshold on the eccentricity of the ellipse having the same standard second-order central moments as the region is set to 0.995, the third threshold on the Euler number to -4, the fourth and fifth thresholds on the ratio of the number of pixels in the region to the total pixels of its bounding box to 0.2 and 0.9, and the sixth threshold on the proportion of the region's pixels within its convex hull to 0.3. These quantities are computed for each maximally stable extremal region, and a first text region is determined when the computed eccentricity is greater than the second threshold, the Euler number is smaller than the third threshold, the ratio of region pixels to bounding-box pixels is lower than the fourth threshold or greater than the fifth threshold, and the convex-hull pixel proportion is below the sixth threshold. Then the seventh threshold on the stroke-width variation rate is set to 0.3 and the stroke-width variation rate of each first text region is computed; when the variation rate is greater than the seventh threshold, a second text region is confirmed. Finally, the second text regions of all maximally stable extremal regions are extracted and merged as the text region, and the remaining regions of the screen content image are merged as the image region;
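As a concrete illustration, the following is a minimal Python sketch of this segmentation stage, assuming OpenCV's MSER detector and scikit-image's region properties as stand-ins for the implementation. The threshold values are the ones stated above and the selection logic follows the description literally; names such as segment_text_and_image are illustrative, and the stroke-width-variation check (threshold 0.3) is only indicated, not implemented.

    import cv2
    import numpy as np
    from skimage.measure import label, regionprops

    def segment_text_and_image(gray):
        """Return boolean masks (text_mask, image_mask) for a uint8 grayscale image."""
        mser = cv2.MSER_create()                  # MSER candidate regions
        regions, _ = mser.detectRegions(gray)

        # Rasterize all MSER candidates into one mask (this merges nested
        # MSERs, which is a simplification of the per-region processing).
        mser_mask = np.zeros(gray.shape, dtype=bool)
        for pts in regions:                       # pts is an N x 2 array of (x, y)
            mser_mask[pts[:, 1], pts[:, 0]] = True

        text_mask = np.zeros_like(mser_mask)
        for prop in regionprops(label(mser_mask)):
            # Geometric filters with the patent's stated threshold values.
            is_text = (prop.eccentricity > 0.995 and
                       prop.euler_number < -4 and
                       (prop.extent < 0.2 or prop.extent > 0.9) and
                       prop.solidity < 0.3)
            if is_text:
                rr, cc = prop.coords[:, 0], prop.coords[:, 1]
                text_mask[rr, cc] = True

        # A stroke-width-variation check (threshold 0.3) would further
        # confirm these candidates; it is omitted here for brevity.
        image_mask = ~text_mask
        return text_mask, image_mask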
(3) Performing an image region quality assessment function on an image region, comprising the steps of:
s1: extracting texture features of the image area; the image region contains a large number of textures and structures, so we use the statistical features of the gray-gradient co-occurrence matrix as the texture features of the image region, first, calculate the gradient map of the image region by Scharr operator, and normalize the gradient map:wherein [ among others ]]To get the whole operation, g max The maximum value of the original gradient value is L, and the normalized maximum gray level number is L; then, the gray scale map of the image area is normalized: />f max Is the gray maximum value in the original gray map; then, a gray-gradient co-occurrence matrix M is constructed, the horizontal increment is a gradient value, the vertical increment is a gray value, and the origin is positioned at the sitting vertex of the matrix. M is defined as M (i, j) = # g (M, N) = i, f (M, N) = j, m=0, 1, 2..m-1, n=0, 1, 2..n-1 }, where M x N is the size of the gradient and gray map, # { } is expressed as the number of elements in the set, and finally extracting the statistical features of the gray-gradient co-occurrence matrix comprises: gradient entropy->Gray entropyEnergy->Gray scale mean valueGradient mean->Standard deviation of gradientGray standard deviation->As a texture feature of the image area, wherein the total number of occurrences of (i, j) is normalized to the probability of occurrence P (i, j);
s2: extracting image structural features of the image area; image region has a lot of structural information besides texture information, so that the image structural features of the image region need to be extracted for quality assessment features, firstly, the image region is divided into n×n partial image blocks with equal size, and each image block is subjected to partial two-dimensional discrete cosine transform (DCT, discrete Cosine Tansform) to obtain DCT coefficients, then the DCT coefficients are fitted by using generalized Gaussian distribution (GGD, generalized Gaussian Distribution) model, after the fitting, the shape parameters gamma of the image block are obtained, the average value of the gamma value of the lowest 10% is taken as the first structural feature, the average value of all gamma values is taken as the second structural feature, which is pooling, the prior study shows that the pooling can improve the correlation with subjective quality perception, the following operation is the same, and then the frequency change coefficient is calculatedWhere σ|X| is the variance of the block, μ|X| is the mean of the block, taken +.>As a third structural feature, the maximum 10% of the mean value of>As a fourth structural feature, after that, in order to acquire direction information from the partial image block, the DCT coefficient block is divided into low, medium, and high 3 frequency bands, the hatching represents the division of three frequency bands as shown in fig. 3, and then the average energy in each frequency band is calculated: />Wherein n is a positive integer, sigma n For the variance of band n, the ratio of sub-band energies is calculated:/>R is taken n Taking the highest 10% of the mean value of R as the fifth structural feature n As a sixth structural feature, finally, in order to extract direction information, the DCT coefficients are divided into 3 parts in 3 directions according to the vertical direction of radial frequency variation, divided as shown by hatching in fig. 4, and then the frequency variation coefficients in 3 directions are calculatedCalculate->The variance of (2), the mean of the highest 10% of the differences was taken as the seventh structural feature, and +.>As an eighth structural feature;
(4) Executing a text region quality assessment function on the text region, comprising the steps of:
s1: extracting the definition of the text region; since the sharpness of a text region affects the visual perception quality of human eyes, we have to make a measure of the sharpness of the text region, we use the difference Δdom (Difference Of Differences In Grayscale Values Of A Median-filtered Image) of median filtered Image difference as a measure feature of whether the edge is sharp, which can determine whether the edge is sharp by whether the slope changes rapidly, firstly, filter in x and y directions, and normalize the maximum value in the filtered Image, and when the normalized pixel value is greater than a preset threshold value of 0.0001, the pixel is used as a possible edge pixel, and then calculate the difference Δdom of median filtered Image difference in horizontal direction and vertical direction, respectively: ΔDoM x (i,j)=[I M (i+2,j)-I M (i,j)]-[I M (i,j)-I M (i-2,j)]Vertical direction: ΔDoM y (i,j)=[I M (i,j+2)-I M (i,j)]-[I M (i,j)-I M (i,j-2)]Wherein I M (i, j) is the gray value of the median filtered image at pixel (i, j) and in order to make the variation of the edge intensity more stable and thus use the difference of deviation 2, the sharpness in x-direction at pixel (i, j) is defined as:the same applies to the definition calculation in the y-direction, wherein Σ i-w≤k≤i+w |ΔDoM x (k, j) indicates summing ΔDoM over a window of size 2w+1, normalizing contrast at edges, Σ i-w≤k≤i+w I (k, j) -I (k-1, j) is the contrast at a window size of 2w+1, when S x If (i, j) is greater than the preset threshold, the pixel point at (i, j) is clear, and finally, the definition of the image of the region is defined as:wherein: /># sharpPixels is the number of clear pixels and# edgePixels is the number of edge pixels;
s2: extracting text structural features of a text region, firstly, calculating a gradient map GM of the text region, and calculating gradients at image pixels (i, j) as follows:wherein the method comprises the steps of h represents the gradient operator and,representing a convolution operation. A local binary pattern LBP of rotation invariance is then calculated on the gradient map,where delta represents a unified metric, U represents the number of neighboring pixels, S represents the radius value of the field, ρ is defined as a threshold function,G k ,G C the GM values expressed as center coordinates and their neighborhoods, then calculate the GMLBP histogram, observe that GMLBP may contain u+2 different modes, which may be combined into one bin of the histogram, set U to 8, so that the histogram has 10 bins in total, and calculate each at three scales, the three scales being the original image, the downsampled image with a downsampling factor of 2, the downsampled image with a downsampling factor of 4, so that 30 text structural features are extracted in total;
(5) Downloading the LIBSVM software package and setting the parameters of the SVM model: the model is an SVR regression model and the kernel function type is the RBF kernel. Screen content images are randomly selected from the screen content image database one thousand times: each time, 80% of the images in the database are randomly selected as the training set and 20% as the test set, the texture features, image structural features, sharpness, text structural features, and subjective quality scores of the training images are input into LIBSVM software for training, regression training on the training-set data yields a quality assessment model, and the test set is used to verify the model. Each time, correlation coefficients between the scores estimated by the quality assessment model on the test set and the subjective quality scores are computed, including SROCC, PLCC, KRCC, and RMSE; these coefficients reflect the error and the correlation between the model's scores and the subjective quality scores and serve as evaluation indices of the algorithm's quality. After the one thousand random trainings, the median of each coefficient is taken as its final value;
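A minimal sketch of this training and validation loop. scikit-learn's SVR with an RBF kernel is used here as a stand-in for LIBSVM (an epsilon-SVR with RBF kernel is the direct equivalent), the SVR hyperparameters are illustrative, and X, y denote the feature matrix from steps (2) to (4) and the subjective quality scores (MOS):

    import numpy as np
    from scipy.stats import pearsonr, spearmanr, kendalltau
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVR

    def train_quality_model(X, y, n_splits=1000):
        srocc, plcc, krcc, rmse = [], [], [], []
        for seed in range(n_splits):
            Xtr, Xte, ytr, yte = train_test_split(
                X, y, test_size=0.2, random_state=seed)   # 80% train / 20% test
            model = SVR(kernel='rbf', C=100, gamma='scale', epsilon=0.1)
            model.fit(Xtr, ytr)
            pred = model.predict(Xte)
            srocc.append(spearmanr(pred, yte).correlation)
            plcc.append(pearsonr(pred, yte)[0])
            krcc.append(kendalltau(pred, yte).correlation)
            rmse.append(np.sqrt(np.mean((pred - yte) ** 2)))
        # Median of each index over the random splits, as described above;
        # the returned model is simply the one from the last split.
        indices = {'SROCC': srocc, 'PLCC': plcc, 'KRCC': krcc, 'RMSE': rmse}
        return model, {k: float(np.median(v)) for k, v in indices.items()}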
(6) Inputting a screen content image to be assessed, processing it through steps (2), (3), and (4), and inputting its texture features, image structural features, sharpness, and text structural features into the quality assessment model to obtain the quality score.
The above describes only preferred embodiments of the present invention and is not intended to limit it; those skilled in the art may make various modifications and variations to the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (2)

1. A screen content image quality evaluation method, comprising the steps of:
(1) Constructing a screen content image database;
(2) Executing a text segmentation function on the screen content image, and segmenting the screen content image into a text region and an image region;
(3) Executing an image area quality evaluation function on the image area, and extracting texture features and image structure features of the image area;
(4) Executing a text region quality assessment function on the text region, and extracting the sharpness and text structural features of the text region;
(5) Randomly selecting screen content images from the screen content image database one thousand times, and inputting the texture features, image structural features, sharpness, text structural features, and subjective quality scores of the selected screen content images into LIBSVM software for training to obtain a quality assessment model;
(6) Inputting a screen content image to be assessed, processing it through steps (2), (3), and (4), and inputting its texture features, image structural features, sharpness, and text structural features into the quality assessment model to obtain the quality score;
wherein step (2) specifically comprises the following steps: first, a first threshold is dynamically set and all maximally stable extremal regions are found by

$$v(i)=\frac{|Q_{i+\Delta}\setminus Q_{i-\Delta}|}{|Q_i|},$$

where $Q_i$ denotes a connected region when the first threshold is $i$, $\Delta$ denotes a small change of the first threshold, and $v(i)$ is the rate of change of region $Q_i$ when the first threshold is $i$; if $v(i)$ is smaller than the given first threshold, region $Q_i$ is taken to be a maximally stable extremal region. Secondly, the following are set: a second threshold on the eccentricity of the ellipse having the same standard second-order central moments as the region, a third threshold on the Euler number, fourth and fifth thresholds on the ratio of the number of pixels in the region to the total pixels of its bounding box, and a sixth threshold on the proportion of the region's pixels within its convex hull. These quantities are computed for each maximally stable extremal region, and a first text region is determined when the computed eccentricity is greater than the second threshold, the Euler number is smaller than the third threshold, the ratio of region pixels to bounding-box pixels is lower than the fourth threshold or greater than the fifth threshold, and the convex-hull pixel proportion is below the sixth threshold. Then a seventh threshold on the stroke-width variation rate is set and the stroke-width variation rate of each first text region is computed; when the variation rate is greater than the seventh threshold, a second text region is confirmed. Finally, the second text regions of all maximally stable extremal regions are extracted and merged as the text region, and the remaining regions of the screen content image are merged as the image region;
the step (3) specifically comprises the following steps:
s1: extracting texture features of an image region, firstly, calculating a gradient map of the image region through a Scharr operator, and normalizing the gradient map:wherein [ among others ]]To get the whole operation, g max The maximum value of the original gradient value is L, and the normalized maximum gray level number is L; then, the gray scale map of the image area is normalized:f max is the gray maximum value in the original gray map; then, constructing a gray-gradient co-occurrence matrix M, horizontally increasing the matrix to be a gradient value, vertically increasing the matrix to be a gray value, and positioning an origin at a sitting vertex of the matrix; m is defined as M (i, j) = # g (M, N) = i, f (M, N) = j, m=0, 1, 2..m-1, n=0, 1, 2..n-1 }, where M x N is the size of the gradient and gray map, # { } is expressed as the number of elements in the set, and finally extracting the statistical features of the gray-gradient co-occurrence matrix comprises: gradient entropy->Gray entropyEnergy->Gray scale mean valueGradient mean->Standard deviation of gradientGray standard deviation->As a texture feature of the image area, wherein the total number of occurrences of (i, j) is normalized to the probability of occurrence P (i, j);
s2: extracting image structure characteristics of an image area, firstly, partitioning the image area into n multiplied by n partial image blocks with equal size, carrying out partial two-dimensional discrete cosine transform on each image block to obtain DCT coefficients, then fitting the DCT coefficients by using a generalized Gaussian distribution model, obtaining shape parameters gamma of the image blocks after fitting, and obtaining the minimum 10 percent of the shape parameters gammaThe average value of the gamma values is used as a first structural feature, the average value of all the gamma values is used as a second structural feature, and then the frequency change coefficient is calculatedWhere σ|X| is the variance of the block, μ|X| is the mean of the block, taken +.>As a third structural feature, the maximum 10% of the mean value of>As a fourth structural feature, then, to acquire direction information from the partial image block, the DCT coefficient block is divided into low, medium, and high 3 frequency bands, and then the average energy in each frequency band is calculated: />Wherein n is a positive integer, sigma n For the variance of band n, the ratio of sub-band energies is calculated: />R is taken n Taking the highest 10% of the mean value of R as the fifth structural feature n As a sixth structural feature, finally, in order to extract the direction information, the DCT coefficients are divided into 3 parts in 3 directions according to the vertical direction of the radial frequency variation, and then the frequency variation coefficients +_ in 3 directions are calculated>Calculate->The variance of (2), the mean of the highest 10% of the differences was taken as the seventh structural feature, and +.>As an eighth structural feature.
2. The screen content image quality assessment method according to claim 1, wherein step (4) comprises the following steps:
s1: extracting the definition of a text region, firstly, filtering in x and y directions, normalizing the filtered image compared with the maximum value in the filtered image, and when the normalized pixel point value is larger than a preset threshold value, taking the pixel point as a possible edge pixel, and then calculating the difference delta DoM of median filtering image difference in the horizontal direction and the vertical direction respectively, wherein the horizontal direction is as follows: ΔDoM x (i,j)=[I M (i+2,j)-I M (i,j)]-[I M (i,j)-I M (i-2,j)]Vertical direction: ΔDoM y (i,j)=[I M (i,j+2)-I M (i,j)]-[I M (i,j)-I M (i,j-2)]Wherein I M (i, j) is the gray value of the median filtered image at pixel (i, j), using the difference of deviation 2, the sharpness in the x-direction at pixel (i, j) is defined as:the same applies to the definition calculation in the y-direction, wherein Σ i-w≤k≤i-w |ΔDoM x (k, j) indicates summing ΔDoM over a window of size 2w+1, normalizing contrast at edges, Σ i-w≤k≤i+w I (k, j) -I (k-1, j) is the contrast at a window size of 2w+1, when S x If (i, j) is greater than the preset threshold, the pixel point at (i, j) is clear, and finally, the definition of the image of the region is defined as: />Wherein:# sharpPixels is the number of clear pixels and# edgePixels is the number of edge pixels;
S2: extracting text structural features of a text region, firstly, calculating a gradient map GM of the text region, and calculating gradients at image pixels (i, j) as follows:wherein the method comprises the steps ofh represents the gradient operator,>representing convolution operation; a local binary pattern LBP of rotation invariance is then calculated on the gradient map,where delta represents a unified metric, U represents the number of neighboring pixels, S represents the radius value of the field, ρ is defined as a threshold function,G k ,G C the GM values, expressed as center coordinates and their neighborhoods, are then computed, and it is observed that GMLBP may contain u+2 different modes, which may be combined into one bin of the histogram, set U to 8, so that the histogram has 10 bins in total, and are computed separately at three scales, the original image, the downsampled image with a downsampling factor of 2, and the downsampled image with a downsampling factor of 4, so that 30 text structural features are extracted in total.
CN202110831904.9A (priority date 2021-07-22; filing date 2021-07-22): Screen content image quality assessment method. Active; granted as CN113610862B (en).

Priority Applications (1)

Application Number: CN202110831904.9A; Priority Date: 2021-07-22; Filing Date: 2021-07-22; Title: Screen content image quality assessment method

Publications (2)

CN113610862A (en): published 2021-11-05
CN113610862B (en): published 2023-08-01 (grant)

Family

ID=78305174

Family Applications (1)

Application Number: CN202110831904.9A; Title: Screen content image quality assessment method; Status: Active

Country Status (1)

Country: CN; Publication: CN113610862B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114067006B (en) * 2022-01-17 2022-04-08 湖南工商大学 Screen content image quality evaluation method based on discrete cosine transform
CN116578763B (en) * 2023-07-11 2023-09-15 卓谨信息科技(常州)有限公司 Multisource information exhibition system based on generated AI cognitive model
CN116664551B (en) * 2023-07-21 2023-10-31 深圳市长荣科机电设备有限公司 Display screen detection method, device, equipment and storage medium based on machine vision
CN116863492B (en) * 2023-09-04 2023-11-21 山东正禾大教育科技有限公司 Mobile digital publishing system

Citations (6)

Publication number Priority date Publication date Assignee Title
CN107123122A (en) * 2017-04-28 2017-09-01 深圳大学 Non-reference picture quality appraisement method and device
CN108596906A (en) * 2018-05-10 2018-09-28 嘉兴学院 It is a kind of to refer to screen image quality evaluating method entirely based on sparse locality preserving projections
WO2018195891A1 (en) * 2017-04-28 2018-11-01 深圳大学 Method and apparatus for evaluating quality of non-reference image
EP3422254A1 (en) * 2017-06-29 2019-01-02 Samsung Electronics Co., Ltd. Method and apparatus for separating text and figures in document images
CN110400307A (en) * 2019-07-29 2019-11-01 青岛大学 A kind of screen picture method for evaluating quality based on area differentiation
CN111047618A (en) * 2019-12-25 2020-04-21 福州大学 Multi-scale-based non-reference screen content image quality evaluation method

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US8515171B2 (en) * 2009-01-09 2013-08-20 Rochester Institute Of Technology Methods for adaptive and progressive gradient-based multi-resolution color image segmentation and systems thereof


Non-Patent Citations (1)

Title
Screen image quality assessment based on the difference of sparse representation features; Zhao Hongmeng et al.; Journal of Qingdao University (Natural Science Edition); Vol. 32, No. 4; full text *

Also Published As

Publication number: CN113610862A (en); publication date: 2021-11-05


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant