CN117152409A - Image clipping method, device and equipment based on multi-mode perception modeling - Google Patents

Image clipping method, device and equipment based on multi-mode perception modeling

Info

Publication number
CN117152409A
Authority
CN
China
Prior art keywords
image
clipping
aesthetic
data set
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310991234.6A
Other languages
Chinese (zh)
Inventor
黎蕴玉
李小青
丁小波
蔡茂贞
钟地秀
彭琨
臧文静
赖俊滔
黄珊珊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Internet Co Ltd
Original Assignee
China Mobile Internet Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Internet Co Ltd filed Critical China Mobile Internet Co Ltd
Priority to CN202310991234.6A
Publication of CN117152409A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/776 Validation; Performance evaluation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides an image clipping method, device and equipment based on multi-mode perception modeling. The method includes: acquiring an aesthetic attribute data set, an aesthetic composition data set, an image clipping data set and a clipping history data set; training an initial image clipping model according to these data sets to obtain a target image clipping model; and processing an initial image according to the target image clipping model to obtain target clipping region information corresponding to the initial image, the target clipping region information being used for clipping the initial image to obtain a target image. By implementing the disclosed method, the training of the initial image clipping model can be effectively improved on the basis of multi-modal data, so that the image clipping performance of the obtained target image clipping model is improved and the aesthetic quality of the pictures cropped by that model meets the personalized requirements of users.

Description

Image clipping method, device and equipment based on multi-mode perception modeling
Technical Field
The disclosure relates to the technical field of image processing, in particular to an image clipping method, device and equipment based on multi-mode perception modeling.
Background
Intelligent image clipping means that a computer automatically clips a captured picture according to aesthetic rules: redundant parts are removed while the main content is retained, the overall composition of the image is optimized, and the aesthetic quality of the image is improved. The technology can be applied to scenes with high requirements on picture aesthetics, such as poster generation, image thumbnail generation and album cover production. Image aesthetics is an abstract perceptual concept whose evaluation requires a certain level of artistic and aesthetic cultivation.
In the related art, intelligent image clipping is generally modeled on a single modality, so the clipping performance of the resulting model is poor and the aesthetic quality of the clipped picture cannot be guaranteed to meet the personalized requirements of users.
Disclosure of Invention
The present disclosure aims to solve, at least to some extent, one of the technical problems in the related art.
Therefore, an object of the present disclosure is to provide an image clipping method, apparatus, computer device and storage medium based on multi-modal perceptual modeling, which can effectively improve the training of an initial image clipping model on the basis of multi-modal data, thereby improving the image clipping performance of the obtained target image clipping model and ensuring that the aesthetic quality of the pictures cropped by that model meets the personalized requirements of users.
To achieve the above object, an image cropping method based on multi-modal perceptual modeling according to an embodiment of a first aspect of the present disclosure includes:
acquiring an aesthetic attribute data set, an aesthetic composition data set, an image clipping data set and a clipping history data set;
training an initial image cropping model according to the aesthetic attribute data set, the aesthetic composition data set, the image cropping data set and the cropping history data set to obtain a target image cropping model; and
processing an initial image according to the target image clipping model to obtain target clipping region information corresponding to the initial image, wherein the target clipping region information is used for clipping the initial image to obtain a target image.
To achieve the above object, an image cropping device based on multi-modal perceptual modeling according to an embodiment of a second aspect of the present disclosure includes:
the acquisition module is used for acquiring an aesthetic attribute data set, an aesthetic composition data set, an image clipping data set and a clipping history data set;
the model training module is used for training an initial image clipping model according to the aesthetic attribute data set, the aesthetic composition data set, the image clipping data set and the clipping history data set so as to obtain a target image clipping model; and
The processing module is used for processing the initial image according to the target image clipping model to obtain target clipping region information corresponding to the initial image, wherein the target clipping region information is used for clipping the initial image to obtain a target image.
Embodiments of the third aspect of the present disclosure provide a computer device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, which when executed implements an image cropping method based on multimodal perceptual modeling as set forth in the embodiments of the first aspect of the disclosure.
An embodiment of a fourth aspect of the present disclosure proposes a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements an image cropping method based on multimodal perceptual modeling as proposed by an embodiment of the first aspect of the present disclosure.
A fifth aspect embodiment of the present disclosure proposes a computer program product which, when executed by a processor, performs an image cropping method based on multimodal perceptual modeling as proposed by the first aspect embodiment of the present disclosure.
According to the image clipping method, device, computer equipment and storage medium based on multi-modal perception modeling, an aesthetic attribute data set, an aesthetic composition data set, an image clipping data set and a clipping history data set are acquired; an initial image clipping model is trained according to these data sets to obtain a target image clipping model; and an initial image is processed according to the target image clipping model to obtain target clipping region information corresponding to the initial image, the target clipping region information being used for clipping the initial image to obtain a target image. As a result, the training of the initial image clipping model can be effectively improved on the basis of multi-modal data, the image clipping performance of the obtained target image clipping model is improved, and the aesthetic quality of the pictures cropped by that model can meet the personalized requirements of users.
Additional aspects and advantages of the disclosure will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the disclosure.
Drawings
The foregoing and/or additional aspects and advantages of the present disclosure will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of an image cropping method based on multi-modal perceptual modeling according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of an image cropping method based on multimodal perception modeling according to another embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a coding module according to the present disclosure;
FIG. 4 is a schematic structural view of feature fusion proposed in accordance with the present disclosure;
FIG. 5 is a flow chart of an image cropping method based on multimodal perception modeling according to another embodiment of the present disclosure;
FIG. 6 is a schematic diagram of adaptive weighting proposed in accordance with the present disclosure;
FIG. 7 is a schematic diagram of a clipping reasoning process proposed in accordance with the present disclosure;
fig. 8 is a schematic structural diagram of a cloud image clipping method based on multi-modal perceptual modeling according to the present disclosure;
FIG. 9 is a schematic structural diagram of an image cropping device based on multi-modal perceptual modeling according to an embodiment of the present disclosure;
FIG. 10 illustrates a block diagram of an exemplary computer device suitable for use in implementing embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the present disclosure and are not to be construed as limiting the present disclosure. On the contrary, the embodiments of the disclosure include all alternatives, modifications, and equivalents as may be included within the spirit and scope of the appended claims.
Fig. 1 is a flowchart of an image cropping method based on multi-modal perceptual modeling according to an embodiment of the present disclosure.
It should be noted that the execution subject of the image clipping method based on multi-modal perception modeling in this embodiment is an image clipping device based on multi-modal perception modeling. The device may be implemented by software and/or hardware and may be configured in a computer device, where the computer device may include, but is not limited to, a terminal, a server, and the like; the terminal may be, for example, a mobile phone, a handheld computer, and the like.
As shown in fig. 1, the image clipping method based on multi-modal perceptual modeling includes:
S101: an aesthetic property dataset, an aesthetic composition dataset, an image cropping dataset, and a cropping history dataset are obtained.
Wherein, aesthetic property data set refers to a set of values or labels that describe and evaluate the aesthetic features of a picture.
Wherein an aesthetic composition dataset, which may refer to a dataset made up of elements related to arrangement, balance, symmetry, line organization, etc. of an image, may be used to describe the visual structure and layout of the image.
The image cropping data set may be a data set formed by data related to image cropping.
The clipping history data set may be a data set formed by related data of image clipping performed by the user in a history period.
In the embodiment of the disclosure, when the aesthetic attribute data set, the aesthetic composition data set, the image clipping data set and the clipping history data set are acquired, a communication link between the execution subject of the embodiment of the disclosure and a big data server may be established in advance, and the four data sets may then be acquired from the big data server; alternatively, the data sets may be acquired via a third-party data generating device, which is not limited here.
In embodiments of the present disclosure, once the aesthetic attribute data set, the aesthetic composition data set, the image cropping data set, and the cropping history data set are obtained, multi-dimensional data support can be provided for the subsequent training of the initial image cropping model.
S102: training an initial image cropping model according to the aesthetic attribute data set, the aesthetic composition data set, the image cropping data set and the cropping history data set to obtain a target image cropping model.
The initial image clipping model may refer to an untrained image clipping model in an initial state. The target image clipping model may refer to an image clipping model obtained by training the initial image clipping model through the aesthetic attribute data set, the aesthetic composition data set, the image clipping data set and the clipping history data set.
That is, in the embodiment of the present disclosure, after the aesthetic attribute data set, the aesthetic composition data set, the image clipping data set, and the clipping history data set are acquired, the initial image clipping model may be trained according to the aesthetic attribute data set, the aesthetic composition data set, the image clipping data set, and the clipping history data set, so as to implement multi-modal-based perception modeling, thereby ensuring reliability of the model training process.
S103: and processing the initial image according to the target image clipping model to obtain target clipping region information corresponding to the initial image, wherein the target clipping region information is used for clipping the initial image to obtain the target image.
The initial image may refer to an image to be subjected to clipping processing. The target cropping zone information may be used to indicate the cropping process to which the initial image corresponds.
The target image is an image obtained by clipping the initial image through the target clipping region information.
In this embodiment, an aesthetic attribute data set, an aesthetic composition data set, an image clipping data set and a clipping history data set are acquired; an initial image clipping model is trained according to these data sets to obtain a target image clipping model; and an initial image is processed according to the target image clipping model to obtain target clipping region information corresponding to the initial image, the target clipping region information being used for clipping the initial image to obtain a target image. Thereby, the training of the initial image clipping model can be effectively improved on the basis of multi-modal data, the image clipping performance of the obtained target image clipping model can be improved, and the aesthetic quality of the pictures cropped by that model can be ensured to meet the personalized requirements of users.
The embodiment of the disclosure also provides an image clipping method based on multi-modal perceptual modeling, wherein the aesthetic attribute data set includes a first training image and a first test image corresponding to the first training image. The first training image carries corresponding aesthetic evaluation information and aesthetic attribute information; the aesthetic evaluation information is used to evaluate the first training image in the aesthetic dimension, and the aesthetic attribute information includes at least one of an interest content attribute, an image subject attribute and an image illumination attribute. In this way, the contribution of the aesthetic attribute data set to training the initial image clipping model is effectively improved.
The first training image is an image used for model training in the aesthetic attribute data set. And the first test image may refer to an image of the aesthetic property dataset that is used to conduct the model test.
The aesthetic evaluation information may be, for example, an aesthetic evaluation score of the pointer to the first training image.
For example, in the embodiment of the present disclosure, the aesthetic attribute data set may include training pictures and test pictures, where each picture corresponds to a mean opinion score (MOS), and multiple attributes such as interest content, key subject and illumination condition are also annotated in each picture. This data set is used for the aesthetic attribute branch.
The disclosed embodiments also propose an image cropping method based on multimodal perceptual modeling, wherein the aesthetic composition data set includes a second training image and a second test image corresponding to the second training image, the second training image having corresponding reference composition information, so that the guidance provided by the aesthetic composition data set in the model training process can be effectively improved.
Wherein the second training image may refer to an image included in the aesthetic composition dataset for model training. The second test image may then refer to the image of the aesthetic composition dataset that was used to conduct the model test.
Wherein the reference composition information may refer to reference information of the second training image related to the aesthetic composition.
For example, the aesthetic composition data set may contain training pictures and test pictures, where each picture corresponds to an annotation of 1 to 3 composition rules, such as diagonal, center, horizontal, etc. This data set is used for the aesthetic composition branch.
The embodiment of the disclosure also provides an image clipping method based on multi-modal perceptual modeling, wherein the image clipping dataset comprises: the third training image and the third test image corresponding to the third training image, the third training image having corresponding label cropping zone information, whereby the image cropping dataset may provide reliable reference information for the training process of the initial image cropping model.
The third training image may be an image for performing model training included in the image cropping dataset. The third test image may then refer to the image in the image cropping dataset that is used to perform the model test.
The labeling clipping region information may refer to an image clipping region labeled in advance in the third training image.
For example, the image crop data set may include training pictures and test pictures, each corresponding to an optimal crop box label. The dataset is used for image cropping branches.
The embodiment of the disclosure also provides an image clipping method based on multi-modal perceptual modeling, wherein the clipping history data set includes a cropped image pair and a reference cropping feature corresponding to the cropped image pair. The cropped image pair includes a pre-cropping image and a post-cropping image, and the reference cropping feature includes at least one of cropping ratio information, the number of image editing operations, and behavior feature information corresponding to the user's cropping behavior. In this way, the clipping history data set can accurately reflect the user's historical cropping characteristics, ensuring that the target image clipping model obtained through training fits the user.
For example, in the embodiment of the present disclosure, the clipping history data of a cloud disk user may be obtained as the clipping history data set. The clipping history data of the cloud disk user refer to the historical records of image clipping performed by the user on images stored in the cloud disk, including cropped picture pairs formed by a pre-cropping picture and a post-cropping picture, the cropping ratio used (2:3, 3:4, and the like), the total number of picture editing operations, and so on. The user's cropping behavior records are converted into one-hot features and input into the network for learning, so that the cropping predictions of the model can fit the cropping preferences of different users. This data set is used for the image cropping branch.
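As a concrete illustration, a minimal sketch of converting such cropping records into normalized one-hot behaviour features is given below; the category vocabularies (aspect ratios, edit-count buckets) and function names are assumptions introduced here for illustration only, not part of the disclosure.

# Illustrative sketch: encoding a user's cropping record as concatenated,
# normalized one-hot behaviour features (vocabularies are hypothetical).
import numpy as np

ASPECT_RATIOS = ["1:1", "2:3", "3:4", "4:3", "16:9"]   # hypothetical vocabulary
EDIT_BUCKETS = [0, 1, 3, 5, 10]                        # hypothetical bucket edges

def one_hot(index: int, size: int) -> np.ndarray:
    v = np.zeros(size, dtype=np.float32)
    v[index] = 1.0
    return v

def crop_record_to_feature(aspect_ratio: str, edit_count: int) -> np.ndarray:
    """Encode one cropping record as a one-hot behaviour feature vector."""
    ratio_vec = one_hot(ASPECT_RATIOS.index(aspect_ratio), len(ASPECT_RATIOS))
    bucket = sum(edit_count >= b for b in EDIT_BUCKETS) - 1   # bucket the edit count
    bucket_vec = one_hot(bucket, len(EDIT_BUCKETS))
    feat = np.concatenate([ratio_vec, bucket_vec])
    return feat / (np.linalg.norm(feat) + 1e-8)               # simple normalization

# Example: a user who cropped to 3:4 and has edited pictures 4 times in total.
print(crop_record_to_feature("3:4", 4))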
For example, example content of the aesthetic property dataset, the aesthetic composition dataset, the image cropping dataset, and the cropping history dataset may be as shown in table 1:
TABLE 1
Fig. 2 is a flow chart illustrating an image cropping method based on multi-modal perceptual modeling according to another embodiment of the present disclosure.
As shown in fig. 2, the image clipping method based on multi-modal perceptual modeling includes:
s201: an aesthetic property dataset, an aesthetic composition dataset, an image cropping dataset, and a cropping history dataset are obtained.
The description of S201 may be specifically referred to the above embodiments, and will not be repeated here.
S202: inputting the aesthetic attribute data set, the aesthetic composition data set, the image clipping data set and the clipping history data set into a feature extraction module to obtain target clipping features output by the feature extraction module, wherein the target clipping features comprise: aesthetic attribute features associated with the aesthetic attribute dataset, aesthetic composition features associated with the aesthetic composition dataset, reference cropping features associated with the image cropping dataset and the cropping history dataset.
The feature extraction module may be a module for performing feature extraction, which is pre-constructed in the initial image clipping model. The target clipping feature may be clipping features extracted by the feature extraction module processing the aesthetic property dataset, the aesthetic composition dataset, the image clipping dataset and the clipping history dataset.
The aesthetic attribute feature may refer to feature information related to the aesthetic attribute extracted by the feature extraction module. Aesthetic composition characteristics can refer to characteristic information related to the aesthetic composition extracted by the characteristic extraction module. The reference clipping feature may refer to the clipping-related feature information extracted by the feature extraction module.
Optionally, in some embodiments, the feature extraction module includes an aesthetic adapter, a composition adapter and a plurality of coding modules connected in series, where the input data and the output data of each coding module are spliced to serve as the input data of the next coding module. When the aesthetic attribute data set, the aesthetic composition data set, the image clipping data set and the clipping history data set are input into the feature extraction module to obtain the target clipping features output by the feature extraction module, the aesthetic attribute data set may be input into the aesthetic adapter to obtain first output data and into the plurality of coding modules to obtain second output data, the first output data and the second output data together serving as the aesthetic attribute features; the aesthetic composition data set may be input into the composition adapter to obtain third output data and into the plurality of coding modules to obtain fourth output data, the third output data and the fourth output data together serving as the aesthetic composition features; and the image clipping data set and the clipping history data set may be input into the plurality of coding modules to obtain fifth output data, the fifth output data serving as the reference clipping feature. In this way, the feature extraction module can effectively extract the aesthetic, composition and clipping features, and its feature extraction capability is improved.
The aesthetic adapter can be used by users to specify requirements on the color, contrast, brightness, filter effects and the like of the image, so that the generated image meets the users' expectations. For example, in artistic creation, aesthetic adapters may be used to simulate the styles of different painters; in product design, aesthetic adapters may be used to ensure that the generated image is consistent with the brand image.
Wherein the composition adapter may generate specific control instructions or markers informing the model how to adjust the composition of the image. The model can be correspondingly processed and adjusted according to the composition requirements set by the user, so that more personalized image results meeting the requirements of the user are generated.
That is, in the embodiment of the present disclosure, the self-configured CViT-BERT backbone network may be used as the feature extraction module, where the module includes an aesthetic adapter, a composition adapter, and a plurality of coding modules adapted to different input data, where the input and the output of the coding module are spliced to be used as the input of the next coding module, so as to improve the feature transmission efficiency.
For example, as shown in fig. 3, fig. 3 is a schematic structural diagram of an encoding module according to the present disclosure, in which features of different modalities are processed and then enter a shared module. The shared module is formed by stacking a normalization layer, a multi-head self-attention layer and a perception layer, where the perception layer consists of two fully connected layers with GELU activation functions. (1) When the input data is an image, convolution layers with different convolution kernel sizes are used to extract features of different scales and local image features; the features of different scales are spliced and then input into a pooling layer, so that the scale invariance of the image is modeled. (2) When the input data is an attribute description text, the text S is divided into M segments and embedded by a word embedding matrix T, and the position embedding T_pos and the segment embedding T_seg of the text are spliced with it, i.e. S = [T_1 T_2 ... T_M, T_pos, T_seg] (equation 1). (3) When the input data is the user's cropping records, the records are converted into one-hot features, spliced and normalized, and then input as the user's cropping behavior features.
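For illustration, a minimal PyTorch-style sketch of such a shared coding module, a multi-scale convolutional image stem, and the splice-and-forward stacking between modules is given below; the class names, dimensions and kernel sizes are assumptions for illustration, not the disclosed configuration.

# Sketch of the shared coding module (norm + multi-head self-attention + GELU MLP),
# a multi-scale convolutional stem for image input, and dense splicing between modules.
import torch
import torch.nn as nn

class SharedCodingModule(nn.Module):
    """Normalization + multi-head self-attention + two-layer GELU perception layer."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, x):                     # x: (B, N, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        return x + self.mlp(self.norm2(x))

class MultiScaleImageStem(nn.Module):
    """Convolutions with different kernel sizes, spliced and pooled, as in point (1)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(3, dim // 4, k, stride=4, padding=k // 2) for k in (3, 5, 7, 9)]
        )
        self.pool = nn.AdaptiveAvgPool2d(8)

    def forward(self, img):                   # img: (B, 3, H, W)
        feats = [self.pool(b(img)) for b in self.branches]
        x = torch.cat(feats, dim=1)           # splice the scale-specific features
        return x.flatten(2).transpose(1, 2)   # (B, 64 tokens, dim)

class StackedEncoder(nn.Module):
    """Each module's input and output are spliced and projected as the next module's input."""
    def __init__(self, dim: int = 256, depth: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList([SharedCodingModule(dim) for _ in range(depth)])
        self.fuse = nn.ModuleList([nn.Linear(2 * dim, dim) for _ in range(depth)])

    def forward(self, x):
        for block, fuse in zip(self.blocks, self.fuse):
            y = block(x)
            x = fuse(torch.cat([x, y], dim=-1))
        return x

tokens = MultiScaleImageStem()(torch.randn(1, 3, 224, 224))
print(StackedEncoder()(tokens).shape)        # torch.Size([1, 64, 256])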
S203: the aesthetic attribute features, the reference cut features, and the features of the coded descriptive parameters associated with the feature extraction module are input into the aesthetic module to cause the aesthetic module to determine predicted aesthetic assessment information, wherein the aesthetic assessment information and the predicted aesthetic assessment information corresponding to the first training image are used to determine the first loss information.
Among them, the aesthetic module may refer to a functional module used to predict aesthetic evaluation information of an image.
Wherein, the aesthetic evaluation information is predicted, which may refer to the aesthetic evaluation information obtained by prediction based on the aesthetic module.
Wherein, the coding description parameters can refer to parameters output by inputting aesthetic attribute data sets to a plurality of coding modules.
The first loss information refers to loss information which is determined based on aesthetic evaluation information and predictive aesthetic evaluation information and is used for performing iterative training on the initial image clipping model.
That is, the network output portion in embodiments of the present disclosure may contain aesthetic property branches consisting of aesthetic property data inputs, CViT-BERT backbone networks, aesthetic modules, and Huber loss functions. The attribute labels in the pictures are organized into the aesthetic description text of the pictures, and the aesthetic description text and the aesthetic label pictures are added into the branches for training, so that MOS aesthetic scores of the pictures can be predicted together.
Optionally, in some embodiments, the aesthetic module is configured to extract attribute semantic information from the aesthetic attribute features, extract coding semantic information from the features of the coding description parameters related to the feature extraction module, extract visual feature information from the reference clipping feature, and determine predicted aesthetic evaluation information according to the attribute semantic information, the coding semantic information, and the visual feature information, thereby effectively improving the processing effects of the aesthetic module on the aesthetic attribute features, the coding description parameters, and the reference clipping feature, and ensuring the accuracy of the obtained predicted aesthetic evaluation information.
The attribute semantic information refers to information obtained by extracting semantic information by taking aesthetic attribute features as texts. The coding semantic information may be information obtained by performing semantic extraction by using the feature of the coding description parameter as a text. Visual feature information may refer to information of visual feature dimensions extracted from the reference cropping feature.
For example, in the embodiment of the present disclosure, the attribute description text of the picture may be formed by attribute labels in series (may be adjusted according to the situation), the attribute expression statement may be as follows, and other description statements may be generated according to templates:
(a) Balance, color harmony, interest content, illumination, rule of thirds and color vividness in the image are all good, while shallow depth of field and key subject are average.
(b) Balance and color harmony in the image are good; interest content, illumination, rule of thirds and color vividness are average; shallow depth of field and key subject are poor.
(c) Balance, color harmony and shallow depth of field in the image are average; interest content, illumination, rule of thirds, color vividness and key subject are poor.
The attribute description text is converted into a text sequence S1 through equation 1, and the text sequence S1 and the image I1 are together used as the input of the aesthetic attribute branch. After the backbone network, an aesthetic module is applied; the aesthetic module is formed by a 3×3 convolution and fully connected layers connected in series, and finally a Huber loss function is attached to predict the MOS score. Let Y'_mos be the predicted MOS score and Y_mos the annotated MOS score, and let e = Y_mos − Y'_mos; the Huber loss takes the value 0.5·e² when |e| ≤ θ and θ·(|e| − 0.5·θ) otherwise, where θ is a parameter factor (typically 1).
That is, in embodiments of the present disclosure, the aesthetic attribute data may be used as input to learn the aesthetic adapter and the parameters of the multiple coding modules in the CViT-BERT backbone network, both of which are then fed into the aesthetic module and optimized with the Huber loss.
The branch can extract semantic information by taking aesthetic attribute information and description of multi-module coding parameters as text information, perform multi-mode learning by combining visual information, and perform regression prediction on aesthetic MOS scores.
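A minimal sketch of this Huber loss, written directly from the definitions above (e = Y_mos − Y'_mos, parameter factor θ), is shown below; PyTorch's built-in nn.HuberLoss could equally be used.

# Huber loss between annotated and predicted MOS scores.
import torch

def huber_loss(y_mos: torch.Tensor, y_pred: torch.Tensor, theta: float = 1.0) -> torch.Tensor:
    e = y_mos - y_pred
    quadratic = 0.5 * e ** 2                       # used when |e| <= theta
    linear = theta * (e.abs() - 0.5 * theta)       # used when |e| > theta
    return torch.where(e.abs() <= theta, quadratic, linear).mean()

# Example: annotated vs. predicted MOS scores for a small batch.
print(huber_loss(torch.tensor([3.8, 4.2]), torch.tensor([3.5, 4.6])))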
S204: the aesthetic composition features, the reference cropping features, and the features of the encoded descriptive parameters associated with the feature extraction module are input into the composition module to cause the composition module to determine predicted composition information, wherein the reference composition information and the predicted composition information are used to determine second loss information.
The composition module may refer to a functional module used for processing composition related information in the initial image clipping model.
The composition information is composition information obtained based on composition module prediction.
The second loss information may refer to loss information determined based on the reference composition information and the predicted composition information for performing iterative training on the initial image cropping model.
Optionally, in some embodiments, the composition module is configured to classify the reference composition information according to the aesthetic composition feature, the reference clipping feature and the feature of the coding description parameter related to the feature extraction module, and use the composition information obtained by classification as the predicted composition information, thereby effectively improving applicability of the obtained predicted composition information.
The reference composition information may refer to composition information that is preconfigured to be used as reference information.
That is, in the embodiment of the present disclosure, in the aesthetic composition and image cropping branches, the image cropping set and the user cropping data are respectively used as inputs to learn the parameters of the plurality of encoding modules in the CViT-BERT backbone network (with the adapter parameters frozen); the resulting features are input into the composition module, and a multi-class classification loss function is then attached. The composition rules are classified by utilizing the aesthetic composition principles of the image, providing composition information for image cropping.
For example, the composition rule branch consists of the aesthetic composition data, the CViT-BERT backbone network, the composition module, and a classification loss function. For the aesthetic composition atlas, learning composition rules can provide aesthetic composition guidance for image cropping. Let the input picture be I2; after passing through the backbone network it is fed into the composition module, which consists of an average pooling layer and a fully connected layer and finally outputs classification confidence scores for the composition rules. Since each picture in the aesthetic composition data carries 1 to 3 composition rule tags, classification prediction uses a multi-label asymmetric loss function. Given K rule tags, the predicted category confidence for the input picture I2 is p = [p1, ..., pK], and yk indicates whether label k is actually present; γ+ and γ− are parameters that balance the contributions of positive and negative samples in the asymmetric loss.
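The patent text does not reproduce the exact loss formula, so the sketch below uses the commonly adopted asymmetric focal form of the multi-label asymmetric loss, with gamma_pos and gamma_neg standing in for γ+ and γ−; treat it as an illustrative stand-in rather than the authors' precise formulation.

# Multi-label asymmetric loss in its standard asymmetric focal form.
import torch

def asymmetric_multilabel_loss(logits: torch.Tensor, targets: torch.Tensor,
                               gamma_pos: float = 0.0, gamma_neg: float = 4.0) -> torch.Tensor:
    """logits: (B, K) raw scores; targets: (B, K) in {0, 1} for K composition rules."""
    p = torch.sigmoid(logits)
    loss_pos = targets * (1 - p) ** gamma_pos * torch.log(p.clamp(min=1e-8))
    loss_neg = (1 - targets) * p ** gamma_neg * torch.log((1 - p).clamp(min=1e-8))
    return -(loss_pos + loss_neg).mean()

# Example: 2 pictures, K = 5 composition rules, each picture carrying 1-3 rule tags.
logits = torch.randn(2, 5)
targets = torch.tensor([[1., 0., 1., 0., 0.], [0., 1., 1., 1., 0.]])
print(asymmetric_multilabel_loss(logits, targets))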
S205: the image cropping dataset, the cropping history dataset and the features of the encoding description parameters associated with the feature extraction module are input into the cropping module to enable the cropping module to determine predicted cropping features, wherein the predicted cropping features are used for determining predicted cropping area information.
The clipping module may be a functional module used for predicting clipping related information in the initial image clipping model.
The prediction of the clipping feature may refer to clipping features predicted based on the clipping module.
That is, in the embodiment of the present disclosure, the image cropping set and the user cropping data may respectively be used as input to learn the parameters of the plurality of encoding modules in the CViT-BERT backbone network (with the adapter parameters frozen); the resulting features are then input in parallel into the aesthetic module, the composition module and the cropping module, the multi-modal features from the aesthetic module and the composition module are fused based on the LAFF algorithm, and the crop box is predicted with adaptive weights, jointly optimized by a box regression loss function and a saliency loss function.
In the disclosed embodiments, the image cropping branches may be composed of an image cropping set and user cropping history data as inputs, a CViT-BERT backbone network, an aesthetic module, a composition module, a cropping module, a frame regression loss function, and a saliency loss function. The method has the main function of predicting the image cutting frame by fusing the multidimensional features. The input data is subjected to feature extraction through a backbone network, an aesthetic module, a composition module and a cutting module, feature fusion is performed by utilizing LAFF, different feature importance is adaptively learned, a cutting frame loss function and a significance loss function are finally accessed, and multi-dimensional perception information is provided for image cutting training by utilizing visual significance, aesthetic grading, composition rules and user cutting history data.
To fuse the multi-modal perceptual features more effectively, the present disclosure introduces a Lightweight Attentional Feature Fusion (LAFF) module in the image cropping task to learn cross-modal combination weights for text and vision. As shown in fig. 4, fig. 4 is a schematic structural diagram of the feature fusion proposed according to the present disclosure: k different features {F1, ..., Fk} with dimensions {d1, ..., dk} (k = 4 in the present disclosure) are first mapped to a common dimension d by linear layers, the individual weights of the k features are obtained with a softmax, and finally the fused feature F is obtained by weighted combination. The LAFF module uses a small number of parameters, which improves computational efficiency, and the attention weights are used directly for a convex combination of the features, so that the aesthetic features, composition features and cropping features can be better self-learned and fused.
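A minimal sketch of this LAFF-style fusion is given below, assuming illustrative input dimensions and a lightweight scoring layer for the attention weights; it shows the projection to a common dimension d, the softmax weights, and the convex combination.

# LAFF-style fusion: project k features to dimension d, weight with softmax, combine.
import torch
import torch.nn as nn

class LAFFFusion(nn.Module):
    def __init__(self, in_dims=(512, 256, 256, 128), d: int = 256):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(di, d) for di in in_dims])  # map to common dim d
        self.score = nn.Linear(d, 1)                                     # lightweight attention

    def forward(self, feats):                          # feats: list of k tensors (B, d_i)
        h = torch.stack([p(f) for p, f in zip(self.proj, feats)], dim=1)   # (B, k, d)
        w = torch.softmax(self.score(torch.tanh(h)), dim=1)                # (B, k, 1) weights
        return (w * h).sum(dim=1)                                          # convex combination (B, d)

feats = [torch.randn(2, di) for di in (512, 256, 256, 128)]
print(LAFFFusion()(feats).shape)                       # torch.Size([2, 256])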
S206: and performing iterative training on the initial image clipping model at least once according to the first loss information, the second loss information and the predicted clipping region information, and taking the image clipping model obtained by the iterative training as a target image clipping model.
That is, in the embodiment of the present disclosure, the initial image cropping model includes the feature extraction module, the aesthetic module, the composition module and the cropping module. After the aesthetic attribute data set, the aesthetic composition data set, the image cropping data set and the cropping history data set are acquired, they may be input into the feature extraction module to obtain the target cropping features output by the feature extraction module, the target cropping features including the aesthetic attribute features related to the aesthetic attribute data set, the aesthetic composition features related to the aesthetic composition data set, and the reference cropping features related to the image cropping data set and the cropping history data set. The aesthetic attribute features, the reference cropping features and the features of the encoding description parameters related to the feature extraction module are input into the aesthetic module so that the aesthetic module determines predicted aesthetic evaluation information, where the aesthetic evaluation information corresponding to the first training image and the predicted aesthetic evaluation information are used to determine the first loss information. The aesthetic composition features, the reference cropping features and the features of the encoding description parameters related to the feature extraction module are input into the composition module so that the composition module determines predicted composition information, where the reference composition information and the predicted composition information are used to determine the second loss information. The image cropping data set, the cropping history data set and the features of the encoding description parameters related to the feature extraction module are input into the cropping module so that the cropping module determines predicted cropping features, where the predicted cropping features are used to determine the predicted cropping region information. The initial image cropping model is then iteratively trained at least once according to the first loss information, the second loss information and the predicted cropping region information, and the image cropping model obtained by the iterative training is taken as the target image cropping model, thereby ensuring the reliability and accuracy of the model training process and of the resulting target image cropping model.
S207: and processing the initial image according to the target image clipping model to obtain target clipping region information corresponding to the initial image, wherein the target clipping region information is used for clipping the initial image to obtain the target image.
The description of S207 may be specifically referred to the above embodiments, and will not be repeated here.
In this embodiment, the aesthetic attribute data set is input into the aesthetic adapter to obtain first output data and into the plurality of coding modules to obtain second output data, the first output data and the second output data together serving as the aesthetic attribute features; the aesthetic composition data set is input into the composition adapter to obtain third output data and into the plurality of coding modules to obtain fourth output data, the third output data and the fourth output data together serving as the aesthetic composition features; and the image cropping data set and the cropping history data set are input into the plurality of coding modules to obtain fifth output data, which serves as the reference cropping feature. In this way, the feature extraction module can effectively extract aesthetic, composition and cropping features, and its feature extraction capability is improved. The aesthetic module extracts attribute semantic information from the aesthetic attribute features, extracts coding semantic information from the features of the coding description parameters related to the feature extraction module, extracts visual feature information from the reference cropping features, and determines the predicted aesthetic evaluation information according to the attribute semantic information, the coding semantic information and the visual feature information, thereby effectively improving the processing of these features by the aesthetic module and ensuring the accuracy of the obtained predicted aesthetic evaluation information. The composition module classifies the reference composition information according to the aesthetic composition features, the reference cropping features and the features of the coding description parameters related to the feature extraction module, and uses the composition information obtained by classification as the predicted composition information, which effectively improves the applicability of the obtained predicted composition information.
Furthermore, the aesthetic attribute data set, the aesthetic composition data set, the image cropping data set and the cropping history data set are input into the feature extraction module to obtain the target cropping features output by the feature extraction module; the aesthetic attribute features, the reference cropping features and the features of the encoding description parameters related to the feature extraction module are input into the aesthetic module to determine the predicted aesthetic evaluation information, which, together with the aesthetic evaluation information of the first training image, is used to determine the first loss information; the aesthetic composition features, the reference cropping features and the features of the encoding description parameters are input into the composition module to determine the predicted composition information, which, together with the reference composition information, is used to determine the second loss information; the image cropping data set, the cropping history data set and the features of the encoding description parameters are input into the cropping module to determine the predicted cropping features used to determine the predicted cropping region information; and the initial image cropping model is iteratively trained at least once according to the first loss information, the second loss information and the predicted cropping region information, the image cropping model obtained by the iterative training being taken as the target image cropping model. This ensures the reliability of the model training process and the image cropping performance of the resulting target image cropping model.
Fig. 5 is a flow chart illustrating an image cropping method based on multi-modal perceptual modeling according to another embodiment of the present disclosure.
As shown in fig. 5, the image cropping method based on multi-modal perceptual modeling includes:
s501: an aesthetic property dataset, an aesthetic composition dataset, an image cropping dataset, and a cropping history dataset are obtained.
S502: inputting the aesthetic attribute data set, the aesthetic composition data set, the image clipping data set and the clipping history data set into a feature extraction module to obtain target clipping features output by the feature extraction module, wherein the target clipping features comprise: aesthetic attribute features associated with the aesthetic attribute dataset, aesthetic composition features associated with the aesthetic composition dataset, reference cropping features associated with the image cropping dataset and the cropping history dataset.
S503: the aesthetic attribute features, the reference cut features, and the features of the coded descriptive parameters associated with the feature extraction module are input into the aesthetic module to cause the aesthetic module to determine predicted aesthetic assessment information, wherein the aesthetic assessment information and the predicted aesthetic assessment information corresponding to the first training image are used to determine the first loss information.
S504: the aesthetic composition features, the reference cropping features, and the features of the encoded descriptive parameters associated with the feature extraction module are input into the composition module to cause the composition module to determine predicted composition information, wherein the reference composition information and the predicted composition information are used to determine second loss information.
S505: the image cropping dataset, the cropping history dataset and the features of the encoding description parameters associated with the feature extraction module are input into the cropping module to enable the cropping module to determine predicted cropping features, wherein the predicted cropping features are used for determining predicted cropping area information.
The descriptions of S501-S505 may be specifically referred to the above embodiments, and are not repeated herein.
S506: and inputting the predicted cutting features, the aesthetic attribute features, the aesthetic composition features and the features of the coding description parameters related to the feature extraction module into a feature fusion module so that the feature fusion module performs feature fusion to obtain target fusion features, wherein the target fusion features are used for determining the predicted cutting region information.
The feature fusion module may be a module that is used to perform feature fusion processing on the predicted cut feature, the aesthetic attribute feature, the aesthetic composition feature, and the feature of the coding description parameter related to the feature extraction module.
The target fusion feature may be a feature obtained after the feature fusion module performs the feature fusion processing. The prediction clipping region information may refer to clipping region information predicted based on the target fusion feature.
That is, in the embodiment of the present disclosure, the initial image cropping model further includes a feature fusion module. After the predicted cropping features, the aesthetic attribute features, the aesthetic composition features and the features of the coding description parameters related to the feature extraction module are obtained, they may be input into the feature fusion module so that the feature fusion module performs feature fusion to obtain the target fusion feature, where the target fusion feature is used to determine the predicted cropping region information. In this way, the predicted cropping features, the aesthetic attribute features, the aesthetic composition features and the features of the coding description parameters can be effectively fused, and the comprehensiveness of the information considered when obtaining the target fusion feature is effectively improved.
S507: and inputting the target fusion characteristic into the self-adaptive weighting module to obtain the self-adaptive weighting characteristic corresponding to the target fusion characteristic output by the self-adaptive weighting module.
The adaptive weighting module may be a module which is preconfigured in the initial image clipping model and is used for performing adaptive weighting processing on the target fusion features. The adaptive weighting feature may refer to feature information obtained by processing the target fusion feature through the adaptive weighting module.
That is, in the embodiment of the present disclosure, an adaptive weighting module may be preconfigured in the initial image clipping model, so as to perform adaptive weighting processing on the target fusion feature, thereby providing reliable data support for subsequently generating the predicted clipping region information.
For example, in the embodiment of the disclosure, the fused features may be up-sampled to obtain a probability heat map H with the same spatial dimensions as the input image. Drawing on pose estimation algorithms, an adaptive weighting module (Adaptive Weighting Regression) derived from them is used to aggregate information over the dense features (the probability heat map H) predicted by the network and to predict a crop box, as shown in fig. 6 (fig. 6 is an adaptive weighting schematic diagram proposed according to the present disclosure). For each original image pixel p_i, the labeled crop box coordinate is Y_box, the predicted offset is O_i, w(p_i) is the weight of pixel p_i after normalization of the heat map H, and n is the number of pixels; the final predicted crop box boundary coordinate is Y'_box = Σ_{i=1}^{n} w(p_i)·(p_i + O_i). Finally, a smooth L1 loss is calculated between the predicted and labeled crop box coordinates:
L_reg = Σ|Y'_box − Y_box|   (Equation 4)
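A rough sketch of this adaptive weighting step is shown below. It assumes that the heat map H is normalized with a softmax over all pixels, that the crop box is regressed as four boundary coordinates, and that the standard smooth L1 loss of PyTorch is used; the tensor shapes and function name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def adaptive_weighting_regression(heatmap, offsets, coords, y_box):
    """
    heatmap: (B, 1, H, W)  dense probability heat map H predicted by the network
    offsets: (B, 4, H, W)  per-pixel offsets O_i towards the 4 crop-box boundaries
    coords:  (B, 4, H, W)  per-pixel coordinates p_i broadcast to the 4 boundaries
    y_box:   (B, 4)        labeled crop-box boundary coordinates Y_box
    """
    b = heatmap.shape[0]
    # w(p_i): normalize the heat map over all pixels so the weights sum to 1.
    w = torch.softmax(heatmap.reshape(b, 1, -1), dim=-1)            # (B, 1, H*W)
    pred = coords.reshape(b, 4, -1) + offsets.reshape(b, 4, -1)     # p_i + O_i
    # Y'_box = sum_i w(p_i) * (p_i + O_i), aggregated over all n pixels.
    y_box_pred = (w * pred).sum(dim=-1)                             # (B, 4)
    # Regression loss between predicted and labeled crop-box coordinates.
    loss_reg = F.smooth_l1_loss(y_box_pred, y_box)
    return y_box_pred, loss_reg
```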
Visual saliency refers to a perceptual property of an image that makes certain objects attract human attention; the most salient region often corresponds to the most important content in the image, and this property can be used to protect the most important subject in the image from being cropped away. The present disclosure designs a new saliency loss function that makes the model more sensitive to salient regions, so that it focuses on salient objects during learning and cropping. Suppose the labeled crop box Y_box has height h and width w. First, the Spectral Residual method is adopted to obtain a saliency score matrix C = {c_{i,j} | 0 < i ≤ h, 0 < j ≤ w} of the crop box Y_box, with values in the range [0, 1]. When the cropping branch predicts the crop box, the predicted cropping region feature F = {f_{i,j} | 0 < i ≤ h', 0 < j ≤ w'} is extracted, the standard deviation of the feature F is calculated as θ_std, and θ_std is substituted into the designed saliency loss function.
For non-salient regions, (1 − c_{i,j}) is larger, so the penalty is greater when θ_std is also large; the opposite holds for salient regions.
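As a hedged sketch only: the saliency score matrix C can be obtained, for example, with the Spectral Residual saliency implementation in opencv-contrib, and one loss form consistent with the description above (non-salient pixels weighted by (1 − c_{i,j}) and scaled by θ_std) is shown below. The exact loss formula of the disclosure is not reproduced here, so this particular form is an assumption for illustration.

```python
import cv2
import numpy as np
import torch

def spectral_residual_saliency(crop_bgr: np.ndarray) -> np.ndarray:
    """Saliency scores c_{i,j} in [0, 1] for a cropped BGR image (requires opencv-contrib)."""
    sal = cv2.saliency.StaticSaliencySpectralResidual_create()
    ok, saliency_map = sal.computeSaliency(crop_bgr)
    assert ok, "saliency computation failed"
    return saliency_map.astype(np.float32)  # shape (h, w), values in [0, 1]

def saliency_loss(region_feat: torch.Tensor, saliency: torch.Tensor) -> torch.Tensor:
    """
    region_feat: (h', w') predicted cropping-region feature F
    saliency:    (h, w)   saliency score matrix C of the labeled crop box
    Assumed form: theta_std * mean(1 - c_{i,j}); the penalty grows when the feature
    standard deviation is high and the crop covers mostly non-salient pixels.
    """
    theta_std = region_feat.std()
    return theta_std * (1.0 - saliency).mean()
```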
S508: and generating prediction clipping region information according to the self-adaptive weighting characteristics.
The prediction clipping region information refers to clipping region information predicted by an initial image clipping model based on self-adaptive weighting characteristics in a model training process.
That is, in the embodiment of the present disclosure, after the target fusion feature is input to the adaptive weighting module to obtain the adaptive weighting feature corresponding to the target fusion feature output by the adaptive weighting module, the prediction clipping region information may be generated according to the adaptive weighting feature.
S509: and determining fourth loss information according to the predicted clipping region information and the marked clipping region information corresponding to the third training image, wherein the fourth loss information is used for performing iterative training on the initial image clipping model at least once.
The fourth loss information may be loss information determined based on the predicted trimming area information and the labeled trimming area information corresponding to the third training image.
That is, in the embodiment of the present disclosure, the initial image cropping model further includes an adaptive weighting module. After the target fusion feature is obtained, it is input into the adaptive weighting module to obtain the adaptive weighting feature corresponding to the target fusion feature output by the adaptive weighting module; the predicted clipping region information is generated according to the adaptive weighting feature, and the fourth loss information is determined according to the predicted clipping region information and the labeled clipping region information corresponding to the third training image, where the fourth loss information is used for performing at least one iterative training on the initial image cropping model. In this way, the fourth loss information can be generated quickly and accurately based on the predicted clipping region information and the labeled clipping region information, providing reliable data support for the iterative training of the model.
For example, in the embodiment of the present disclosure, training may be performed in an end-to-end manner with an Adam optimizer, and the three data sets are respectively input into the corresponding network branches for multi-task serial optimization. In each training iteration, the input data first passes through the backbone network; the aesthetic attribute data then only passes through the aesthetic module, the composition data only passes through the composition module, and the cropping data and the user cropping history data pass through the aesthetic module, the composition module and the cropping module at the same time; finally, the loss function of the corresponding branch is applied for optimization. The optimization objective of the overall network is to minimize the objective function L = a_1·L_mos + a_2·L_cls + a_3·L_reg + (1 − a_1 − a_2 − a_3)·L_sal, where a_1, a_2 and a_3 are balance factors (which can be set, for example, to a_1 = a_2 = 0.2 and a_3 = 0.4), and the network parameters are updated with a gradient accumulation strategy.
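The combined objective and the gradient accumulation update could be expressed, for instance, as in the sketch below; assuming the model returns the four branch losses for a batch, and with an illustrative accumulation step count.

```python
import torch

def total_loss(l_mos, l_cls, l_reg, l_sal, a1=0.2, a2=0.2, a3=0.4):
    # L = a1*L_mos + a2*L_cls + a3*L_reg + (1 - a1 - a2 - a3)*L_sal
    return a1 * l_mos + a2 * l_cls + a3 * l_reg + (1 - a1 - a2 - a3) * l_sal

def train_step(model, optimizer, batches, accum_steps=4):
    """One accumulated update: gradients from several batches are summed before stepping."""
    optimizer.zero_grad()
    for batch in batches[:accum_steps]:
        l_mos, l_cls, l_reg, l_sal = model(batch)           # per-branch losses (assumed API)
        loss = total_loss(l_mos, l_cls, l_reg, l_sal) / accum_steps
        loss.backward()                                      # accumulate gradients
    optimizer.step()                                         # single Adam update
```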
For the test flow (as shown in fig. 7; fig. 7 is a schematic diagram of the clipping inference process proposed according to the present disclosure), the cropping history data of a new user and the image to be cropped are taken as input and passed through the backbone network, then fed in parallel into the aesthetic, composition and cropping modules, feature weighting is performed, a predicted crop box is obtained (no saliency prediction is needed), and the cropping result is output.
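A simplified view of this inference flow, using hypothetical module names as an assumption, might read as follows.

```python
import torch

@torch.no_grad()
def crop_image(model, image, user_history):
    """Inference: no saliency prediction, only the weighted crop-box regression."""
    feats = model.backbone(image, user_history)            # CViT-BERT backbone features
    aes = model.aesthetic_module(feats)                    # aesthetic branch (parallel)
    comp = model.composition_module(feats)                 # composition branch (parallel)
    crop = model.cropping_module(feats)                    # cropping branch (parallel)
    fused = model.fusion_module(crop, aes, comp, feats)    # feature weighting / fusion
    box = model.adaptive_weighting(fused)                  # predicted crop box coordinates
    return box
```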
S510: and performing iterative training on the initial image clipping model at least once according to the first loss information, the second loss information, the third loss information and the fourth loss information, and taking the image clipping model obtained by the iterative training as a target image clipping model.
That is, in the embodiment of the present disclosure, after the fourth loss information is determined, at least one iterative training may be performed on the initial image clipping model according to the first loss information, the second loss information, the third loss information, and the fourth loss information, and the image clipping model obtained by the iterative training is used as the target image clipping model, so that reliability and robustness of the model training process may be effectively improved based on the loss information with multiple dimensions.
S511: and processing the initial image according to the target image clipping model to obtain target clipping region information corresponding to the initial image, wherein the target clipping region information is used for clipping the initial image to obtain the target image.
For details of S511, reference may be made to the above embodiments; they are not repeated here.
In this embodiment, the predicted cropping feature, the aesthetic attribute feature, the aesthetic composition feature and the feature of the coding description parameter related to the feature extraction module are input into the feature fusion module, so that the feature fusion module performs feature fusion to obtain the target fusion feature, where the target fusion feature is used to determine the predicted clipping region information; effective fusion processing of these features can thus be realized, which effectively improves the comprehensiveness of what is considered in the process of obtaining the target fusion feature. The target fusion feature is input into the adaptive weighting module to obtain the adaptive weighting feature corresponding to the target fusion feature output by the adaptive weighting module, the predicted clipping region information is generated according to the adaptive weighting feature, and the fourth loss information is determined according to the predicted clipping region information and the labeled clipping region information corresponding to the third training image, where the fourth loss information is used for performing at least one iterative training on the initial image cropping model; the fourth loss information can thus be generated quickly and accurately based on the predicted clipping region information and the labeled clipping region information, providing reliable data support for the iterative training of the model. At least one iterative training is performed on the initial image cropping model according to the first loss information, the second loss information, the third loss information and the fourth loss information, and the image cropping model obtained by the iterative training is used as the target image cropping model, so that the reliability and robustness of the model training process can be effectively improved based on loss information of multiple dimensions.
For example, as shown in fig. 8, fig. 8 is a schematic structural diagram of the cloud image cropping method based on multi-modal perception modeling according to the present disclosure. A brand-new network structure is designed that fuses visual modality features, language modality features and cloud-disk user cropping record features, and jointly uses visual saliency, aesthetic attributes and composition rules for intelligent image cropping; the complementary information between different modalities improves system performance, and the cropping produced by the model can better fit the cropping preferences of different users.
The backbone network consists of an aesthetic adapter, a composition adapter and a plurality of coding modules, where the coding modules are connected in series. The input data enters the aesthetic adapter, the composition adapter and the first coding module respectively; after the computation, the output features of the last coding module are added to the output features of the aesthetic adapter and the output features of the composition adapter.
In the self-constructed CViT-BERT backbone network, the input and the output of the previous coding module are added and then used as the input of the next coding module, so that features are transferred effectively, the training speed is increased, and the degradation problem of deep networks is alleviated. In addition, in order to better preserve task-specific (aesthetic/composition) features and avoid interference from other tasks, an aesthetic adapter and a composition adapter are designed in the network; each adapter consists of convolution and pooling layers. The aesthetic adapter parameters are updated during aesthetic task training and are frozen during the other tasks, in which case the adapter only outputs features; the same applies to the composition adapter.
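A minimal sketch of such an adapter, and of freezing its parameters during the other tasks, is given below; the layer sizes and the pooling choice are assumptions for illustration.

```python
import torch.nn as nn

class TaskAdapter(nn.Module):
    """Convolution + pooling adapter used to preserve task-specific (aesthetic/composition) features."""

    def __init__(self, channels: int = 256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),   # pool to a single feature vector per channel
        )

    def forward(self, x):
        return self.body(x).flatten(1)

def set_adapter_trainable(adapter: nn.Module, trainable: bool) -> None:
    # Update adapter parameters only during its own task; freeze them otherwise,
    # in which case the adapter only outputs features.
    for p in adapter.parameters():
        p.requires_grad = trainable
```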
Fig. 9 is a schematic structural diagram of an image cropping device based on multi-modal perceptual modeling according to an embodiment of the present disclosure.
As shown in fig. 9, the image cropping device 90 based on the multi-modal perceptual modeling includes:
an acquisition module 901 for acquiring an aesthetic property dataset, an aesthetic composition dataset, an image cropping dataset, and a cropping history dataset;
a model training module 902 for training an initial image cropping model according to the aesthetic property dataset, the aesthetic composition dataset, the image cropping dataset, and the cropping history dataset to obtain a target image cropping model; and
the processing module 903 is configured to process the initial image according to the target image cropping model to obtain target cropping area information corresponding to the initial image, where the target cropping area information is used to crop the initial image to obtain the target image.
It should be noted that the foregoing explanation of the image clipping method based on the multi-modal sensing modeling is also applicable to the image clipping device based on the multi-modal sensing modeling in this embodiment, and will not be repeated here.
In this embodiment, the aesthetic attribute data set, the aesthetic composition data set, the image cropping data set and the cropping history data set are acquired; the initial image cropping model is trained according to the aesthetic attribute data set, the aesthetic composition data set, the image cropping data set and the cropping history data set to obtain the target image cropping model; and the initial image is processed according to the target image cropping model to obtain the target clipping region information corresponding to the initial image, where the target clipping region information is used for cropping the initial image to obtain the target image. In this way, the model training effect of the initial image cropping model can be effectively improved based on multi-modal data, the image cropping performance of the obtained target image cropping model is improved, and the aesthetics of the picture cropped by the target image cropping model can be ensured to meet the personalized requirements of users.
FIG. 10 illustrates a block diagram of an exemplary computer device suitable for use in implementing embodiments of the present disclosure. The computer device 12 shown in fig. 10 is merely an example and should not be construed as limiting the functionality and scope of use of the disclosed embodiments.
As shown in FIG. 10, the computer device 12 is in the form of a general purpose computing device. Components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, a bus 18 that connects the various system components, including the system memory 28 and the processing units 16.
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (hereinafter ISA) bus, the Micro Channel Architecture (hereinafter MCA) bus, the enhanced ISA bus, the Video Electronics Standards Association (hereinafter VESA) local bus, and the Peripheral Component Interconnect (hereinafter PCI) bus.
Computer device 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (Random Access Memory; hereinafter: RAM) 30 and/or cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 10, commonly referred to as a "hard disk drive").
Although not shown in fig. 10, a magnetic disk drive for reading from and writing to a removable nonvolatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable nonvolatile optical disk (e.g., a compact disk read only memory (Compact Disc Read Only Memory; hereinafter CD-ROM), digital versatile read only optical disk (Digital Video Disc Read Only Memory; hereinafter DVD-ROM), or other optical media) may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the various embodiments of the disclosure.
A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods in the embodiments described in this disclosure.
The computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), one or more devices that enable a person to interact with the computer device 12, and/or any devices (e.g., network card, modem, etc.) that enable the computer device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Moreover, the computer device 12 may also communicate with one or more networks such as a local area network (Local Area Network; hereinafter LAN), a wide area network (Wide Area Network; hereinafter WAN) and/or a public network such as the Internet via the network adapter 20. As shown, network adapter 20 communicates with other modules of computer device 12 via bus 18. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with computer device 12, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example, implementing the image cropping method based on multimodal perception modeling mentioned in the foregoing embodiment.
To achieve the above embodiments, the present disclosure also proposes a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements an image cropping method based on multimodal perceptual modeling as proposed in the previous embodiments of the present disclosure.
To achieve the above embodiments, the present disclosure also proposes a computer program product which, when executed by an instruction processor in the computer program product, performs an image cropping method based on multimodal perceptual modeling as proposed in the previous embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any adaptations, uses, or adaptations of the disclosure following the general principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
It should be noted that in the description of the present disclosure, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Furthermore, in the description of the present disclosure, unless otherwise indicated, the meaning of "a plurality" is two or more.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and further implementations are included within the scope of the preferred embodiment of the present disclosure in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present disclosure.
It should be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
Furthermore, each functional unit in the embodiments of the present disclosure may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present disclosure. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present disclosure have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the present disclosure, and that variations, modifications, alternatives, and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the present disclosure.

Claims (16)

1. An image clipping method based on multi-modal perceptual modeling, the method comprising:
acquiring an aesthetic attribute data set, an aesthetic composition data set, an image clipping data set and a clipping history data set;
training an initial image cropping model according to the aesthetic attribute data set, the aesthetic composition data set, the image cropping data set and the cropping history data set to obtain a target image cropping model; and
and processing the initial image according to the target image clipping model to obtain target clipping region information corresponding to the initial image, wherein the target clipping region information is used for clipping the initial image to obtain a target image.
2. The method of claim 1, wherein the aesthetic property dataset comprises: a first training image and a first test image corresponding to the first training image, the first training image having corresponding aesthetic evaluation information for evaluating aesthetic dimensions of the first training image and aesthetic attribute information comprising: at least one of a content of interest attribute, an image subject attribute, and an image illumination attribute.
3. The method of claim 1, wherein the aesthetic composition dataset comprises: a second training image and a second test image corresponding to the second training image, the second training image having corresponding reference composition information.
4. The method of claim 1, wherein the image cropping dataset comprises: the system comprises a third training image and a third test image corresponding to the third training image, wherein the third training image is provided with corresponding label clipping area information.
5. The method of claim 1, wherein the clipping history data set comprises: a pair of cropped images and a reference crop feature corresponding to the pair of cropped images; wherein the cropped image pair comprises: a pre-cut image and a post-cut image; the cropping image features include: at least one item of clipping proportion information, image editing times and behavior characteristic information corresponding to clipping behaviors of a user.
6. The method of any of claims 1-5, wherein the initial image cropping model comprises: a feature extraction module, an aesthetic module, a composition module and a cropping module;
wherein said training an initial image cropping model based on said aesthetic property dataset, said aesthetic composition dataset, said image cropping dataset, and said cropping history dataset to obtain a target image cropping model comprises:
inputting the aesthetic attribute data set, the aesthetic composition data set, the image clipping data set and the clipping history data set into the feature extraction module to obtain target clipping features output by the feature extraction module, wherein the target clipping features comprise: aesthetic attribute features associated with the aesthetic attribute data set, aesthetic composition features associated with the aesthetic composition data set, and reference cropping features associated with the image cropping data set and the cropping history data set;
inputting the aesthetic attribute features, the reference cropping features, and features of the coded descriptive parameters associated with the feature extraction module into the aesthetic module to cause the aesthetic module to determine predicted aesthetic evaluation information, wherein aesthetic evaluation information corresponding to the first training image and the predicted aesthetic evaluation information are used to determine first loss information;
inputting the aesthetic composition characteristics, the reference clipping characteristics and characteristics of coding description parameters related to the characteristic extraction module into the composition module so as to enable the composition module to determine predicted composition information, wherein the reference composition information and the predicted composition information are used for determining second loss information;
inputting the image cropping data set, the cropping history data set and the characteristics of the coding description parameters related to the characteristic extraction module into the cropping module so as to enable the cropping module to determine predicted cropping characteristics, wherein the predicted cropping characteristics are used for determining predicted cropping area information;
and performing iterative training on the initial image clipping model at least once according to the first loss information, the second loss information and the predicted clipping region information, and taking the image clipping model obtained by the iterative training as a target image clipping model.
7. The method of claim 6, wherein the initial image cropping model further comprises: the feature fusion module is respectively connected with the feature extraction module, the aesthetic module, the composition module and the cutting module; the method further comprises the steps of:
and inputting the predicted cutting features, the aesthetic attribute features, the aesthetic composition features and the features of the coding description parameters related to the feature extraction module into the feature fusion module so as to enable the feature fusion module to perform feature fusion to obtain target fusion features, wherein the target fusion features are used for determining predicted cutting region information.
8. The method of claim 7, wherein the initial image cropping model further comprises: an adaptive weighting module; the method further comprises the steps of:
inputting the target fusion characteristic into the self-adaptive weighting module to obtain the self-adaptive weighting characteristic corresponding to the target fusion characteristic output by the self-adaptive weighting module;
generating the prediction clipping region information according to the self-adaptive weighting characteristics;
and determining fourth loss information according to the predicted clipping region information and the marked clipping region information corresponding to the third training image, wherein the fourth loss information is used for performing iterative training on the initial image clipping model at least once.
9. The method of claim 8, wherein the performing at least one iterative training on the initial image cropping model according to the first loss information, the second loss information, and the predicted cropping zone information, and taking an image cropping model obtained by the iterative training as a target image cropping model comprises:
and performing iterative training on the initial image clipping model at least once according to the first loss information, the second loss information, the third loss information and the fourth loss information, and taking the image clipping model obtained by iterative training as a target image clipping model.
10. The method of claim 6, wherein the aesthetic module is configured to extract attribute semantic information from the aesthetic attribute features and code semantic information from features of code description parameters associated with the feature extraction module, extract visual feature information from the reference cut features, and determine the predicted aesthetic evaluation information based on the attribute semantic information, the code semantic information, and the visual feature information.
11. The method of claim 6, wherein the composition module is configured to categorize the reference composition information based on the aesthetic composition features, the reference cropping features, and the features of the encoding description parameters associated with the feature extraction module, and to treat the categorized composition information as the predicted composition information.
12. The method of claim 6, wherein the feature extraction module comprises: the system comprises an aesthetic adapter, a composition adapter and a plurality of coding modules, wherein the coding modules are connected in series, and input data and output data of the coding modules are spliced and then used as input data of the next coding module;
The inputting the aesthetic attribute data set, the aesthetic composition data set, the image clipping data set and the clipping history data set into the feature extraction module to obtain the target clipping feature output by the feature extraction module comprises the following steps:
inputting the aesthetic attribute data set into the aesthetic adapter to obtain first output data, and inputting the aesthetic attribute data set into the plurality of coding modules to obtain second output data, wherein the first output data and the second output data are used together as the aesthetic attribute characteristics;
inputting the aesthetic composition data set into the composition adapter to obtain third output data, and inputting the aesthetic composition data set into the plurality of coding modules to obtain fourth output data, wherein the third output data and the fourth output data are used together as the aesthetic composition characteristics;
and inputting the image clipping data set and the clipping history data set into the plurality of encoding modules to obtain fifth output data, wherein the fifth output data is used as the reference clipping feature.
13. An image cropping device based on multi-modal perceptual modeling, the device comprising:
the acquisition module is used for acquiring an aesthetic attribute data set, an aesthetic composition data set, an image clipping data set and a clipping history data set;
the model training module is used for training an initial image clipping model according to the aesthetic attribute data set, the aesthetic composition data set, the image clipping data set and the clipping history data set so as to obtain a target image clipping model; and
the processing module is used for processing the initial image according to the target image clipping model to obtain target clipping region information corresponding to the initial image, wherein the target clipping region information is used for clipping the initial image to obtain a target image.
14. A computer device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-12.
15. A non-transitory computer readable storage medium storing computer instructions, wherein the computer instructions are for causing the computer to perform the method of any one of claims 1-12.
16. A computer program product comprising a computer program which, when executed by a processor, implements the steps of the method according to any one of claims 1-12.
CN202310991234.6A 2023-08-07 2023-08-07 Image clipping method, device and equipment based on multi-mode perception modeling Pending CN117152409A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310991234.6A CN117152409A (en) 2023-08-07 2023-08-07 Image clipping method, device and equipment based on multi-mode perception modeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310991234.6A CN117152409A (en) 2023-08-07 2023-08-07 Image clipping method, device and equipment based on multi-mode perception modeling

Publications (1)

Publication Number Publication Date
CN117152409A true CN117152409A (en) 2023-12-01

Family

ID=88903492

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310991234.6A Pending CN117152409A (en) 2023-08-07 2023-08-07 Image clipping method, device and equipment based on multi-mode perception modeling

Country Status (1)

Country Link
CN (1) CN117152409A (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109146892A (en) * 2018-07-23 2019-01-04 北京邮电大学 A kind of image cropping method and device based on aesthetics
CN110147833A (en) * 2019-05-09 2019-08-20 北京迈格威科技有限公司 Facial image processing method, apparatus, system and readable storage medium storing program for executing
CN110796663A (en) * 2019-09-17 2020-02-14 北京迈格威科技有限公司 Picture clipping method, device, equipment and storage medium
WO2021092808A1 (en) * 2019-11-13 2021-05-20 深圳市欢太科技有限公司 Network model training method, image processing method and device, and electronic device
CN111507941A (en) * 2020-03-24 2020-08-07 杭州电子科技大学 Composition characterization learning method for aesthetic quality evaluation
CN111696112A (en) * 2020-06-15 2020-09-22 携程计算机技术(上海)有限公司 Automatic image cutting method and system, electronic equipment and storage medium
CN114973347A (en) * 2021-04-22 2022-08-30 中移互联网有限公司 Living body detection method, device and equipment
WO2023093851A1 (en) * 2021-11-29 2023-06-01 维沃移动通信有限公司 Image cropping method and apparatus, and electronic device
CN114529558A (en) * 2022-02-09 2022-05-24 维沃移动通信有限公司 Image processing method and device, electronic equipment and readable storage medium
CN116309627A (en) * 2022-12-15 2023-06-23 北京航空航天大学 Image cropping method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杨文雅; 宋广乐; 崔超然; 尹义龙: "Image aesthetic quality assessment method based on semantic perception" (基于语义感知的图像美学质量评估方法), 计算机应用 (Computer Applications), no. 11, 19 July 2018 (2018-07-19) *

Similar Documents

Publication Publication Date Title
US20200167558A1 (en) Semantic page segmentation of vector graphics documents
US10664999B2 (en) Saliency prediction for a mobile user interface
CN110533097B (en) Image definition recognition method and device, electronic equipment and storage medium
US10255681B2 (en) Image matting using deep learning
CN114155543B (en) Neural network training method, document image understanding method, device and equipment
CN110737783B (en) Method and device for recommending multimedia content and computing equipment
CN108171260B (en) Picture identification method and system
WO2021139191A1 (en) Method for data labeling and apparatus for data labeling
US20200288204A1 (en) Generating and providing personalized digital content in real time based on live user context
US20200143000A1 (en) Customized display of emotionally filtered social media content
CN114730486B (en) Method and system for generating training data for object detection
CN114330588A (en) Picture classification method, picture classification model training method and related device
CN111126243B (en) Image data detection method and device and computer readable storage medium
US20190171745A1 (en) Open ended question identification for investigations
CN114529552A (en) Remote sensing image building segmentation method based on geometric contour vertex prediction
CN110781817B (en) Pedestrian re-identification method for solving component misalignment
CN114241411B (en) Counting model processing method and device based on target detection and computer equipment
US11615618B2 (en) Automatic image annotations
CN117152409A (en) Image clipping method, device and equipment based on multi-mode perception modeling
CN117795551A (en) Method and system for automatically capturing and processing user images
JP5413156B2 (en) Image processing program and image processing apparatus
CN113920377A (en) Method of classifying image, computer device, and storage medium
CN113590918A (en) Social media public opinion popularity monitoring framework construction method based on curriculum-based learning
CN114494693B (en) Method and device for carrying out semantic segmentation on image
CN112668582B (en) Image recognition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination