CN113869099A - Image processing method and device, electronic equipment and storage medium - Google Patents

Image processing method and device, electronic equipment and storage medium

Info

Publication number
CN113869099A
Authority
CN
China
Prior art keywords
predicate
visual
initial
training
visual relationship
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110693496.5A
Other languages
Chinese (zh)
Inventor
徐路
郭昱宇
高联丽
陈敏
王浩宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
University of Electronic Science and Technology of China
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China and Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110693496.5A
Publication of CN113869099A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to an image processing method and apparatus, an electronic device, and a storage medium. The method includes: inputting an image to be processed into an image detection model for object detection to obtain object detection information corresponding to at least two objects in the image to be processed; inputting the object detection information into a visual relationship detection model for visual relationship detection to obtain a visual relationship between every two objects, the visual relationship being obtained by adjusting, through the visual relationship detection model, the amount of semantic information corresponding to the visual relationship; and inputting the visual relationship into a scene graph generation model for scene graph generation to obtain a target scene graph corresponding to the image to be processed. Because the visual relationship between every two objects is detected based on the visual relationship detection model, the accuracy of visual relationship detection can be improved.

Description

Image processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to an image processing method and apparatus, an electronic device, and a storage medium.
Background
A scene graph labeled with visual relationships can be generated through visual relationship detection. As a structural representation of image content, such a scene graph serves as a bridge between computer vision and natural language. After the scene graph labeled with visual relationships is generated, the visual relationship triples formed by subjects, predicates, and objects in the image can be read off from the scene graph.
In the related art, when visual relationship detection is performed on an image to be processed, the detected visual relationships are easily confused, which reduces the accuracy of visual relationship detection and thus the usefulness of the scene graph labeled with visual relationships.
Disclosure of Invention
The present disclosure provides an image processing method, an image processing apparatus, an electronic device, and a storage medium, to at least solve the problems in the related art of low accuracy of visual relationship detection and low usefulness of a scene graph labeled with visual relationships. The technical solution of the disclosure is as follows:
according to a first aspect of embodiments of the present disclosure, there is provided an image processing method, the method comprising:
inputting an image to be processed into an image detection model for object detection to obtain object detection information corresponding to at least two objects in the image to be processed respectively;
inputting the object detection information into a visual relation detection model for visual relation detection to obtain a visual relation between every two objects, wherein the visual relation represents an interactive relation between every two objects in the image to be processed;
and inputting the visual relationship and the object detection information corresponding to the visual relationship into a scene graph generation model for scene graph generation to obtain a target scene graph corresponding to the image to be processed, wherein the target scene graph is the structural information marked with the visual relationship between every two objects.
As an optional embodiment, the visual relationship detection model includes a predicate identification network, and the inputting the object detection information into the visual relationship detection model for visual relationship detection to obtain the visual relationship between two objects includes:
inputting the object detection information into the predicate identification network to identify predicates corresponding to predicate relations between every two objects to obtain a target predicate, wherein the target predicate represents the predicates after semantic adjustment;
and obtaining the visual relationship according to the target predicate and the object corresponding to the target predicate.
As an optional embodiment, the predicate identification network includes an initial relevance calculation layer and a semantic adjustment layer, and the inputting the object detection information into the predicate identification network to perform predicate identification between two objects to obtain a target predicate includes:
inputting the object detection information and preset predicates into the initial relevance calculation layer, and performing relevance calculation on predicates corresponding to every two pieces of object detection information and each preset predicate to obtain initial relevance distribution information, wherein the initial relevance distribution information represents the relevance between the predicates corresponding to the two pieces of object detection information and each preset predicate before semantic adjustment;
inputting the initial relevancy distribution information into a semantic adjustment layer, performing predicate semantic adjustment on the initial relevancy distribution information based on the preset matrix to obtain target relevancy distribution information, wherein the target relevancy distribution information represents the relevancy between a predicate corresponding to the two pairs of object detection information after semantic adjustment and each preset predicate;
and determining the target predicate according to the target relevance distribution information.
As an optional embodiment, the inputting the initial relevancy distribution information into a semantic adjustment layer, and performing predicate semantic adjustment on the initial relevancy distribution information based on the preset matrix to obtain the target relevancy distribution information includes:
determining an initial predicate according to the initial correlation degree distribution information;
carrying out predicate semantic adjustment on the initial relevancy distribution information based on a semantic adjustment matrix in the preset matrix under the condition that the initial predicate is a general predicate, wherein the general predicate represents a predicate of which the use probability is greater than a preset threshold in the preset predicate;
and under the condition that the initial predicate is a non-universal predicate, determining the initial relevancy distribution information as the target relevancy distribution information based on a semantic keeping matrix in the preset matrix, wherein the non-universal predicate characterizes the predicate of which the use probability is smaller than a preset threshold value in the preset predicate.
As an optional embodiment, the method further comprises:
inputting an annotation image into the image detection model for object detection to obtain training object detection information corresponding to each object in the annotation image, wherein the annotation image is labeled with a reference visual relationship between every two objects;
inputting the training object detection information into a first model to be trained for visual relationship detection to obtain a first training visual relationship between every two objects, wherein the first training visual relationship represents an interactive relationship between every two objects in the labeled image obtained through the first model to be trained;
inputting the first training visual relationship and training object detection information corresponding to the first training visual relationship into a second model to be trained for scene graph generation to obtain a first training scene graph corresponding to the labeled image, wherein the first training scene graph is structural information labeled with the first training visual relationship between every two objects;
and training the first model to be trained and the second model to be trained according to the first training visual relationship and the reference visual relationship to obtain a first visual relationship detection model and an initial scene graph generation model.
As an optional embodiment, after the training the first model to be trained and the second model to be trained according to the training visual relationship and the reference visual relationship to obtain the first visual relationship detection model and the initial scene graph generation model, the method further includes:
detecting the word frequency information corresponding to each reference predicate in the reference visual relationship;
classifying the reference predicates according to preset word frequency segmentation information and the word frequency information to obtain the type of the reference predicates corresponding to each labeled image;
combining the first visual relation detection model with a preset matrix to obtain a second visual relation detection model;
inputting the training object detection information into the second visual relation detection model for visual relation detection to obtain a second training visual relation between every two objects, wherein the second training visual relation represents the interactive relation between every two objects in the labeled image under the condition that a preset matrix exists;
inputting the training object detection information corresponding to the second training visual relationship and the first training visual relationship into the initial scene graph generation model for scene graph generation to obtain a second training scene graph corresponding to the labeled image, wherein the second training scene graph is marked with the structural information of the second training visual relationship between every two objects;
and adjusting the second visual relation detection model and the initial scene graph generation model based on the reference predicate type, the second training visual relation and the reference visual relation corresponding to each labeled image to obtain the visual relation detection model and the scene graph generation model.
As an alternative embodiment, the method comprises:
inputting the training object detection information into the first visual relation detection model for visual relation detection to obtain an initial visual relation between every two objects, wherein the initial visual relation represents an interactive relation between every two objects in the labeled image obtained through the first visual relation detection model;
inputting training object detection information corresponding to the initial visual relationship and the first training visual relationship into the initial scene graph generation model for scene graph generation to obtain an initial scene graph corresponding to the labeled image, wherein the initial scene graph is the structural information labeled with the initial visual relationship between every two objects;
determining an initial matrix according to a predicate in the initial visual relationship and a reference predicate in the reference visual relationship;
and obtaining a preset matrix according to the normalization matrix corresponding to the initial matrix and the identity matrix corresponding to the initial matrix.
According to a second aspect of the embodiments of the present disclosure, there is provided an image processing apparatus, the apparatus comprising:
the object detection module is configured to input an image to be processed into an image detection model for object detection to obtain object detection information corresponding to at least two objects in the image to be processed;
the visual relationship detection module is configured to input the object detection information into a visual relationship detection model for visual relationship detection to obtain a visual relationship between every two objects, and the visual relationship represents an interactive relationship between every two objects in the image to be processed;
and the scene graph generation module is configured to execute inputting the visual relationship and the object detection information corresponding to the visual relationship into a scene graph generation model for scene graph generation, so as to obtain a target scene graph corresponding to the image to be processed, wherein the target scene graph is the structural information marked with the visual relationship between every two objects.
As an optional embodiment, the visual relationship detection model includes a predicate identification network and a visual relationship determination network, and the visual relationship detection module includes:
the predicate identification unit is configured to input the object detection information into the predicate identification network to perform predicate identification corresponding to a predicate relation between every two objects to obtain a target predicate, and the target predicate represents a predicate after semantic adjustment;
and the visual relation determining unit is configured to obtain the visual relationship according to the target predicate and the object corresponding to the target predicate.
As an optional embodiment, the predicate identification network includes an initial relevance calculation layer and a semantic adjustment layer, and the predicate identification unit includes:
the initial correlation degree calculation unit is configured to input the object detection information and preset predicates into the initial correlation degree calculation layer, perform correlation degree calculation on predicates corresponding to every two pieces of object detection information and each preset predicate to obtain initial correlation degree distribution information, and the initial correlation degree distribution information represents the correlation degree between the predicates corresponding to the every two pieces of object detection information and each preset predicate before semantic adjustment;
the semantic adjusting unit is configured to input the initial relevance distribution information into a semantic adjusting layer, perform predicate semantic adjustment on the initial relevance distribution information based on the preset matrix to obtain target relevance distribution information, wherein the target relevance distribution information represents a predicate corresponding to the two pairs of object detection information after semantic adjustment and a relevance between each preset predicate;
and the target predicate determination unit is configured to determine the target predicate according to the target relevance distribution information.
As an optional embodiment, the semantic adjusting unit includes:
an initial predicate determination unit configured to determine an initial predicate according to the initial relevance distribution information;
a first semantic adjusting unit, configured to perform predicate semantic adjustment on the initial relevancy distribution information based on a semantic adjusting matrix in the preset matrix if the initial predicate is a general predicate, where the general predicate characterizes a predicate in the preset predicate whose usage probability is greater than a preset threshold;
and the second semantic adjusting unit is configured to determine the initial relevancy distribution information as the target relevancy distribution information based on a semantic keeping matrix in the preset matrix when the initial predicate is a non-universal predicate, wherein the non-universal predicate characterizes a predicate of which the use probability is smaller than a preset threshold value in the preset predicate.
As an optional embodiment, the apparatus further comprises:
the first training feature extraction module is configured to input an annotation image into the image detection model for feature extraction, so as to obtain training object detection information corresponding to each object in the annotation image, wherein the annotation image is labeled with a reference visual relationship between every two objects;
the first training visual relationship detection module is configured to input the training object detection information into a first model to be trained for visual relationship detection to obtain a first training visual relationship between every two objects, and the first training visual relationship represents an interaction relationship between every two objects in the labeled image obtained through the first model to be trained;
a first training scene graph generation module configured to perform scene graph generation by inputting the first training visual relationship and training object detection information corresponding to the first training visual relationship into a second model to be trained, so as to obtain a first training scene graph corresponding to the labeled image, where the first training scene graph is structural information labeled with the first training visual relationship between the two objects;
and the model training module is configured to train the first model to be trained and the second model to be trained according to the training visual relationship and the reference visual relationship to obtain a first visual relationship detection model and an initial scene graph generation model.
As an optional embodiment, the apparatus further comprises:
the word frequency information detection module is configured to detect the word frequency information corresponding to each reference predicate in the reference visual relationship;
a second visual relation detection model obtaining module configured to perform combining the first visual relation detection model and a preset matrix to obtain a second visual relation detection model;
a second training visual relationship obtaining module configured to perform visual relationship detection by inputting the training object detection information into the second visual relationship detection model to obtain a second training visual relationship between the two objects, where the second training visual relationship represents an interaction relationship between the two objects in the labeled image obtained by the second visual relationship detection model;
a second training scene graph obtaining module configured to perform scene graph generation by inputting the second training visual relationship and training object detection information corresponding to the first training visual relationship into the initial scene graph generation model, to obtain a second training scene graph corresponding to the labeled image, where the second training scene graph is structure information labeled with the second training visual relationship between the two objects;
and the model adjusting module is configured to perform adjustment on the second visual relationship detection model and the initial scene graph generation model based on the reference predicate type, the second training visual relationship and the reference visual relationship corresponding to each labeled image, so as to obtain the visual relationship detection model and the scene graph generation model.
As an optional embodiment, the apparatus further comprises:
the initial visual relationship detection module is configured to input the object detection information into the first visual relationship detection model for visual relationship detection to obtain an initial visual relationship between every two objects, and the initial visual relationship represents an interaction relationship between every two objects in the annotation image obtained through the first visual relationship detection model;
a scene initial graph generation module configured to perform scene graph generation by inputting the initial visual relationship and training object detection information corresponding to the first training visual relationship into the initial scene graph generation model, so as to obtain an initial scene graph corresponding to the labeled image, where the initial scene graph is structural information labeled with the initial visual relationship between the two objects;
an initial matrix determination module configured to perform determining an initial matrix from a predicate in the initial visual relationship and a reference predicate in the reference visual relationship;
and the preset matrix determining module is configured to obtain a preset matrix according to the normalization matrix corresponding to the initial matrix and the identity matrix corresponding to the initial matrix.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the image processing method described above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions, when executed by a processor of an electronic device, enable the electronic device to perform the image processing method as described above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by a processor, implement the image processing method described above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
An image to be processed is input into an image detection model for object detection to obtain object detection information corresponding to at least two objects in the image to be processed; the object detection information is input into a visual relationship detection model for visual relationship detection to obtain a visual relationship between every two objects, the visual relationship being obtained by adjusting, through the visual relationship detection model, the amount of semantic information corresponding to the visual relationship; and the visual relationship is input into a scene graph generation model for scene graph generation to obtain a target scene graph corresponding to the image to be processed. Because the visual relationship between every two objects is detected based on the visual relationship detection model, the accuracy of visual relationship detection can be improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a schematic diagram illustrating an application scenario of an image processing method according to an exemplary embodiment.
FIG. 2 is a flow diagram illustrating an image processing method according to an exemplary embodiment.
FIG. 3 is a flow diagram illustrating predicate identification in an image processing method according to an example embodiment.
FIG. 4 is a flow diagram illustrating predicate semantic adjustment in an image processing method according to an example embodiment.
FIG. 5 is a flow diagram illustrating model training in an image processing method according to an exemplary embodiment.
FIG. 6 is a flow diagram illustrating adaptation of a trained model in a method of image processing according to an example embodiment.
Fig. 7 is a diagram illustrating migration learning in an image processing method according to an exemplary embodiment.
FIG. 8 is a diagram illustrating parameter fixing during model training in an image processing method according to an exemplary embodiment.
Fig. 9 is a schematic diagram illustrating a process of inputting a picture to be processed and generating a target scene graph in an image processing method according to an exemplary embodiment.
Fig. 10 is a block diagram illustrating an image processing apparatus according to an exemplary embodiment.
FIG. 11 is a block diagram illustrating a server-side electronic device in accordance with an exemplary embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Fig. 1 is a schematic diagram illustrating an application scenario of an image processing method according to an exemplary embodiment. As shown in Fig. 1, the application scenario includes a client 110 and a server 120. The client 110 acquires an image to be processed, and the server 120 receives the image to be processed sent from the client 110. The server 120 inputs the image to be processed into an image detection model for object detection to obtain object detection information corresponding to at least two objects in the image to be processed. The server 120 inputs the object detection information into the visual relationship detection model for visual relationship detection to obtain a visual relationship between every two objects, and inputs the visual relationship and the object detection information corresponding to the visual relationship into the scene graph generation model for scene graph generation to obtain a target scene graph corresponding to the image to be processed. The server 120 sends the target scene graph to the client 110 for display.
In the embodiment of the present disclosure, the client 110 includes physical devices such as a smartphone, a desktop computer, a tablet computer, a notebook computer, a digital assistant, and a smart wearable device, and may also include software running on the physical device, such as an application program. The operating system running on the physical device in the embodiment of the present application may include, but is not limited to, an Android system, an iOS system, Linux, Unix, Windows, and the like. The client 110 includes a User Interface (UI) layer; through the UI layer, the client 110 displays the target scene graph and acquires the image to be processed, and it sends the image to be processed to the server 120 based on an Application Programming Interface (API).
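Purely as an illustrative sketch of this client/server interaction (the endpoint URL, payload format, and response structure below are invented for the example and are not part of the disclosure):

```python
# Illustration of the client/server split described above. The endpoint URL and
# payload format are invented for this example only.
import requests

def request_scene_graph(image_path: str,
                        server_url: str = "http://example.com/api/scene_graph"):
    """Send an image to the server and return its response (assumed to contain
    the target scene graph triples)."""
    with open(image_path, "rb") as f:
        response = requests.post(server_url, files={"image": f}, timeout=30)
    response.raise_for_status()
    return response.json()
```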
In the disclosed embodiment, the server 120 may include a server operating independently, or a distributed server, or a server cluster composed of a plurality of servers. The server 120 may include a network communication unit, a processor, a memory, and the like. Specifically, the server 120 may be configured to publish the published content through a publishing channel, mark a channel tracking code on the publishing channel by using a tracking code application interface, an identification parameter, or a fixed tracking code, and receive log data fed back by the user terminal.
In the embodiment of the present disclosure, the server 120 may perform visual relationship detection on the object detection information based on visual relationship detection techniques. Visual relationship detection combines images with semantics: it needs to identify not only the objects and their positions in the image, but also the relationships between the objects. A visual relationship is defined as a pair of objects connected by a predicate, usually expressed in the subject-predicate-object form, and can be used to describe the interaction between two objects. Visual relationship detection is the basis of image understanding and can be applied to object detection, image description, visual question answering, image retrieval, and the like.
Fig. 2 is a flowchart illustrating an image processing method, as shown in fig. 2, for use in a server, according to an exemplary embodiment, including the following steps.
S210, inputting the image to be processed into an image detection model for object detection to obtain object detection information corresponding to at least two objects in the image to be processed respectively;
as an optional embodiment, in the image detection model, each object in the image to be processed is detected according to a preset labeling frame, and the detection area of each object in the image to be processed is extracted. Feature information corresponding to each object is extracted from the detection area corresponding to that object, and the object in the image to be processed is determined according to the feature information corresponding to each object, thereby obtaining the object detection information. The image detection model can be any of various image detection models such as a Faster R-CNN model, a Fast R-CNN model, or an R-CNN model. When the object detection information is input into the visual relationship detection model for visual relationship detection, the visual relationship corresponding to each pair of object detection information can be detected.
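As an illustrative sketch of the object detection step, assuming an off-the-shelf Faster R-CNN from torchvision stands in for the image detection model (the score threshold and the structure of the returned detection records are assumptions for demonstration):

```python
# Illustrative sketch only: obtaining object detection information with an
# off-the-shelf torchvision Faster R-CNN. The score threshold and the structure
# of the returned detection records are assumptions.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# torchvision >= 0.13; older versions use pretrained=True instead of weights=
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def detect_objects(image_path: str, score_threshold: float = 0.5):
    """Return object detection information: label, confidence score, bounding box."""
    image = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        output = detector([image])[0]      # dict with "boxes", "labels", "scores"
    detections = []
    for box, label, score in zip(output["boxes"], output["labels"], output["scores"]):
        if float(score) >= score_threshold:
            detections.append({"label": int(label),
                               "score": float(score),
                               "box": [float(v) for v in box]})
    return detections
```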
As an alternative embodiment, the combined detection information may also be obtained based on the image detection model. The image detection model can detect two objects in the image to be processed, extract detection areas of the two objects in the image to be processed, extract joint feature information corresponding to the two objects from the detection areas corresponding to the two objects, and determine the two objects in the image to be processed according to the joint feature information, so that combined detection information is obtained. The combined detection information includes two object detection information having an interaction relationship in the image to be processed. When the combined detection information is input into the visual relation detection model for visual relation detection, the visual relation corresponding to the combined detection information can be detected.
By using the combined detection information, combinations of two objects without an interaction relationship can be eliminated, which reduces the amount of data to be detected and improves the efficiency of visual relationship detection in the subsequent steps.
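A minimal sketch of forming candidate object pairs for visual relationship detection; the distance-based filter below is only an assumed stand-in for the model-based combined detection described above:

```python
# Illustrative sketch: forming ordered (subject, object) candidate pairs from
# detections. The distance-based filter is only an assumed heuristic for pruning
# pairs unlikely to interact; the combined detection in the text is model-based.
from itertools import permutations

def boxes_close(box_a, box_b, max_gap: float = 200.0) -> bool:
    """Assumed heuristic: keep pairs whose box centers are reasonably close."""
    ax, ay = (box_a[0] + box_a[2]) / 2.0, (box_a[1] + box_a[3]) / 2.0
    bx, by = (box_b[0] + box_b[2]) / 2.0, (box_b[1] + box_b[3]) / 2.0
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 <= max_gap

def candidate_pairs(detections):
    """Ordered object pairs submitted to visual relationship detection."""
    pairs = []
    for subj, obj in permutations(range(len(detections)), 2):
        if boxes_close(detections[subj]["box"], detections[obj]["box"]):
            pairs.append((subj, obj))
    return pairs
```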
S220, inputting object detection information into a visual relation detection model to perform visual relation detection to obtain a visual relation between every two objects, wherein the visual relation represents an interactive relation between every two objects in the image to be processed;
as an alternative embodiment, the interaction relationship between two objects may include an action relationship, a spatial relationship, a prepositional relationship, and a comparison relationship. An action relationship expresses that one object performs a certain action on another object, for example, a person riding a bicycle; a spatial relationship expresses the relative position between two objects, for example, a cup to the left of a book. A prepositional relationship expresses an association between two objects in terms of membership, state, orientation, and the like, for example, a vehicle has tires. A comparison relationship expresses a distinction between two objects, for example, a first apple is larger than a second apple. The visual relationship detection model can detect the visual relationship between the objects corresponding to each pair of object detection information and perform semantic adjustment on the visual relationship. The visual relationship between two objects can correspond to a triplet in which the two objects serve as the subject and the object and a predicate connects them. When the visual relationship detection model performs semantic adjustment on the visual relationship, the amount of semantic information of the predicate between the subject and the object can be adjusted to obtain a predicate with richer meaning.
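For illustration only, such a subject-predicate-object triple could be represented as follows (the type and sample values are illustrative, not part of the disclosure):

```python
# Minimal sketch of a visual relationship triple (subject, predicate, object).
from typing import NamedTuple

class VisualRelationship(NamedTuple):
    subject: str
    predicate: str
    object: str

# e.g. an action relationship and a spatial relationship from the text above
examples = [
    VisualRelationship("person", "riding", "bicycle"),
    VisualRelationship("cup", "to the left of", "book"),
]
```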
As an optional embodiment, the visual relationship detection model includes a predicate identification network, and the step of inputting the object detection information into the visual relationship detection model for visual relationship detection to obtain the visual relationship between two objects includes:
inputting the object detection information into a predicate identification network to identify predicates corresponding to predicate relations between every two objects to obtain a target predicate;
and obtaining a visual relation according to the target predicate and the object corresponding to the target predicate.
As an optional embodiment, the visual relationship detection model includes a predicate identification network, where a predicate corresponding to a predicate relationship between two objects is identified in the predicate identification network, the identified target predicate is a semantically adjusted predicate, and the two objects and the corresponding target predicate form a visual relationship having a subject, a predicate, and an object.
And the objects corresponding to the two pairs of object detection information have predicate relations, predicates corresponding to the predicate relations are identified in a predicate identification network, and semantic adjustment can be performed to obtain target predicates. The target predicate corresponds to two objects, one of which can be a subject and the other can be an object, so that a visual relationship having the subject, predicate and object can be determined.
The target predicate between every two objects is determined through the predicate identification network, and the predicate identification network comprises a semantic adjustment layer, so that the accuracy of predicate identification can be improved.
As an optional embodiment, please refer to fig. 3. The predicate identification network includes an initial correlation calculation layer and a semantic adjustment layer, and inputting the object detection information into the predicate identification network to perform predicate identification between every two objects to obtain a target predicate includes:
s310, inputting the object detection information and preset predicates into an initial relevance calculation layer, and calculating the relevance of predicates corresponding to every two object detection information and each preset predicate to obtain initial relevance distribution information;
s320, inputting the initial correlation degree distribution information into a semantic adjustment layer, and performing predicate semantic adjustment on the initial correlation degree distribution information based on a preset matrix to obtain target correlation degree distribution information;
s330, determining a target predicate according to the target relevance degree distribution information.
As an optional embodiment, the predicate identification network includes an initial relevance calculation layer and a semantic adjustment layer, where the initial relevance calculation layer may be configured to calculate initial relevance distribution information, and the semantic adjustment layer may be configured to perform semantic adjustment on the initial relevance distribution information to obtain target relevance distribution information. The preset predicates comprise a plurality of predicates, the initial relevance distribution information is the probability distribution of a certain preset predicate determined by the predicates corresponding to the pairwise object detection information before semantic adjustment, and the relevance between the predicates corresponding to the pairwise object detection information before semantic adjustment and each preset predicate is represented.
When predicate semantic adjustment is performed on the initial relevance distribution information based on the preset matrix to obtain target relevance distribution information, a predicate corresponding to the initial relevance distribution information before the semantic adjustment and a predicate corresponding to the target relevance distribution information after the semantic adjustment have semantic relevance, for example, the predicate corresponding to the initial relevance distribution information is "on top", the predicate corresponding to the target relevance distribution information is "riding", wherein the "riding" also has the meaning of "on top", and the two predicates have semantic relevance, it can be determined that the semantic adjustment is correct. If the predicate corresponding to the initial relevance distribution information is 'upper', the predicate corresponding to the target relevance distribution information cannot be adjusted to 'lower', because the semantics of 'upper' and 'lower' are completely opposite, and the two predicates have no semantic relevance.
And the target relevance distribution information represents the relevance between predicates corresponding to the object detection information and each preset predicate after semantic adjustment. According to the magnitude of each correlation degree in the target correlation degree distribution information, the maximum value of the correlation degree in the target correlation degree distribution information can be determined, and a preset predicate corresponding to the maximum value of the correlation degree is determined as a target predicate.
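A minimal sketch of the two layers described above, assuming a fixed-length feature vector per object pair and a K × K preset matrix; the layer shapes and the softmax/matrix-product formulation are assumptions, not the claimed implementation:

```python
# Minimal sketch of the predicate identification network: an initial relevance
# calculation layer followed by a semantic adjustment layer that applies a
# (K x K) preset matrix. Dimensions and layer choices are assumptions.
import torch
import torch.nn as nn

class PredicateIdentificationNet(nn.Module):
    def __init__(self, pair_feature_dim: int, num_predicates: int,
                 preset_matrix: torch.Tensor):
        super().__init__()
        # Initial relevance calculation layer: scores every preset predicate.
        self.initial_relevance = nn.Linear(pair_feature_dim, num_predicates)
        # Preset matrix for the semantic adjustment layer: entry [r_prime, r] is
        # the probability of adjusting initial predicate r_prime to predicate r.
        self.register_buffer("preset_matrix", preset_matrix)   # shape (K, K)

    def forward(self, pair_features: torch.Tensor):
        # Initial relevance distribution information (before semantic adjustment).
        initial = torch.softmax(self.initial_relevance(pair_features), dim=-1)  # (B, K)
        # Target relevance distribution information (after semantic adjustment).
        target = initial @ self.preset_matrix                                   # (B, K)
        # Target predicate: preset predicate with the maximum correlation degree.
        return initial, target, target.argmax(dim=-1)
```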
As an optional embodiment, the calculation formula used when identifying the predicate corresponding to the predicate relationship between a pair of object detection information may be expressed as:

Pr(y_i^s = r | o_j, o_k; θ) = Σ_{r'∈R} Pr(y_i^s = r | y_i^g = r') · Pr(y_i^g = r' | o_j, o_k; θ)

wherein Pr(y_i^g = r' | o_j, o_k; θ) expresses the probability distribution of the predicate before semantic adjustment, and the initial relevance distribution information can be obtained through this term. R represents the set of preset predicates, K represents the number of preset predicates, and the preset predicates comprise different types of predicates. y_i represents the predicate, where the superscript g marks the output before semantic adjustment and the superscript s marks the output after semantic adjustment. (o_j, o_k) represents the pair of object detection information, and θ represents the model parameters of the visual relationship detection model. The degree of correlation between the predicate corresponding to the pair of object detection information and each preset predicate is determined through the probability distribution of the predicate before semantic adjustment.

Pr(y_i^s = r | o_j, o_k; θ) represents the probability distribution of the predicate after semantic adjustment, through which the target relevance distribution information can be obtained: the probability distribution of the predicate before semantic adjustment is adjusted on the basis of the semantic adjustment matrix to obtain the probability distribution of the predicate after semantic adjustment, the degree of correlation between the semantically adjusted predicate and each preset predicate is determined, and the preset predicate corresponding to the maximum correlation degree is determined as the predicate corresponding to the pair of object detection information.

Pr(y_i^s = r | y_i^g = r') represents the probability distribution of the semantic adjustment itself and measures the confidence of converting a predicate with a small amount of semantic information into a predicate with a rich amount of semantic information, where the superscript s again denotes the output after semantic adjustment and the superscript g the output before semantic adjustment. This conditional distribution, arranged as a K × K matrix, can serve as the preset matrix, and the preset matrix performs the semantic adjustment on the initial relevance distribution information in the semantic adjustment layer.
And performing predicate semantic adjustment on the initial relevancy distribution information through the semantic adjustment layer to obtain a target predicate with rich semantic information amount, so that the accuracy of predicate identification is improved, and the effectiveness of a target scene graph generated in the subsequent step can be improved.
As an optional embodiment, referring to fig. 4, inputting the initial relevancy distribution information into a semantic adjustment layer, performing predicate semantic adjustment on the initial relevancy distribution information based on a preset matrix, and obtaining target relevancy distribution information includes:
s410, determining an initial predicate according to the initial relevance degree distribution information;
s420, under the condition that the initial predicate is the universal predicate, carrying out predicate semantic adjustment on the initial relevancy distribution information based on a semantic adjustment matrix in a preset matrix, wherein the universal predicate represents the predicate of which the use probability is greater than a preset threshold in the preset predicate;
and S430, under the condition that the initial predicate is the non-universal predicate, determining the initial relevance distribution information as target relevance distribution information based on a semantic keeping matrix in a preset matrix, wherein the non-universal predicate represents the predicate of which the use probability is smaller than a preset threshold value in the preset predicate.
As an optional embodiment, based on Shannon's theory of semantic information, the amount of semantic information contained in a predicate can be determined from the probability of the predicate's occurrence: a predicate with a low probability of occurrence contains more semantic information. Whether a predicate is a general predicate or a non-general predicate is determined according to the amount of semantic information. If the predicate is a general predicate, the amount of semantic information it contains is low, and its usage probability among the preset predicates is greater than a preset threshold. If the predicate is a non-general predicate, the amount of semantic information it contains is high, and its usage probability among the preset predicates is smaller than the preset threshold. For example, for a person on a bicycle, "on" can describe the relationship between the person and the bicycle, but "riding" in "person riding a bicycle" expresses the action the person performs on the bicycle, so "riding" carries a greater amount of semantic information than "on". "On" represents only the relative positions of two objects, while "riding" represents an action: wherever "riding" applies, "on" also applies (for example, "person riding a horse" and "person on the horse"), but "riding" does not necessarily apply wherever "on" applies (for example, "book on the desk"). The probability of "riding" occurring is therefore significantly lower than that of "on", which illustrates that predicates with a lower probability of occurrence carry more semantic information.
When the initial relevance distribution information is input into a semantic adjustment matrix for predicate semantic adjustment, under the condition that an initial predicate corresponding to the initial relevance distribution information is a general predicate, the initial predicate indicates that the amount of semantic information contained in the initial predicate is small, the relevance distribution in the initial relevance distribution information can be adjusted through the semantic adjustment matrix to obtain target relevance distribution information, and therefore predicates corresponding to two pairs of object detection information are associated with preset predicates containing a large amount of semantic information to obtain the target predicates.
Under the condition that the initial predicate corresponding to the initial relevance distribution information is a non-universal predicate, the fact that the quantity of semantic information contained in the initial predicate is large is described, the relevance distribution in the initial relevance distribution information can be not adjusted through a semantic keeping matrix, the initial predicate corresponding to the initial relevance distribution information is kept, and therefore the predicate corresponding to two pairs of object detection information is associated with the preset predicate containing a large quantity of semantic information, and the target predicate is obtained.
Based on a semantic adjustment matrix in the preset matrix, when a predicate corresponding to a predicate relation between every two pieces of object detection information is identified as a general predicate, semantic adjustment can be performed on the initial relevancy distribution information, and an identification result with a small semantic information amount is converted into an identification result with a large semantic information amount. Based on a semantic keeping matrix in the preset matrix, when the predicate corresponding to the predicate relation between every two pieces of object detection information is identified to be a non-universal predicate, semantic adjustment on the initial relevancy distribution information is reduced, and therefore the identification result with rich semantic information content is kept.
When semantic adjustment is carried out, the recognition result without rich semantic information amount is adjusted, and the recognition result with rich semantic information amount is kept, so that wrong adjustment of predicates with rich semantic information amount can be avoided, and the effectiveness of semantic adjustment is improved.
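A sketch of this branching between the semantic adjustment matrix and the semantic keeping (identity) matrix; the set of general-predicate indices and the example numbers are assumptions for demonstration:

```python
# Illustrative sketch: the semantic adjustment matrix is applied only when the
# initial predicate is a general (frequent, low-information) predicate; the
# identity ("semantic keeping") matrix is used otherwise. `general_predicates`
# is an assumed set of predicate indices.
import numpy as np

def semantic_adjust(initial_dist: np.ndarray,
                    adjustment_matrix: np.ndarray,
                    general_predicates: set) -> np.ndarray:
    """initial_dist: (K,) distribution over preset predicates for one object pair."""
    keep_matrix = np.eye(len(initial_dist))          # semantic keeping matrix
    initial_predicate = int(np.argmax(initial_dist))
    if initial_predicate in general_predicates:
        # General predicate: low semantic information, adjust the distribution.
        return initial_dist @ adjustment_matrix
    # Non-general predicate: already rich in semantic information, keep it.
    return initial_dist @ keep_matrix

# Example: a distribution that peaks on a general predicate "on" (index 0)
dist = np.array([0.6, 0.3, 0.1])
adj = np.array([[0.2, 0.7, 0.1],    # "on" is mostly refined to "riding" (index 1)
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
print(semantic_adjust(dist, adj, general_predicates={0}))   # argmax moves to index 1
```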
And S230, inputting the visual relationship and the object detection information corresponding to the visual relationship into a scene graph generation model for scene graph generation to obtain a target scene graph corresponding to the image to be processed, wherein the target scene graph is structural information marked with the visual relationship between every two objects.
As an optional embodiment, the visual relationship and the object detection information corresponding to the visual relationship are input into the scene graph generation model, and a target scene graph can be obtained according to the visual relationship between the objects corresponding to each pair of object detection information. The target scene graph is structural information formed by points and edges: a point in the target scene graph represents an object, and an edge represents the visual relationship between a pair of objects. Visual relationships with subjects, predicates, and objects can be displayed in the target scene graph. For example, the visual relationship (person, has, hand) is displayed in the target scene graph, where "person" is the subject, "has" is the predicate, and "hand" is the object.
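As a small sketch of the target scene graph as points and labeled edges (the use of networkx and the sample triples are assumptions, not the claimed representation):

```python
# Minimal sketch: a scene graph as structural information of points (objects)
# and edges (visual relationships). networkx is an assumed choice here.
import networkx as nx

def build_scene_graph(relationships):
    """relationships: iterable of (subject, predicate, object) triples."""
    graph = nx.MultiDiGraph()
    for subject, predicate, obj in relationships:
        graph.add_edge(subject, obj, predicate=predicate)
    return graph

scene_graph = build_scene_graph([("person", "has", "hand"),
                                 ("person", "riding", "bicycle")])
for subject, obj, data in scene_graph.edges(data=True):
    print(subject, data["predicate"], obj)
```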
As an alternative embodiment, the method further includes a model training method, please refer to fig. 5, the model training method includes:
s510, inputting the annotated image into an image detection model for object detection to obtain training object detection information corresponding to each object in the annotated image;
s520, inputting the detection information of the training objects into a first model to be trained for visual relationship detection to obtain a first training visual relationship between every two objects;
s530, inputting a first training visual relationship and training object detection information corresponding to the first training visual relationship into a second model to be trained for scene graph generation to obtain a first training scene graph corresponding to a labeled image;
and S540, training the first model to be trained and the second model to be trained according to the first training visual relationship and the reference visual relationship to obtain a first visual relationship detection model and an initial scene graph generation model.
As an optional embodiment, during model training, the image detection model is a pre-trained model, so that training object detection information required during training can be extracted through the image detection model, and the training object detection information is object detection information corresponding to an object in the labeled image. And acquiring a labeled image by adopting a fully supervised training mode, wherein the labeled image is labeled with a reference visual relationship between every two objects. And inputting the marked image into an image detection model, detecting each object in the marked image according to a preset marking frame, and extracting a detection area of each object in the marked image. And extracting the characteristic information corresponding to each object in the detection area corresponding to each object, and determining the object in the labeled image according to the characteristic information corresponding to each object, thereby obtaining the training object detection information. At this time, the first model to be trained does not include the preset matrix, that is, the first model to be trained does not have the function of semantic adjustment. After the pre-set matrix to be trained is trained based on the training visual relationship and the reference visual relationship, the trained pre-set matrix is added to the first visual relationship detection model.
The training object detection information is input into the first model to be trained for visual relationship detection. Based on a first network to be trained in the first model to be trained, the predicate corresponding to the predicate relationship between each pair of training object detection information is identified to obtain a first training target predicate. Based on a second network to be trained in the first model to be trained, the objects corresponding to the first training target predicate are combined with the first training target predicate to obtain a first training visual relationship having a subject, a predicate, and an object. The first training visual relationship represents the interaction relationship between every two objects in the annotated image as obtained through the first model to be trained.
Inputting the first training visual relationship and training object detection information corresponding to the first training visual relationship into a second model to be trained for scene graph generation, labeling the first training visual relationship between objects in a labeled image in the second model to be trained to obtain a first training scene graph corresponding to the labeled image, wherein the first training scene graph is structural information labeled with the first training visual relationship between every two objects.
And calculating first loss data between the first training visual relationship and the reference visual relationship, wherein the first loss data can be a loss function between the first training visual relationship and the reference visual relationship, and training a first model to be trained and a second model to be trained according to the first loss data to obtain a first visual relationship detection model and an initial scene graph generation model, and the first visual relationship detection model is a visual relationship detection model without a preset matrix.
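A minimal sketch of one fully supervised training step, taking the first loss data to be a cross-entropy between the predicted predicate distribution and the reference predicate; the optimizer, batching, and the omission of any scene-graph-side loss terms are simplifying assumptions:

```python
# Minimal sketch of one training step: the first loss data is modeled here as a
# cross-entropy between predicted predicate scores and the reference predicates.
# Optimizer choice, batching, and the omission of scene-graph-side losses are
# simplifying assumptions.
import torch
import torch.nn as nn

def training_step(first_model: nn.Module, optimizer: torch.optim.Optimizer,
                  pair_features: torch.Tensor,
                  reference_predicates: torch.Tensor) -> float:
    optimizer.zero_grad()
    logits = first_model(pair_features)          # (B, K) scores over preset predicates
    loss = nn.functional.cross_entropy(logits, reference_predicates)
    loss.backward()
    optimizer.step()
    return loss.item()
```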
When model training is carried out, only a visual relation detection model and a scene graph generation model need to be trained, most training steps are completed based on a source domain, and only fine tuning needs to be carried out on a target domain subsequently, so that the training cost is reduced.
As an alternative embodiment, referring to fig. 6, after the first model to be trained and the second model to be trained are trained according to the training visual relationship and the reference visual relationship, and the first visual relationship detection model and the initial scene graph generation model are obtained, the method further includes:
S610, detecting the word frequency information corresponding to each reference predicate in the reference visual relationship;
S620, classifying the reference predicates according to preset word frequency segmentation information and the word frequency information to obtain the reference predicate type corresponding to each labeled image;
S630, combining the first visual relationship detection model with a preset matrix to obtain a second visual relationship detection model;
S640, inputting the training object detection information into the second visual relationship detection model for visual relationship detection to obtain a second training visual relationship between every two objects;
S650, inputting the second training visual relationship and the training object detection information corresponding to the second training visual relationship into the initial scene graph generation model for scene graph generation to obtain a second training scene graph corresponding to the labeled image;
and S660, adjusting the second visual relationship detection model and the initial scene graph generation model based on the reference predicate type, the second training visual relationship and the reference visual relationship corresponding to each labeled image to obtain the visual relationship detection model and the scene graph generation model.
As an alternative embodiment, please refer to fig. 7, which is a schematic diagram of the transfer learning. Because a predicate with a low occurrence probability contains more semantic information, the word frequency information corresponding to each reference predicate in the reference visual relationship is detected; the word frequency information represents the occurrence probability of each reference predicate among all reference predicates. The semantic information amount contained in each reference predicate is then calculated from the word frequency information, using the following formula:
I(y_i) = -log_b[Pr(y_i)]
where y_i denotes a predicate, Pr(y_i) denotes its occurrence probability (the word frequency information), b is the base of the logarithm, and I(y_i) denotes the amount of semantic information in the predicate. A predicate with smaller word frequency information contains a larger amount of semantic information, and a predicate with larger word frequency information contains a smaller amount of semantic information. The reference predicates are sorted from small to large by semantic information amount to obtain a reference predicate sequence. The preset number of reference predicates counted from the first one are taken as universal predicates and the remaining reference predicates as non-universal predicates, dividing the reference predicates into two types. A universal predicate is a predicate whose occurrence probability is greater than a preset probability, a non-universal predicate is a predicate whose occurrence probability is smaller than the preset probability, and the preset probability corresponds to the occurrence probability of the last reference predicate among the preset number of reference predicates. For example, the preset number may be 15, that is, the first fifteen reference predicates of the reference predicate sequence are taken as universal predicates and the reference predicates after them as non-universal predicates; a small sketch of this split follows.
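A small, hypothetical sketch of this split is given below: occurrence probabilities are counted from the reference predicates, the semantic information amount is estimated as I(y) = -log Pr(y) (the natural logarithm is used here; any base b gives the same ordering), and the preset number of least-informative predicates are taken as universal predicates.

```python
import math
from collections import Counter

def split_predicates(reference_predicates, preset_number=15):
    """Split reference predicates into universal and non-universal predicates."""
    counts = Counter(reference_predicates)
    total = sum(counts.values())
    word_freq = {p: c / total for p, c in counts.items()}     # occurrence probability
    info = {p: -math.log(pr) for p, pr in word_freq.items()}  # semantic information amount
    ordered = sorted(info, key=info.get)                      # from small to large information
    universal = set(ordered[:preset_number])                  # frequent, little information
    non_universal = set(ordered[preset_number:])              # rare, rich information
    return universal, non_universal

universal, non_universal = split_predicates(
    ["on"] * 80 + ["has"] * 15 + ["riding"] * 5, preset_number=2)
# universal == {"on", "has"}, non_universal == {"riding"}
```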
The labeled images are taken as the source domain, and the labeled images containing universal predicates are down-sampled to obtain the target domain. The first visual relationship detection model, the preset matrix and the initial scene graph generation model trained on the source domain are transferred to the target domain, and the first visual relationship detection model and the preset matrix are combined to obtain the second visual relationship detection model. The last of the sequentially arranged neural network layers in the second visual relationship detection model and in the initial scene graph generation model is then adjusted, this last layer being the classification layer, to obtain the visual relationship detection model and the scene graph generation model. When the second visual relationship detection model and the initial scene graph generation model are adjusted on the target domain, sample images can be taken from the labeled images carrying the reference predicate types for the adjustment; it is not necessary to use all of the labeled images. A sketch of this transfer step is given below.
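The transfer step might be sketched as follows, purely as an illustration under stated assumptions: the target domain is built by down-sampling labeled images whose reference predicates are all universal (the keep ratio and the per-image dictionary layout are assumptions), and only the final classification layer of a hypothetical torch.nn.Sequential model is left trainable.

```python
import random
import torch.nn as nn

def build_target_domain(annotated_images, universal, keep_ratio=0.3):
    """Down-sample labeled images whose reference predicates are all universal.

    Each image is assumed to be a dict with a "predicates" list; keep_ratio is
    an illustrative value, not taken from the disclosure.
    """
    target = []
    for img in annotated_images:
        only_universal = all(p in universal for p in img["predicates"])
        if only_universal and random.random() > keep_ratio:
            continue                        # drop part of the universal-predicate images
        target.append(img)
    return target

def prepare_for_finetuning(model: nn.Sequential) -> nn.Sequential:
    """Freeze everything except the last (classification) layer before adjustment."""
    for param in model.parameters():
        param.requires_grad = False
    for param in model[-1].parameters():    # the last sequentially arranged layer
        param.requires_grad = True
    return model
```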
When the first visual relationship detection model and the initial scene graph generation model are to be adjusted, the preset matrix can be obtained; the preset matrix is derived based on the first visual relationship detection model and the initial scene graph generation model. Combining the preset matrix with the first visual relationship detection model gives the second visual relationship detection model, which is a visual relationship detection model equipped with the preset matrix.
After the labeled image is input into the image detection model for object detection to obtain the training object detection information, the training object detection information is input into the second visual relationship detection model for visual relationship detection, giving a second training visual relationship between every two objects; the second training visual relationship represents the interaction relationship between every two objects in the labeled image as obtained by the second visual relationship detection model. The preset matrix performs semantic adjustment on the initial relevance distribution information corresponding to the training object detection information to obtain target relevance distribution information, from which a second training target predicate is determined, yielding the second training visual relationship.
The second training visual relationship and the training object detection information corresponding to the second training visual relationship are input into the initial scene graph generation model for scene graph generation, giving a second training scene graph corresponding to the labeled image; the second training scene graph is structural information annotated with the second training visual relationship between every two objects.
The second training visual relationship is a detection result that already takes into account the information amount associated with each reference predicate type. Second loss data are computed between, on the one hand, the reference predicate type and the reference visual relationship and, on the other hand, the second training visual relationship; the second loss data may be a loss function over these quantities. The second visual relationship detection model and the initial scene graph generation model are adjusted according to the second loss data to obtain the visual relationship detection model and the scene graph generation model. One possible form of such a loss is sketched below.
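One plausible reading of the second loss data, offered only as a sketch and not as the disclosed formulation, is a classification loss that also reflects the reference predicate type, for example by up-weighting the non-universal, information-rich predicate classes; the boost factor and class-weighting scheme are assumptions.

```python
import torch
import torch.nn as nn

def second_loss(predicate_logits, ref_predicates, non_universal_ids, num_classes, boost=2.0):
    """Classification loss weighted by reference predicate type (illustrative only).

    non_universal_ids: class indices of non-universal predicates; boost is an
    assumed factor emphasising information-rich predicates.
    """
    weights = torch.ones(num_classes)
    weights[list(non_universal_ids)] = boost
    criterion = nn.CrossEntropyLoss(weight=weights)
    return criterion(predicate_logits, ref_predicates)
```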
The labeled images are classified according to the amount of semantic information contained in their predicates, and the models are then adjusted based on the reference predicate type, the training visual relationship and the reference visual relationship, so that the models acquire the ability to distinguish universal from non-universal predicates and the predicate recognition accuracy is improved. Because the second visual relationship detection model and the initial scene graph generation model are adjusted rather than retrained, the problem of overfitting is also avoided.
As an alternative embodiment, please refer to fig. 8; after the first model to be trained and the second model to be trained are trained according to the training visual relationship and the reference visual relationship to obtain the first visual relationship detection model and the initial scene graph generation model, the preset matrix is obtained as follows:
S810, inputting the training object detection information into the first visual relationship detection model for visual relationship detection to obtain an initial visual relationship between every two objects;
S820, inputting the initial visual relationship and the training object detection information corresponding to the initial visual relationship into the initial scene graph generation model for scene graph generation to obtain an initial scene graph corresponding to the labeled image, wherein the initial scene graph is structural information annotated with the initial visual relationship between every two objects;
S830, determining an initial matrix according to the predicates in the initial visual relationship and the reference predicates in the reference visual relationship;
and S840, obtaining a preset matrix according to the normalization matrix corresponding to the initial matrix and the identity matrix corresponding to the initial matrix.
As an optional embodiment, the training object detection information is input into the first visual relationship detection model for visual relationship detection, giving the initial relevance distribution information; an initial predicate is determined from the initial relevance distribution information, and from the initial predicate the initial visual relationship between every two objects is obtained. The initial visual relationship represents the interaction relationship between every two objects in the labeled image as obtained by the first visual relationship detection model.
The initial visual relationship and the training object detection information corresponding to the initial visual relationship are input into the initial scene graph generation model for scene graph generation, giving an initial scene graph corresponding to the labeled image; the initial scene graph is structural information annotated with the initial visual relationship between every two objects. The predicates in the initial visual relationship are compared with the reference predicates in the reference visual relationship to determine which predicates are classified correctly and which are classified incorrectly.
As an alternative embodiment, the preset matrix may be represented as:
C^* ∈ R^{K×K}
where C^* denotes the preset matrix, K denotes the number of preset predicates (the preset predicates being distinct predicates), and R^{K×K} denotes the set of real K×K matrices. In the process of obtaining the preset matrix, a confusion matrix over the identified predicates is first initialized to obtain the initial matrix, which can be expressed as:
C ∈ R^{K×K}
Each element of the initial matrix is denoted C_{j,k} and records the number of instances labeled as the j-th predicate class but predicted as the k-th predicate class; j may equal k, and when j equals k the labeled predicate and the identified predicate are consistent.
The elements feeding the semantic adjustment matrix of the preset matrix likewise record, for predicates labeled as the j-th class, how many were predicted as the k-th class, so they can be determined from the numbers of correctly and incorrectly classified predicates. For example, suppose the reference predicates contain 100 instances of predicate A, whose class index is 3. In the initial visual relationship, only 50 of the predicates corresponding to these 100 instances are predicate A, while 30 are predicate B (class index 4) and 20 are predicate C (class index 5). The number of correctly classified predicates is 50 and the numbers of incorrectly classified predicates are 30 and 20, which are recorded in the matrix as C_{3,3}, C_{3,4} and C_{3,5}. A small sketch of this count appears below.
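The count in this example can be reproduced with a short sketch; the function name and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def build_initial_matrix(ref_labels, pred_labels, num_classes):
    """Confusion matrix: rows are labeled classes, columns are predicted classes."""
    C = np.zeros((num_classes, num_classes))
    for j, k in zip(ref_labels, pred_labels):
        C[j, k] += 1        # labeled as class j, predicted as class k
    return C

# the worked example: 100 predicates of class 3, predicted as 50/30/20 of classes 3/4/5
refs = [3] * 100
preds = [3] * 50 + [4] * 30 + [5] * 20
C = build_initial_matrix(refs, preds, num_classes=6)
assert C[3, 3] == 50 and C[3, 4] == 30 and C[3, 5] == 20
```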
The preset matrix is obtained from the normalization matrix corresponding to the initial matrix and the identity matrix corresponding to the initial matrix: the normalization matrix of the initial matrix is the semantic adjustment matrix, and the identity matrix of the initial matrix is the semantic retention matrix. Normalizing the initial matrix yields the normalized semantic adjustment matrix C', which can be calculated by the following formula:
C'_{j,k} = C_{j,k} / Σ_{k'} C_{j,k'}
The semantic adjustment matrix C' represents, to some extent, the semantic relevance between predicates. However, its diagonal elements are small for predicates with a rich amount of semantic information, so directly multiplying the initial relevance distribution information by C' would reduce the probability of identifying those information-rich predicates. The semantic retention matrix is therefore added to the semantic adjustment matrix C'; based on the semantic retention matrix, the identification results of predicates with rich semantic information are retained, and the preset matrix C^* is obtained. The specific formula is as follows:
C^* = (C' + I_K) × 0.5
where I_K ∈ R^{K×K} is the identity matrix; multiplying the whole expression by 0.5 ensures that the elements of each row of the preset matrix sum to 1. A sketch of this construction is given below.
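The construction of the preset matrix from the initial matrix can be sketched as follows; the row-wise orientation of the normalization, and of the multiplication with the initial relevance distribution information, are assumptions consistent with the requirement that each row of C^* sums to 1.

```python
import numpy as np

def preset_matrix(C):
    """C* = (C' + I_K) * 0.5 with C' the row-normalized initial matrix."""
    row_sums = C.sum(axis=1, keepdims=True)
    C_prime = C / np.where(row_sums == 0, 1, row_sums)   # semantic adjustment matrix C'
    K = C.shape[0]
    return 0.5 * (C_prime + np.eye(K))                   # non-empty rows of C* sum to 1

def adjust_distribution(initial_relevance, C_star):
    """Semantic adjustment of the initial relevance distribution (assumed orientation)."""
    return initial_relevance @ C_star
```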
After the preset matrix is obtained, it is added to the first visual relationship detection model, so that the model gains the semantic adjustment function.
During subsequent training, the trained preset matrix is used with its parameters kept fixed, which avoids semantic drift and improves the accuracy of the semantic adjustment.
As an alternative embodiment, please refer to fig. 9, which is a schematic diagram of inputting a picture to be processed and generating the target scene graph. The image to be processed is input into the image detection model for object detection; the labeling frame and feature information of each object are obtained, the positions of the four objects racket, hand, person and short sleeve are determined, and the object detection information of these four objects is obtained. The object detection information is input into the visual relationship detection model to obtain the visual relationship between every two objects. In the visual relationship detection model, the predicate corresponding to the two objects racket and hand is identified, giving the target predicate "on", so the visual relationship (racket, on, hand) consisting of subject, predicate and object can be determined. Identifying the predicate corresponding to the two objects person and hand gives the target predicate "has", and the visual relationship (person, has, hand) can be determined. Identifying the predicate corresponding to the two objects short sleeve and person gives the target predicate "on", and the visual relationship (short sleeve, on, person) can be determined. The visual relationships (racket, on, hand), (person, has, hand) and (short sleeve, on, person), together with the object detection information of the four objects racket, hand, person and short sleeve, are input into the scene graph generation model, the visual relationship between every two objects is annotated onto the corresponding object detection information, and the target scene graph is obtained. For example, during image retrieval, if the input retrieval information is an image of a person wearing a short sleeve, the target scene graphs generated by the visual relationship detection model and the scene graph generation model can be searched for one containing the visual relationship (short sleeve, on, person) to obtain the retrieval result. Or, when the user inputs the question "what is on the person", the answer "short sleeve" is obtained by recognizing the visual relationship (short sleeve, on, person) in the target scene graph generated by the visual relationship detection model and the scene graph generation model, thereby completing the visual question answering. A minimal sketch of such queries over the triplets is given below.
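Purely as an illustration, the triplets of the target scene graph in this example can be queried for retrieval and visual question answering as follows; the data layout and function names are assumptions.

```python
target_scene_graph = [
    ("racket", "on", "hand"),
    ("person", "has", "hand"),
    ("short sleeve", "on", "person"),
]

def retrieve(graph, subject, predicate, obj):
    """Image retrieval: does the target scene graph contain the queried relationship?"""
    return (subject, predicate, obj) in graph

def what_is_on(graph, target):
    """Visual question answering: what is 'on' the given object?"""
    return [s for s, p, o in graph if p == "on" and o == target]

print(retrieve(target_scene_graph, "short sleeve", "on", "person"))   # True
print(what_is_on(target_scene_graph, "person"))                       # ['short sleeve']
```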
The embodiment of the disclosure provides an image processing method: the image to be processed is input into the image detection model for object detection to obtain the object detection information, and the object detection information is input into the visual relationship detection model for visual relationship detection. During visual relationship detection, the predicate corresponding to the predicate relationship between every two detected objects is semantically adjusted, so that the target predicate contains rich semantic information and the accuracy of predicate identification is improved. In subsequent steps, the visual relationship between every two detected objects is generated from the target predicate and its corresponding objects, and the visual relationship is input into the scene graph generation model to generate the target scene graph, which improves the accuracy of the visual relationships annotated in the target scene graph and thus the effectiveness of the target scene graph.
Fig. 10 is a block diagram illustrating an image processing apparatus according to an exemplary embodiment. Referring to fig. 10, the apparatus includes:
the object detection module 1010 is configured to input the image to be processed into the image detection model for object detection, so as to obtain object detection information corresponding to at least two objects in the image to be processed;
a visual relationship detection module 1020 configured to perform visual relationship detection by inputting object detection information into a visual relationship detection model to obtain a visual relationship between two objects, where the visual relationship represents an interaction relationship between two objects in the image to be processed;
and the scene graph generation module 1030 is configured to perform image processing by inputting the visual relationship and the object detection information corresponding to the visual relationship into the scene graph generation model, so as to obtain a target scene graph corresponding to the image to be processed, where the target scene graph is structural information labeled with the visual relationship between every two objects.
As an alternative embodiment, the visual relationship detection model includes a predicate identification network, and the visual relationship detection module 1020 includes:
the predicate identification unit is configured to input the object detection information into a predicate identification network to perform predicate identification corresponding to a predicate relation between every two objects to obtain a target predicate, and the target predicate represents a predicate after semantic adjustment;
a visual relationship determination unit configured to perform obtaining the visual relationship according to the target predicate and the object corresponding to the target predicate.
As an optional embodiment, the predicate identification network includes an initial relevance calculation layer and a semantic adjustment layer, and the predicate identification unit includes:
the initial relevance calculating unit is configured to input the object detection information and preset predicates into an initial relevance calculating layer, carry out relevance calculation on predicates corresponding to every two object detection information and each preset predicate to obtain initial relevance distribution information, and the initial relevance distribution information represents relevance between predicates corresponding to every two object detection information and each preset predicate before semantic adjustment;
the semantic adjusting unit is configured to input the initial relevance distribution information into a semantic adjusting layer, perform predicate semantic adjustment on the initial relevance distribution information based on a preset matrix to obtain target relevance distribution information, and the target relevance distribution information represents a predicate corresponding to the semantic-adjusted pairwise object detection information and a relevance between each preset predicate;
and the target predicate determination unit is configured to determine the target predicate according to the target relevance distribution information.
As an alternative embodiment, the semantic adjusting unit includes:
an initial predicate determination unit configured to determine an initial predicate according to the initial relevance distribution information;
the first semantic adjusting unit is configured to perform predicate semantic adjustment on the initial relevancy distribution information based on a semantic adjusting matrix in a preset matrix under the condition that the initial predicate is a universal predicate, wherein the universal predicate represents a predicate of which the use probability is greater than a preset threshold in the preset predicate;
and the second semantic adjusting unit is configured to determine the initial relevance distribution information as the target relevance distribution information based on a semantic retention matrix in the preset matrix under the condition that the initial predicate is a non-universal predicate, wherein the non-universal predicate represents a predicate of which the use probability is smaller than a preset threshold in the preset predicates. A hedged sketch of these two cases follows.
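The two cases handled by the first and second semantic adjusting units might be sketched as follows, assuming a row-oriented semantic adjustment matrix and integer predicate class indices; this is an illustrative reading, not the disclosed implementation.

```python
import numpy as np

def semantic_adjustment_layer(initial_relevance, C_prime, universal_ids):
    """Return the target relevance distribution information.

    initial_relevance: (K,) distribution over preset predicates before adjustment.
    C_prime: semantic adjustment matrix; universal_ids: indices of universal predicates.
    """
    initial_predicate = int(np.argmax(initial_relevance))
    if initial_predicate in universal_ids:
        return initial_relevance @ C_prime    # universal predicate: adjust semantics
    return initial_relevance                  # non-universal: keep as-is (semantic retention)
```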
As an optional embodiment, the apparatus further comprises:
the first training feature extraction module is configured to input the labeled image into the image detection model for feature extraction, so as to obtain training object detection information corresponding to each object in the labeled image;
the first training visual relationship detection module is configured to input training object detection information into a first model to be trained for visual relationship detection to obtain a first training visual relationship between every two objects, and the first training visual relationship represents an interactive relationship between every two objects in a labeling image obtained through the first model to be trained;
the first training scene graph generation module is configured to input the first training visual relationship and training object detection information corresponding to the first training visual relationship into a second model to be trained to generate a scene graph, and obtain a first training scene graph corresponding to the labeled image, wherein the first training scene graph is structural information labeled with the first training visual relationship between every two objects;
and the model training module is configured to train the first model to be trained and the second model to be trained according to the training visual relationship and the reference visual relationship to obtain a first visual relationship detection model and an initial scene graph generation model.
As an optional embodiment, the apparatus further comprises:
the word frequency information detection module is configured to detect the word frequency information corresponding to each reference predicate in the reference visual relationship;
the second visual relation detection model acquisition module is configured to combine the first visual relation detection model and the preset matrix to obtain a second visual relation detection model;
the second training visual relationship acquisition module is configured to input the training object detection information into a second visual relationship detection model for visual relationship detection to obtain a second training visual relationship between every two objects, and the second training visual relationship represents the interaction relationship between every two objects in the labeled image obtained through the second visual relationship detection model;
the second training scene graph acquisition module is configured to input training object detection information corresponding to the second training visual relationship and the second training visual relationship into the initial scene graph generation model for scene graph generation to obtain a second training scene graph corresponding to the labeled image, wherein the second training scene graph is structural information labeled with the second training visual relationship between every two objects;
and the model adjusting module is configured to execute adjustment on the second visual relationship detection model and the initial scene graph generation model based on the reference predicate type, the second training visual relationship and the reference visual relationship corresponding to each labeled image, so as to obtain a visual relationship detection model and a scene graph generation model.
As an optional embodiment, the apparatus further comprises:
the initial visual relationship detection module is configured to input the training object detection information into the first visual relationship detection model for visual relationship detection to obtain an initial visual relationship between every two objects, and the initial visual relationship represents an interactive relationship between every two objects in the labeled image obtained through the first visual relationship detection model;
the initial scene graph generation module is configured to input the initial visual relationship and training object detection information corresponding to the initial visual relationship into an initial scene graph generation model for scene graph generation to obtain an initial scene graph corresponding to the labeled image, wherein the initial scene graph is structural information labeled with the initial visual relationship between every two objects;
an initial matrix determination module configured to perform determining an initial matrix from a predicate in an initial visual relationship and a reference predicate in a reference visual relationship;
and the preset matrix determining module is configured to execute the normalization matrix corresponding to the initial matrix and the identity matrix corresponding to the initial matrix to obtain a preset matrix.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 11 is a block diagram illustrating an electronic device for image processing, which may be a server, according to an exemplary embodiment, and an internal structure thereof may be as shown in fig. 11. The electronic device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic equipment comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the electronic device is used for connecting and communicating with an external terminal through a network. The computer program is executed by a processor to implement an image processing method.
Those skilled in the art will appreciate that the architecture shown in fig. 11 is merely a block diagram of some of the structures associated with the disclosed aspects and does not constitute a limitation on the electronic devices to which the disclosed aspects apply, as a particular electronic device may include more or fewer components than those shown, or combine certain components, or have a different arrangement of components.
In an exemplary embodiment, a computer-readable storage medium comprising instructions, such as the memory 1104 comprising instructions, executable by the processor 1120 of the electronic device 1100 to perform the method described above is also provided. Alternatively, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, comprising computer instructions which, when executed by a processor, implement the image processing method described above.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. An image processing method, characterized in that the method comprises:
inputting an image to be processed into an image detection model for object detection to obtain object detection information corresponding to at least two objects in the image to be processed respectively;
inputting the object detection information into a visual relation detection model for visual relation detection to obtain a visual relation between every two objects, wherein the visual relation represents an interactive relation between every two objects in the image to be processed;
and inputting the visual relationship and the object detection information corresponding to the visual relationship into a scene graph generation model for scene graph generation to obtain a target scene graph corresponding to the image to be processed, wherein the target scene graph is the structural information marked with the visual relationship between every two objects.
2. The image processing method of claim 1, wherein the visual relationship detection model comprises a predicate identification network, and the inputting the object detection information into the visual relationship detection model for visual relationship detection to obtain the visual relationship between two objects comprises:
inputting object detection information output by the image detection model into the predicate identification network to identify predicates corresponding to predicate relations between every two objects to obtain a target predicate, wherein the target predicate represents a semantically adjusted predicate;
and obtaining the visual relationship according to the target predicate and the object corresponding to the target predicate.
3. The image processing method of claim 2, wherein the predicate identification network comprises an initial relevance calculation layer and a semantic adjustment layer, and the inputting the object detection information into the predicate identification network for predicate identification between two objects to obtain a target predicate comprises:
inputting the object detection information and preset predicates into the initial relevance calculation layer, and performing relevance calculation on predicates corresponding to every two pieces of object detection information and each preset predicate to obtain initial relevance distribution information, wherein the initial relevance distribution information represents the relevance between the predicates corresponding to the two pieces of object detection information and each preset predicate before semantic adjustment;
inputting the initial relevancy distribution information into a semantic adjustment layer, performing predicate semantic adjustment on the initial relevancy distribution information based on the preset matrix to obtain target relevancy distribution information, wherein the target relevancy distribution information represents the relevancy between a predicate corresponding to the two pairs of object detection information after semantic adjustment and each preset predicate;
and determining the target predicate according to the target relevance distribution information.
4. The image processing method of claim 3, wherein the inputting the initial relevancy distribution information into a semantic adjustment layer, and performing predicate semantic adjustment on the initial relevancy distribution information based on the preset matrix to obtain the target relevancy distribution information comprises:
determining an initial predicate according to the initial correlation degree distribution information;
carrying out predicate semantic adjustment on the initial relevancy distribution information based on a semantic adjustment matrix in the preset matrix under the condition that the initial predicate is a general predicate, wherein the general predicate represents a predicate of which the use probability is greater than a preset threshold in the preset predicate;
and under the condition that the initial predicate is a non-universal predicate, determining the initial relevancy distribution information as the target relevancy distribution information based on a semantic keeping matrix in the preset matrix, wherein the non-universal predicate characterizes the predicate of which the use probability is smaller than a preset threshold value in the preset predicate.
5. The image processing method according to claim 1, characterized in that the method further comprises:
inputting an annotation image into the image detection model for object detection to obtain training object detection information corresponding to each object in the annotation image, wherein the annotation image is labeled with a reference visual relationship between every two objects;
inputting the training object detection information into a first model to be trained for visual relationship detection to obtain a first training visual relationship between every two objects, wherein the first training visual relationship represents an interactive relationship between every two objects in the labeled image obtained through the first model to be trained;
inputting the first training visual relationship and training object detection information corresponding to the first training visual relationship into a second model to be trained for scene graph generation to obtain a first training scene graph corresponding to the labeled image, wherein the first training scene graph is structural information labeled with the first training visual relationship between every two objects;
and training the first model to be trained and the second model to be trained according to the first training visual relationship and the reference visual relationship to obtain a first visual relationship detection model and an initial scene graph generation model, wherein the first visual relationship detection model is a visual relationship detection model without a preset matrix.
6. The image processing method according to claim 5, wherein after the training of the first model to be trained and the second model to be trained according to the first training visual relationship and the reference visual relationship to obtain the first visual relationship detection model and the initial scene graph generation model, the method further comprises:
detecting the word frequency information corresponding to each reference predicate in the reference visual relationship;
classifying the reference predicates according to preset word frequency segmentation information and the word frequency information to obtain the type of the reference predicates corresponding to each labeled image;
combining the first visual relation detection model with a preset matrix to obtain a second visual relation detection model;
inputting the training object detection information into the second visual relation detection model for visual relation detection to obtain a second training visual relation between every two objects, wherein the second training visual relation represents an interactive relation between every two objects in the labeled image obtained through the second visual relation detection model;
inputting the second training visual relationship and the training object detection information corresponding to the second training visual relationship into the initial scene graph generation model for scene graph generation to obtain a second training scene graph corresponding to the labeled image, wherein the second training scene graph is structural information marked with the second training visual relationship between every two objects;
and adjusting the second visual relation detection model and the initial scene graph generation model based on the reference predicate type, the second training visual relation and the reference visual relation corresponding to each labeled image to obtain the visual relation detection model and the scene graph generation model.
7. The image processing method according to claim 5, characterized in that the method further comprises:
inputting the training object detection information into the first visual relation detection model for visual relation detection to obtain an initial visual relation between every two objects, wherein the initial visual relation represents an interactive relation between every two objects in the labeled image obtained through the first visual relation detection model;
inputting the initial visual relationship and the training object detection information corresponding to the initial visual relationship into the initial scene graph generation model for scene graph generation to obtain an initial scene graph corresponding to the labeled image, wherein the initial scene graph is the structural information labeled with the initial visual relationship between every two objects;
determining an initial matrix according to a predicate in the initial visual relationship and a reference predicate in the reference visual relationship;
and obtaining a preset matrix according to the normalization matrix corresponding to the initial matrix and the identity matrix corresponding to the initial matrix.
8. An image processing apparatus, characterized in that the apparatus comprises:
the object detection module is configured to input an image to be processed into an image detection model for object detection to obtain object detection information corresponding to at least two objects in the image to be processed;
the visual relationship detection module is configured to input the object detection information into a visual relationship detection model for visual relationship detection to obtain a visual relationship between every two objects, and the visual relationship represents an interactive relationship between every two objects in the image to be processed;
and the scene graph generation module is configured to execute inputting the visual relationship and the object detection information corresponding to the visual relationship into a scene graph generation model for scene graph generation, so as to obtain a target scene graph corresponding to the image to be processed, wherein the target scene graph is the structural information marked with the visual relationship between every two objects.
9. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the image processing method of any one of claims 1 to 7.
10. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the image processing method of any of claims 1 to 7.
CN202110693496.5A 2021-06-22 2021-06-22 Image processing method and device, electronic equipment and storage medium Pending CN113869099A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110693496.5A CN113869099A (en) 2021-06-22 2021-06-22 Image processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110693496.5A CN113869099A (en) 2021-06-22 2021-06-22 Image processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113869099A true CN113869099A (en) 2021-12-31

Family

ID=78989959

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110693496.5A Pending CN113869099A (en) 2021-06-22 2021-06-22 Image processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113869099A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114511779A (en) * 2022-01-20 2022-05-17 电子科技大学 Training method of scene graph generation model, and scene graph generation method and device

Similar Documents

Publication Publication Date Title
CN110363210B (en) Training method and server for image semantic segmentation model
WO2019119505A1 (en) Face recognition method and device, computer device and storage medium
CN107526799B (en) Knowledge graph construction method based on deep learning
CN108446390B (en) Method and device for pushing information
CN108288051B (en) Pedestrian re-recognition model training method and device, electronic equipment and storage medium
CN110348362B (en) Label generation method, video processing method, device, electronic equipment and storage medium
CN110740389B (en) Video positioning method, video positioning device, computer readable medium and electronic equipment
US11797544B2 (en) Automated search recipe generation
WO2022105118A1 (en) Image-based health status identification method and apparatus, device and storage medium
WO2020019591A1 (en) Method and device used for generating information
CN109086834B (en) Character recognition method, character recognition device, electronic equipment and storage medium
CN113722474A (en) Text classification method, device, equipment and storage medium
CN112818995B (en) Image classification method, device, electronic equipment and storage medium
CN110909784A (en) Training method and device of image recognition model and electronic equipment
CN111832581A (en) Lung feature recognition method and device, computer equipment and storage medium
US9208404B2 (en) Object detection with boosted exemplars
CN115984930A (en) Micro expression recognition method and device and micro expression recognition model training method
Qin et al. Finger-vein quality assessment based on deep features from grayscale and binary images
CN113128526B (en) Image recognition method and device, electronic equipment and computer-readable storage medium
CN113590798B (en) Dialog intention recognition, training method for a model for recognizing dialog intention
CN113869099A (en) Image processing method and device, electronic equipment and storage medium
CN113705310A (en) Feature learning method, target object identification method and corresponding device
CN114943549A (en) Advertisement delivery method and device
CN112070744A (en) Face recognition method, system, device and readable storage medium
CN113704623B (en) Data recommendation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination