CN111475661A - Method and device for constructing scene graph based on limited tags and computer equipment

Info

Publication number
CN111475661A
CN111475661A (application CN202010206574.XA; granted as CN111475661B)
Authority
CN
China
Prior art keywords
entity
entities
scene graph
image
trained
Prior art date
Legal status
Granted
Application number
CN202010206574.XA
Other languages
Chinese (zh)
Other versions
CN111475661B (en)
Inventor
陈海波 (Chen Haibo)
Current Assignee
Shenlan Robot Shanghai Co ltd
Original Assignee
Deep Blue Technology Shanghai Co Ltd
Priority date
Filing date
Publication date
Application filed by Deep Blue Technology Shanghai Co Ltd
Priority to CN202010206574.XA
Publication of CN111475661A
Application granted
Publication of CN111475661B
Legal status: Active (granted)

Classifications

    • G PHYSICS > G06F Electric digital data processing > G06F 16/50 Information retrieval of still image data > G06F 16/51 Indexing; Data structures therefor; Storage structures
    • G PHYSICS > G06F Electric digital data processing > G06F 16/50 Information retrieval of still image data > G06F 16/55 Clustering; Classification
    • G PHYSICS > G06N Computing arrangements based on specific computational models > G06N 20/00 Machine learning
    • G PHYSICS > G06N Computing arrangements based on specific computational models > G06N 3/04 Architecture, e.g. interconnection topology > G06N 3/045 Combinations of networks
    • Y GENERAL TAGGING > Y02D Climate change mitigation technologies in information and communication technologies > Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a method and a device for constructing a scene graph based on limited labels, and computer equipment, relating to the technical field of image processing and solving the problem of poor scene graph construction accuracy in the prior art. The method comprises the following steps: acquiring an image for which a scene graph is to be constructed; performing entity detection processing on the image through a trained scene graph generation model to determine a bounding box and a label corresponding to each entity in the image, wherein an entity corresponds to a person and/or an article appearing in the image, and the label is used for representing information identifying the entity; and performing entity relationship prediction processing on the bounding box and the label of each entity through the trained scene graph generation model to obtain an initial scene graph of each entity, wherein the initial scene graph comprises the label of each entity and a plurality of relationships corresponding to each entity, and the plurality of relationships corresponding to an entity are used for representing association relationships between that entity and entities other than itself.

Description

Method and device for constructing scene graph based on limited tags and computer equipment
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method and an apparatus for constructing a scene graph based on limited tags, and a computer device.
Background
In the prior art, describing a visual image means describing the scene content of the image as completely as possible; besides the salient objects in the image, the relationships between objects are also key to understanding the content of the image.
However, current methods for constructing a scene graph supplement missing objects by manual labeling, that is, only the information of the object itself is labeled, and missing relationships between the object and other objects cannot be supplemented.
The prior art therefore suffers from the technical problem of poor scene graph construction accuracy.
Disclosure of Invention
The application provides a method and a device for constructing a scene graph based on limited tags, and computer equipment, which are used for solving the technical problem of poor scene graph construction accuracy in the prior art. The technical solution of the application is as follows:
in a first aspect, a method for constructing a scene graph based on limited tags is provided, the method including:
acquiring an image of a scene graph to be constructed;
carrying out entity detection processing on the image through a trained scene graph generation model to determine a bounding box and a label corresponding to each entity in the image, wherein an entity corresponds to a person and/or an article appearing in the image, and the label is used for representing information identifying the entity;
and performing entity relationship prediction processing on the bounding box and the label of each entity through the trained scene graph generation model to obtain an initial scene graph of each entity, wherein the initial scene graph comprises the label of each entity and a plurality of relationships corresponding to each entity, and the plurality of relationships corresponding to an entity are used for representing association relationships between that entity and entities other than itself.
In a possible implementation manner, the performing entity relationship prediction processing on the bounding box and the label of each entity through the trained scene graph generation model includes:
determining the position information of each entity according to the bounding box corresponding to each entity;
obtaining spatial feature vectors of the entity pairs corresponding to each subject entity according to position information between subject entities and object entities among the entities, wherein, when the spatial feature vectors of the entity pairs corresponding to a first entity are determined, the subject entity in each such entity pair represents the first entity in the image and the object entity represents an entity in the image other than the first entity;
clustering the spatial feature vectors of the entity pairs corresponding to all subject entities to determine spatial diversity feature vectors of the entity pair relationships;
performing word embedding processing on the tags of the entities to determine the category tags of the subject entities and the category tags of the object entities in the entities, and performing unified vectorization processing on the category tags of the subject entities and the category tags of the object entities in the entities to obtain category feature vectors of entity-pair relationships, wherein the category tags are used for representing category attributes of the entities in the image;
and counting the number of entity pairs corresponding to each entity pair relationship to determine the category diversity feature vector of the entity pair relationships.
In a possible implementation manner, performing entity relationship prediction processing on the bounding box and the label of each entity through the trained scene graph generation model to obtain an initial scene graph of each entity, including:
carrying out feature vector selection processing on the space feature vector, the category feature vector, the space diversity feature vector and the category diversity feature vector of the entity pair relationship;
and selecting the processed information and the label according to the feature vector to obtain the initial scene graph of each entity.
In one possible embodiment, the method further comprises:
determining global information of the scene graph to be constructed, wherein the global information comprises information related to a specific scene corresponding to the scene graph to be constructed;
and adding the global information to the initial scene graph to obtain a global scene graph of the image, wherein the global scene graph comprises tags corresponding to entities in the image, a plurality of relationships corresponding to the entities and information of a specific scene corresponding to the image.
In a possible implementation manner, the trained scene graph generation model is trained by the following method, including:
determining a first scene image data set, and carrying out limited labeling on entities and entity relations in the first scene image data set to obtain a limited image semantic data set of a first scene to be trained, wherein the entity relations at least comprise position relations and interaction relations between the entities;
inputting the limited image semantic data set of the first scene to be trained and a plurality of predetermined scene image semantic data sets containing entities in the first scene into a preset scene graph generation model for training to obtain a plurality of output results, wherein the plurality of output results are obtained by training the preset scene graph generation model a plurality of times;
and comparing the plurality of output results with the limited image semantic data set to obtain a plurality of comparison results, and adjusting model parameters of the preset scene graph generation model according to the plurality of comparison results to obtain the trained scene graph generation model.
In a possible implementation manner, adjusting the model parameters of the preset scene graph generation model according to the comparison results to obtain a trained scene graph generation model includes:
determining an overall loss function, wherein the overall loss function is obtained by performing weighted calculation on a first loss function determined by performing the entity detection processing on the image and a second loss function determined by performing the entity relationship prediction on the image;
after the preset scene graph generation model is trained, carrying out convergence inspection on the trained preset scene graph generation model through the overall loss function;
and when the trained preset scene graph generation model is determined to be converged, obtaining the trained scene graph generation model.
In a second aspect, an apparatus for constructing a scene graph based on finite labels is provided, the apparatus comprising:
the acquisition unit is used for acquiring an image of a scene graph to be constructed;
the entity detection processing unit is used for carrying out entity detection processing on the image through the trained scene graph generation model so as to determine a bounding box and a label corresponding to each entity in the image, wherein an entity corresponds to a person and/or an article appearing in the image, and the label is used for representing information identifying the entity;
and the generating unit is used for performing entity relationship prediction processing on the bounding box and the label of each entity through the trained scene graph generation model to obtain an initial scene graph of each entity, wherein the initial scene graph comprises the label of each entity and a plurality of relationships corresponding to each entity, and the plurality of relationships corresponding to an entity are used for representing association relationships between that entity and entities other than itself.
In a possible implementation, the generating unit is configured to:
determining the position information of each entity according to the bounding box corresponding to each entity;
obtaining spatial feature vectors of the entity pairs corresponding to each subject entity according to position information between subject entities and object entities among the entities, wherein, when the spatial feature vectors of the entity pairs corresponding to a first entity are determined, the subject entity in each such entity pair represents the first entity in the image and the object entity represents an entity in the image other than the first entity;
clustering the spatial feature vectors of the entity pairs corresponding to all subject entities to determine spatial diversity feature vectors of the entity pair relationships;
performing word embedding processing on the tags of the entities to determine the category tags of the subject entities and the category tags of the object entities in the entities, and performing unified vectorization processing on the category tags of the subject entities and the category tags of the object entities in the entities to obtain category feature vectors of entity-pair relationships, wherein the category tags are used for representing category attributes of the entities in the image;
and counting the number of entity pairs corresponding to each entity pair relationship to determine the category diversity feature vector of the entity pair relationships.
In a possible implementation, the generating unit is configured to:
carrying out feature vector selection processing on the space feature vector, the category feature vector, the space diversity feature vector and the category diversity feature vector of the entity pair relationship;
and selecting the processed information and the label according to the feature vector to obtain the initial scene graph of each entity.
In a possible implementation, the apparatus further comprises a processing unit configured to:
determining global information of the scene graph to be constructed, wherein the global information comprises information related to a specific scene corresponding to the scene graph to be constructed;
and adding the global information to the initial scene graph to obtain a global scene graph of the image, wherein the global scene graph comprises tags corresponding to entities in the image, a plurality of relationships corresponding to the entities and information of a specific scene corresponding to the image.
In a possible implementation manner, the trained scene graph generation model is obtained by training through a model training unit, where the model training unit is configured to:
determining a first scene image data set, and carrying out limited labeling on entities and entity relations in the first scene image data set to obtain a limited image semantic data set of a first scene to be trained, wherein the entity relations at least comprise position relations and interaction relations between the entities;
inputting the limited image semantic data set of the first scene to be trained and a plurality of predetermined scene image semantic data sets containing entities in the first scene into a preset scene graph generation model for training to obtain a plurality of output results, wherein the plurality of output results are obtained by training the preset scene graph generation model a plurality of times;
and comparing the plurality of output results with the limited image semantic data set to obtain a plurality of comparison results, and adjusting model parameters of the preset scene graph generation model according to the plurality of comparison results to obtain the trained scene graph generation model.
In a possible embodiment, the model training unit is configured to:
determining an overall loss function, wherein the overall loss function is obtained by performing weighted calculation on a first loss function determined by performing the entity detection processing on the image and a second loss function determined by performing the entity relationship prediction on the image;
after the preset scene graph generation model is trained, carrying out convergence inspection on the trained preset scene graph generation model through the overall loss function;
and when the trained preset scene graph generation model is determined to be converged, obtaining the trained scene graph generation model.
In a third aspect, a computer device is provided, the computer device comprising:
a memory for storing program instructions;
a processor for calling the program instructions stored in the memory and executing the steps included in any of the methods of the first aspect according to the obtained program instructions.
In a fourth aspect, there is provided a storage medium having stored thereon computer-executable instructions for causing a computer device to perform the steps included in any one of the methods of the first aspect.
In a fifth aspect, a computer program product is provided, which, when run on a computer device, enables the computer device to perform the steps comprised in any of the methods of the first aspect.
The technical scheme provided by the embodiment of the application at least has the following beneficial effects:
in the embodiment of the application, an image of a scene graph to be constructed can be obtained, and entity detection processing is performed on the image through a trained scene graph generation model, so that a bounding box and a label corresponding to each entity in the image are determined; and performing entity relation prediction processing on the bounding box and the label of each entity through the trained scene graph generation model so as to obtain an initial scene graph of each entity. Specifically, the initial scene graph includes the label of each entity and the multiple relationships corresponding to each entity, and the multiple relationships corresponding to each entity are used to represent the association relationship between the entity and other entities except the entity itself.
That is to say, in the embodiment of the present application, the trained scene graph generation model can predict the entity pairs corresponding to the entities in the image and the various relationships between them. The embodiment of the application uses machine learning to simulate and replace manual data analysis and model construction, so that the negative effects of misjudgment and incomplete analysis caused by the limited analytical ability and subjectivity of a human analyst are eliminated as far as possible; the accuracy of analysis and detection can therefore be improved to a certain extent, and the accuracy of the scene graph is improved accordingly.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments are briefly introduced below; obviously, the drawings in the following description show only some embodiments of the present application.
FIG. 1 is a schematic diagram of an application scenario in an embodiment of the present application;
FIG. 2 is a flowchart of a method for constructing a scene graph based on finite tags in an embodiment of the present application;
FIG. 3 is a block diagram illustrating an apparatus for constructing a scene graph based on finite tags according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a computer device in an embodiment of the present application;
FIG. 5 is another schematic structural diagram of a computer device in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. In the present application, the embodiments and features of the embodiments may be arbitrarily combined with each other without conflict. Also, while a logical order is shown in the flow diagrams, in some cases, the steps shown or described may be performed in an order different than here.
The terms "first" and "second" in the description and claims of the present application are used for distinguishing between different objects and not for describing a particular order. Furthermore, the term "comprises" and any variations thereof, which are intended to cover non-exclusive protection. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
In the prior art, a visual scene graph is generally constructed by integrating and expanding an original data set, and this integration and expansion is based on supplementary text information; as a result, the labeled information may be inaccurate, the labeled relationships may not correspond or may even be wrong, and the accuracy of a scene graph determined from such a data set is low.
In view of this, the embodiment of the present application provides a method for constructing a scene graph using limited tags. With this method, the multiple relationships between entities and entity pairs in an image for which a scene graph is to be constructed can be predicted in a machine learning manner so as to generate the scene graph, thereby improving the accuracy of scene graph construction.
After introducing the design concept of the embodiments of the present application, the application scenarios to which the technical solution of constructing a scene graph based on limited tags is applicable are briefly described below. It should be noted that the application scenarios described in the embodiments of the present application are intended to describe the technical solution more clearly and do not limit the technical solution provided by the embodiments; a person skilled in the art will appreciate that, as new application scenarios emerge, the technical solution provided by the embodiments of the present application remains applicable to similar technical problems.
In the embodiment of the present application, the technical solution may be applied to any scene for which a scene graph needs to be constructed, such as a campus, a mall, and the like, and the embodiment of the present application is not limited in this respect.
In a specific implementation process, reference is made to the application scenario diagram shown in FIG. 1. FIG. 1 includes two kinds of parts: processing devices each including a data set processing unit, and a computer device. It should be noted that FIG. 1 shows only three processing devices including a data set processing unit (i.e., processing device 1, processing device 2, and processing device 3) and one computer device; in a specific implementation, a plurality of processing devices may interact with one computer device, or a plurality of processing devices may interact with a plurality of computer devices.
In the embodiment of the application, the processing device may acquire in advance a plurality of data information sets including entities in the campus scene and then send the acquired data information sets to the computer device. The computer device may train a scene graph generation model using the plurality of data information sets including entities in the campus scene to obtain a trained scene graph generation model, and may then use the trained scene graph generation model to process an image, sent by the processing device, for which a scene graph is to be constructed, so as to obtain a global scene graph of that image. In addition, it should be noted that in the embodiments of the present application, an entity may be understood as a person and/or an article in the image.
To further explain the scheme of constructing a scene graph based on limited tags provided by the embodiments of the application, a detailed description is given below in conjunction with the accompanying drawings and the specific embodiments. Although the embodiments of the present application present the method operation steps shown in the following embodiments or figures, the method may include more or fewer operation steps based on conventional or non-inventive labor. For steps between which no necessary causal relationship logically exists, the order of execution is not limited to that provided by the embodiments of the present application. In an actual processing procedure or when executed by a device, the method can be executed sequentially or in parallel according to the order shown in the embodiments or the figures (for example, in an environment of parallel processors or multi-threaded processing).
The method for constructing a scene graph based on limited labels in the embodiment of the present application is described below with reference to the method flowchart shown in fig. 2, where the steps shown in fig. 2 may be executed by the computer device shown in fig. 1. In an implementation, the computer device may be a server, such as a personal computer, a midrange computer, a cluster of computers, and so forth.
The technical scheme provided by the embodiment of the application is described in the following with the accompanying drawings of the specification.
Before describing the method for constructing a scene graph based on finite labels, the following describes a process of obtaining a trained scene graph generation model in the embodiment of the present application.
In this embodiment of the present application, a campus scene image data set may be determined first. Specifically, the campus scene data set may be a plurality of image data of a primary school campus scene, or a plurality of image data of a middle school or college campus scene, which is not limited in this embodiment of the present application.
In the embodiment of the application, after the campus scene image data set is determined, entities and entity relations in the campus scene image data set can be subjected to limited labeling in a manual labeling mode, so that a limited image semantic data set of a campus scene to be trained can be obtained.
In a specific implementation process, the limited annotation in the embodiment of the present application may be understood as manually annotating a predetermined number of image data of typical scenes among the campus scene images, for example, manually annotating 3 image data of a primary school campus scene, 4 image data of a middle school campus scene, and 5 image data of a college campus; it may also be understood as performing annotation support processing on a predetermined number of image data of typical scenes in different regions. That is, the limited annotation in the embodiment of the present application can be understood as a complete annotation of some typical campus scenes.
For example, student A may be annotated as standing at a desk, student B as sitting behind student A, teacher L as drawing on a blackboard, and so on.
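The application does not prescribe a storage format for these limited annotations; the following sketch shows one plausible per-image encoding, in which the file name, labels, boxes, and relation triples are all illustrative assumptions rather than values taken from the patent.

```python
# Illustrative sketch only: one possible record for a single annotated image,
# holding entity boxes/labels plus (subject, predicate, object) relation triples.
limited_annotation = {
    "image_id": "campus_0001.jpg",            # hypothetical file name
    "entities": [
        {"id": 0, "label": "student", "bbox": [120, 80, 260, 400]},
        {"id": 1, "label": "desk",    "bbox": [100, 300, 320, 460]},
        {"id": 2, "label": "teacher", "bbox": [500, 60, 640, 420]},
    ],
    "relations": [
        {"subject": 0, "predicate": "standing at", "object": 1},  # positional relation
        {"subject": 2, "predicate": "drawing on",  "object": 1},  # interaction relation
    ],
}
```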
In the embodiment of the present application, a plurality of scene image data sets including entities in the campus scene (for example, indoor home scenes, mall scenes, outdoor sports fields, and the like) may be processed in the same manner as used to obtain the limited image semantic data set of the campus scene to be trained, so that a plurality of scene image semantic data sets including entities in the campus scene are obtained. After the limited image semantic data set of the campus scene to be trained and the plurality of scene image semantic data sets including entities in the campus scene are obtained, the image data in these data sets can be input into a preset scene graph generation model for training to obtain a plurality of output results. Because the training data consist not only of the campus scene image data set but also of other data sets containing the entities that appear in campus scenes, more comprehensive relationship information and labeling information can be obtained, further improving the accuracy of scene graph construction.
In the embodiment of the application, the output result obtained from the current round of training of the scene graph generation model can be compared with the limited image semantic data set, and the model parameters of the preset scene graph generation model are adjusted according to the comparison result; the adjusted scene graph generation model is then trained again, its output result is compared with the limited image semantic data set again, and the model parameters are adjusted according to that comparison result. That is to say, the output results of multiple rounds of the scene graph generation model can be compared with the limited image semantic data set to obtain a plurality of comparison results, and the model parameters of the preset scene graph generation model can be adjusted according to these comparison results to obtain the trained scene graph generation model.
In a specific implementation process, after the preset scene graph generation model is trained, a convergence check may be performed on the trained preset scene graph generation model through an overall loss function. Specifically, a first loss function determined by the entity detection processing and a second loss function determined by the entity relationship prediction may be combined by weighted calculation to obtain the overall loss function. For example, a recurrent neural network (RNN) algorithm satisfying the graph permutation invariance principle may be adopted to add the relationship label prediction error and the entity detection error into the overall loss function, thereby determining the overall loss function. When the trained preset scene graph generation model is determined to have converged, the trained scene graph generation model is obtained.
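As a rough illustration of the weighted overall loss and the convergence check described above, the following Python sketch combines an assumed entity-detection loss and relationship-prediction loss; the weights and the tolerance-based stopping rule are assumptions, not values fixed by the application.

```python
import torch

def overall_loss(det_loss: torch.Tensor, rel_loss: torch.Tensor,
                 w_det: float = 1.0, w_rel: float = 1.0) -> torch.Tensor:
    """Weighted combination of the entity-detection loss (first loss function)
    and the relationship-prediction loss (second loss function)."""
    return w_det * det_loss + w_rel * rel_loss

def converged(prev_loss: float, curr_loss: float, tol: float = 1e-4) -> bool:
    """Assumed convergence check: stop when the overall loss changes by less
    than a small tolerance between training rounds."""
    return abs(prev_loss - curr_loss) < tol
```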
In the embodiment of the application, the preset scene graph generation model can be trained through the limited image semantic data sets of the campus scenes to be trained and the scene image semantic data sets containing the entities in the campus scenes, so that the trained scene graph generation model is obtained.
Further, in the embodiment of the present application, after the trained scene graph generation model is obtained, the image for which a scene graph is to be constructed may be processed with this model; specifically, refer to the flowchart shown in fig. 2.
Step 201: and acquiring an image of a scene graph to be constructed.
Step 202: and carrying out entity detection processing on the image through the trained scene graph generation model to determine a bounding box and a label corresponding to each entity in the image, wherein the entity corresponds to people and/or articles appearing in the image, and the label is used for representing and identifying information of the entity.
Step 203: and carrying out entity relationship prediction processing on the bounding box and the label of each entity through the trained scene graph generation model to obtain an initial scene graph of each entity, wherein the initial scene graph comprises the label of each entity and a plurality of relationships corresponding to each entity, and the plurality of relationships corresponding to each entity are used for representing the incidence relationship between the entity and other entities except the entity.
In this embodiment of the present application, an image for which a scene graph is to be constructed may be acquired, and entity detection processing is then performed on the image using the trained scene graph generation model obtained through the training described above; for example, the Mask R-CNN algorithm may be used for this processing. In this way, the bounding box and the label corresponding to each entity in the image can be obtained; that is, after the entity detection processing, the label and bounding box information of every article or person in the image is available. Specifically, the label is an identifier characterizing the information of the indicated entity, i.e., the label is used to represent information identifying the entity, such as a person, a table, a chair, and the like. In other words, the entity detection processing determines which entities exist in the image and determines each entity's detection box, i.e., the approximate extent of the entity.
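The application only names Mask R-CNN for the entity detection step; a minimal sketch using the off-the-shelf torchvision implementation might look as follows. The pretrained weights and the 0.5 score threshold are assumptions, not part of the patent.

```python
import torch
import torchvision

# Assumed setup: a standard torchvision Mask R-CNN with COCO-pretrained weights.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

def detect_entities(image: torch.Tensor, score_thresh: float = 0.5):
    """image: float tensor of shape [3, H, W] with values in [0, 1].
    Returns the bounding boxes and class labels of the detected entities."""
    with torch.no_grad():
        out = model([image])[0]
    keep = out["scores"] > score_thresh
    return out["boxes"][keep], out["labels"][keep]
```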
In this embodiment of the application, after the bounding boxes and the labels corresponding to the entities in the image are determined, the position information of the entities can be determined using a deep network algorithm according to the bounding box corresponding to each entity, and the spatial feature vectors of the entity pairs formed by subject entities and object entities can then be obtained according to the position information between the subject entities and the object entities. It should be noted that, in this embodiment of the application, when determining the relationship features of a first entity, the other entities that can establish a relationship with the first entity may be referred to as object entities and the first entity as the subject entity; that is, any entity in the image may serve either as a subject entity or as an object entity, and in any relationship corresponding to a subject entity, the subject entity and the object entity together are referred to as an entity pair. Further, after the spatial feature vectors of the entity pairs are determined, clustering processing can be performed on them to determine the spatial diversity feature vectors of the entity pair relationships. That is to say, in the present application, the spatial feature vectors of the entity pairs corresponding to all subject entities in the image are clustered, so that the spatial diversity feature vectors of the entity pair relationships can be accurately determined.
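A minimal sketch of the spatial feature and clustering step is given below. The particular box encoding and the use of k-means with an assumed number of clusters are illustrative choices, since the application does not fix them.

```python
import numpy as np
from sklearn.cluster import KMeans

def spatial_feature(subj_box, obj_box):
    """Assumed spatial feature for a (subject, object) pair: the normalized
    offset of the object box relative to the subject box plus log size ratios."""
    sx, sy, sx2, sy2 = subj_box
    ox, oy, ox2, oy2 = obj_box
    sw, sh = sx2 - sx, sy2 - sy
    ow, oh = ox2 - ox, oy2 - oy
    return np.array([(ox - sx) / sw, (oy - sy) / sh,
                     np.log(ow / sw), np.log(oh / sh)])

def spatial_diversity_vectors(pair_features, n_clusters=8):
    """Cluster the spatial features of all entity pairs; the cluster centers
    act as the 'spatial diversity' feature vectors of the pair relationships."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(np.stack(pair_features))
    return km.cluster_centers_, km.labels_
```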
In the embodiment of the present application, word embedding processing may be performed on the labels of the entities, so that the category label of the subject entity and the category label of the object entity among the entities can be determined. Specifically, this can be understood as processing the pre-stored relationships together with the subject entity label or the object entity label through a neural network, so as to obtain the category label of the subject entity or the category label of the object entity, where the category label is used to represent the category attribute of an entity in the image. For example, tables and stools are fixed objects in a classroom, while skipping ropes and soccer balls are movable objects. That is, the category label of the subject entity or the object entity is determined by mapping the pre-stored relationships onto the subject entity or the object entity. Further, the category label of the subject entity and the category label of the object entity among the entities can be subjected to unified vectorization processing, so that the category feature vector of the entity pair relationship can be obtained.
In the embodiment of the present application, the number of entity pairs corresponding to each entity pair relationship may also be counted, i.e., the category diversity characteristic of a given entity pair relationship is determined.
That is to say, in the embodiment of the present application, the category diversity features of the entity pair relationships, the category features of the entity pair relationships, the spatial diversity features of the entity pair relationships, and the spatial features of the entities may be determined by processing the position information and the label of each entity in the image. In other words, in the embodiment of the application, the trained scene graph generation model can determine not only the relationships of individual entities but also the various relationships of entity pairs, so that the accuracy of scene graph construction is improved.
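The category feature and category diversity computations could be sketched as follows; the embedding lookup `embed` and the concatenation scheme are assumptions, as the application does not name a specific word-embedding model.

```python
import numpy as np
from collections import Counter

def category_feature(subj_label: str, obj_label: str, embed) -> np.ndarray:
    """Unified vectorization of the subject and object category labels:
    here simply the concatenation of their word embeddings.
    `embed` is a hypothetical label-to-vector lookup (e.g. pretrained vectors)."""
    return np.concatenate([embed(subj_label), embed(obj_label)])

def category_diversity(pair_relations):
    """Count how many entity pairs share each relation; the counts serve as a
    simple category-diversity statistic for the entity pair relationships."""
    return Counter(rel for (_subj, rel, _obj) in pair_relations)
```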
Further, in this embodiment of the present application, feature vector selection processing may be performed on a spatial feature vector of each entity pair, a category feature vector of an entity pair relationship, a spatial diversity feature vector, and a category diversity feature vector.
In a specific implementation process, a CART decision tree can be used to perform preliminary heuristic feature selection on the spatial feature vector, the category feature vector, the spatial diversity feature vector and the category diversity feature vector of each entity pair relationship, that is, to determine from the multiple feature vectors the features significantly related to the relationship. Specifically, the selected features may be input into a graph-based factor generation model, so that the relationship corresponding to each entity in the image data set can be predicted.
In the embodiment of the present application, the initial scene graph of each entity may be obtained from the feature-selected information and the labels, that is, the initial scene graph of each entity is obtained from the multiple relationships corresponding to the entities and the labels of the entities.
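A minimal sketch of the CART-based heuristic feature selection is shown below, assuming the feature vectors have been concatenated into a matrix X with relation labels y; the top-k cut-off is an assumed criterion rather than one specified by the application.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def select_relation_features(X: np.ndarray, y: np.ndarray, top_k: int = 10):
    """Heuristic feature selection with a CART decision tree: fit the tree on
    the concatenated spatial / category / diversity features (X) against the
    relation labels (y) and keep the indices of the most important features."""
    tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
    return np.argsort(tree.feature_importances_)[::-1][:top_k]
```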
In a possible implementation manner, global information of the scene graph to be constructed may also be determined, where the global information includes information related to the specific scene corresponding to the scene graph to be constructed; the global information is then added to the initial scene graph, so that the global scene graph of the image is obtained, where the global scene graph includes the entities in the image, the diverse relationships corresponding to the entities, and the school type information and scene area information corresponding to the image. That is, the determined labels and corresponding relationships of the entities, together with the school type information and scene area information corresponding to the image, form a global scene graph, so that a scene graph with more comprehensive entity relationships and complete image information is obtained.
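Assembling the global scene graph from the initial scene graph and the global information could be sketched as below; the field names and the dictionary representation are illustrative assumptions, not mandated by the application.

```python
def build_global_scene_graph(initial_scene_graph: dict, global_info: dict) -> dict:
    """Extend the initial scene graph (entity labels plus predicted relations)
    with scene-level global information."""
    return {
        "entities": initial_scene_graph["entities"],    # labels per entity
        "relations": initial_scene_graph["relations"],  # diverse relations per entity pair
        "scene": global_info,                           # e.g. school type, scene area
    }

# Example usage with the campus scenario from the description:
global_graph = build_global_scene_graph(
    {"entities": ["student", "desk"],
     "relations": [("student", "standing at", "desk")]},
    {"school_type": "primary school", "scene_area": "classroom"},
)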
Based on the same inventive concept, an embodiment of the application provides a device for constructing a scene graph based on limited tags, and the device can realize the functions corresponding to the aforementioned method for constructing a scene graph based on limited tags. The device may be a hardware structure, a software module, or a hardware structure plus a software module. The device may be realized by a chip system, and the chip system may consist of a chip or may comprise a chip together with other discrete devices. Referring to fig. 3, the device for constructing a scene graph based on limited tags includes an obtaining unit 301, an entity detection processing unit 302, and a generating unit 303. Wherein:
an acquiring unit 301, configured to acquire an image of a scene graph to be constructed;
an entity detection processing unit 302, configured to perform entity detection processing on the image through a trained scene graph generation model to determine a bounding box and a label corresponding to each entity in the image, where an entity corresponds to a person and/or an article appearing in the image, and the label is used for representing information identifying the entity;
a generating unit 303, configured to perform entity relationship prediction processing on the bounding box and the label of each entity through the trained scene graph generation model to obtain an initial scene graph of each entity, where the initial scene graph includes the label of each entity and multiple relationships corresponding to each entity, and the multiple relationships corresponding to an entity are used to represent association relationships between that entity and entities other than itself.
In a possible implementation, the generating unit 303 is configured to:
determining the position information of each entity according to the bounding box corresponding to each entity;
obtaining spatial feature vectors of the entity pairs corresponding to each subject entity according to position information between subject entities and object entities among the entities, wherein, when the spatial feature vectors of the entity pairs corresponding to a first entity are determined, the subject entity in each such entity pair represents the first entity in the image and the object entity represents an entity in the image other than the first entity;
clustering the spatial feature vectors of the entity pairs corresponding to all subject entities to determine spatial diversity feature vectors of the entity pair relationships;
performing word embedding processing on the tags of the entities to determine the category tags of the subject entities and the category tags of the object entities in the entities, and performing unified vectorization processing on the category tags of the subject entities and the category tags of the object entities in the entities to obtain category feature vectors of entity-pair relationships, wherein the category tags are used for representing category attributes of the entities in the image;
and counting the number of entity pairs corresponding to each entity pair relationship to determine the category diversity feature vector of the entity pair relationships.
In a possible implementation, the generating unit 303 is configured to:
carrying out feature vector selection processing on the space feature vector, the category feature vector, the space diversity feature vector and the category diversity feature vector of the entity pair relationship;
and selecting the processed information and the label according to the feature vector to obtain the initial scene graph of each entity.
In a possible implementation, the apparatus further comprises a processing unit configured to:
determining global information of the scene graph to be constructed, wherein the global information comprises information related to a specific scene corresponding to the scene graph to be constructed;
and adding the global information to the initial scene graph to obtain a global scene graph of the image, wherein the global scene graph comprises tags corresponding to entities in the image, a plurality of relationships corresponding to the entities and information of a specific scene corresponding to the image.
In a possible implementation manner, the trained scene graph generation model is obtained by training through a model training unit, where the model training unit is configured to:
determining a first scene image data set, and carrying out limited labeling on entities and entity relations in the first scene image data set to obtain a limited image semantic data set of a first scene to be trained, wherein the entity relations at least comprise position relations and interaction relations between the entities;
inputting the limited image semantic data set of the first scene to be trained and a plurality of predetermined scene image semantic data sets containing entities in the first scene into a preset scene graph generation model for training to obtain a plurality of output results, wherein the plurality of output results are obtained by training the preset scene graph generation model a plurality of times;
and comparing the plurality of output results with the limited image semantic data set to obtain a plurality of comparison results, and adjusting model parameters of the preset scene graph generation model according to the plurality of comparison results to obtain the trained scene graph generation model.
In a possible embodiment, the model training unit is configured to:
determining an overall loss function, wherein the overall loss function is obtained by performing weighted calculation on a first loss function determined by performing the entity detection processing on the image and a second loss function determined by performing the entity relationship prediction on the image;
after the preset scene graph generation model is trained, carrying out convergence inspection on the trained preset scene graph generation model through the overall loss function;
and when the trained preset scene graph generation model is determined to be converged, obtaining the trained scene graph generation model.
For all relevant content of the steps of the method for constructing a scene graph based on limited tags shown in fig. 2, reference may be made to the functional descriptions of the corresponding functional modules of the device for constructing a scene graph based on limited tags, which are not repeated here.
The division of units in the embodiments of the present application is schematic and is merely a logical functional division; there may be other division manners in actual implementation. In addition, the functional units in the embodiments of the present application may be integrated into one processor, may exist alone physically, or two or more units may be integrated into one unit. The integrated unit can be realized in the form of hardware, or in the form of a software functional unit.
Based on the same inventive concept, an embodiment of the present application further provides a computer device. As shown in fig. 4, the computer device in the embodiment of the present application includes at least one processor 401, and a memory 402 and a communication interface 403 connected to the at least one processor 401. The specific connection medium between the processor 401 and the memory 402 is not limited in the embodiment of the present application; fig. 4 takes the connection between the processor 401 and the memory 402 through a bus 400 as an example, the bus 400 being represented by a thick line in fig. 4, and the connection manner between other components is only schematically illustrated and is not limiting. The bus 400 may be divided into an address bus, a data bus, a control bus, and the like; only one thick line is shown in fig. 4 for ease of illustration, but this does not mean there is only one bus or one type of bus.
In the embodiment of the present application, the memory 402 stores instructions executable by the at least one processor 401, and the at least one processor 401 may execute the steps included in the foregoing method for constructing a scene graph by using a limited tag by executing the instructions stored in the memory 402.
The processor 401 is the control center of the computer device, and may connect various parts of the entire computer device by using various interfaces and lines, and perform the various functions of the computer device and process data by running or executing instructions stored in the memory 402 and calling data stored in the memory 402, thereby monitoring the computer device as a whole. Optionally, the processor 401 may include one or more processing units, and the processor 401 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, the user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It will be appreciated that the modem processor may also not be integrated into the processor 401. In some embodiments, the processor 401 and the memory 402 may be implemented on the same chip, or in some embodiments they may be implemented separately on separate chips.
The processor 401 may be a general-purpose processor, such as a Central Processing Unit (CPU), digital signal processor, application specific integrated circuit, field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof, that may implement or perform the methods, steps, and logic blocks of the embodiments of the present application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method provided in connection with the embodiments of the present application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in a processor.
Memory 402, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The Memory 402 may include at least one type of storage medium, for example, a flash Memory, a hard disk, a multimedia card, a card-type Memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read Only Memory (PROM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a magnetic Memory, a magnetic disk, an optical disk, and so on. The memory 402 may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 402 in the embodiments of the present application may also be a circuit or any other device capable of performing a storage function for storing program instructions and/or data. The communication interface 403 is a transmission interface that can be used for communication, and data can be received or transmitted through the communication interface 403.
With reference to the further structural schematic diagram of the computer apparatus shown in fig. 5, the computer apparatus also includes a basic input/output system (I/O system) 501 for facilitating information transfer between the various devices within the computer apparatus, and a mass storage device 505 for storing an operating system 502, application programs 503 and other program modules 504.
The basic input/output system 501 comprises a display 506 for displaying information and an input device 507, such as a mouse, keyboard, etc., for user input of information. Wherein a display 506 and an input device 507 are coupled to the processor 401 through the basic input/output system 501 coupled to the system bus 400. The basic input/output system 501 may also include an input/output controller for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, an input-output controller may also provide output to a display screen, a printer, or other type of output device.
The mass storage device 505 is connected to the processor 401 through a mass storage controller (not shown) connected to the system bus 400. The mass storage device 505 and its associated computer-readable media provide non-volatile storage for the computer device. That is, the mass storage device 505 may include a computer-readable medium (not shown), such as a hard disk or CD-ROM drive.
According to various embodiments of the present application, the computer device may also operate by connecting, via a network such as the Internet, to a remote computer on the network. That is, the computer device may be connected to the network 508 via the communication interface 403 coupled to the system bus 400, or may be connected to another type of network or remote computer system (not shown) using the communication interface 403.
In an exemplary embodiment, there is also provided a storage medium comprising instructions, such as a memory 402 comprising instructions, executable by a processor 401 of an apparatus to perform the method described above. Alternatively, the storage medium may be a non-transitory computer readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In some possible embodiments, various aspects of the method for constructing a scene graph with limited tags provided by the present application may also be implemented in the form of a program product including program code for causing a computer device to perform the steps of the method for constructing a scene graph with limited tags according to various exemplary embodiments of the present application described above in this specification when the program product is run on the computer device.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A method for constructing a scene graph based on limited tags, the method comprising:
acquiring an image of a scene graph to be constructed;
carrying out entity detection processing on the image through a trained scene graph generation model to determine a bounding box and a label corresponding to each entity in the image, wherein the entities correspond to people and/or articles appearing in the image, and the label is used for representing identification information of the entity;
and performing entity relationship prediction processing on the bounding box and the label of each entity through the trained scene graph generation model to obtain an initial scene graph of each entity, wherein the initial scene graph comprises the label of each entity and a plurality of relationships corresponding to each entity, and the plurality of relationships corresponding to each entity are used for representing association relationships between the entity and other entities except the entity.
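The two-stage pipeline recited in claim 1, entity detection followed by entity relationship prediction, can be pictured with the minimal Python sketch below. The `detector` and `relation_predictor` callables, their return formats and the dictionary layout of the scene graph are assumptions made purely for illustration; the claim does not prescribe any particular interface or framework.

```python
# Minimal sketch of the claim 1 pipeline; detector / relation_predictor are
# assumed callables, not components disclosed by the patent.
def build_initial_scene_graph(image, detector, relation_predictor):
    # Step 1: entity detection -> one bounding box and one label per entity.
    entities = detector(image)  # e.g. [{"box": (x1, y1, x2, y2), "label": "person"}, ...]

    # Step 2: relationship prediction over ordered (subject, object) pairs.
    relations = []
    for i, subject in enumerate(entities):
        for j, obj in enumerate(entities):
            if i == j:
                continue
            predicate = relation_predictor(subject, obj)  # e.g. "holding", "next to"
            if predicate is not None:
                relations.append((i, predicate, j))

    # The initial scene graph: entity labels plus their associated relationships.
    return {"entities": entities, "relations": relations}
```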
2. The method of claim 1, wherein performing entity relationship prediction processing on the bounding boxes and labels of the respective entities through the trained scene graph generation model comprises:
determining the position information of each entity according to the bounding box corresponding to each entity;
obtaining spatial feature vectors of the entity pairs corresponding to each subject entity among the entities according to position information between the subject entity and the object entities among the entities, wherein, when the spatial feature vectors of the entity pairs corresponding to a first entity are determined, the subject entity in each entity pair is used for representing the first entity in the image, and the object entity in each entity pair is used for representing an entity other than the first entity in the image;
clustering the spatial feature vectors of the entity pairs corresponding to all the subject entities to determine spatial diversity feature vectors of the entity-pair relationships;
performing word embedding processing on the labels of the entities to determine the category label of the subject entity and the category labels of the object entities among the entities, and performing unified vectorization processing on the category label of the subject entity and the category labels of the object entities to obtain category feature vectors of the entity-pair relationships, wherein the category labels are used for representing category attributes of the entities in the image;
and counting the number of entity pairs corresponding to each entity-pair relationship to determine category diversity feature vectors of the entity-pair relationships.
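As a rough illustration of the spatial feature and spatial diversity steps in claim 2, the sketch below encodes the relative geometry of a subject/object bounding-box pair and clusters the resulting vectors with k-means. The particular four-dimensional encoding and the choice of scikit-learn's k-means are assumptions for illustration only; the claim merely requires spatial feature vectors derived from position information and a clustering step.

```python
# Illustrative only: the feature definition and the use of k-means are
# assumptions, not details taken from the patent.
import numpy as np
from sklearn.cluster import KMeans

def spatial_feature(subject_box, object_box):
    """Encode the relative position and scale of a (subject, object) box pair."""
    sx1, sy1, sx2, sy2 = subject_box
    ox1, oy1, ox2, oy2 = object_box
    sw, sh = sx2 - sx1, sy2 - sy1
    return np.array([
        (ox1 - sx1) / sw,   # horizontal offset of object relative to subject
        (oy1 - sy1) / sh,   # vertical offset of object relative to subject
        (ox2 - ox1) / sw,   # object width relative to subject width
        (oy2 - oy1) / sh,   # object height relative to subject height
    ])

def spatial_diversity_features(pair_features, n_clusters=8):
    """Cluster pair features; each pair's cluster centre serves as its diversity feature."""
    pair_features = np.asarray(pair_features)  # needs at least n_clusters rows
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(pair_features)
    return kmeans.cluster_centers_[kmeans.labels_]
```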
3. The method of claim 2, wherein performing entity relationship prediction processing on the bounding boxes and labels of the entities through the trained scene graph generation model to obtain an initial scene graph of the entities comprises:
performing feature vector selection processing on the spatial feature vectors, the category feature vectors, the spatial diversity feature vectors and the category diversity feature vectors of the entity-pair relationships;
and obtaining the initial scene graph of each entity according to the information obtained by the feature vector selection processing and the labels.
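Claim 3 leaves open how the four feature vectors are selected and combined; one simple possibility, shown only as a sketch, is to concatenate them and apply an element-wise gate whose weights would be learned elsewhere. The function and parameter names are assumptions, not part of the disclosure.

```python
# Sketch under assumptions: concatenation plus element-wise gating is one way
# to realise the "feature vector selection processing" of claim 3.
import numpy as np

def select_and_fuse(spatial, category, spatial_div, category_div, gate):
    """Concatenate the four relation feature vectors and weight them with a gate."""
    fused = np.concatenate([spatial, category, spatial_div, category_div])
    return fused * gate  # gate: weights in [0, 1], same length as fused
```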
4. The method of claim 1, wherein the method further comprises:
determining global information of the scene graph to be constructed, wherein the global information comprises information related to a specific scene corresponding to the scene graph to be constructed;
and adding the global information to the initial scene graph to obtain a global scene graph of the image, wherein the global scene graph comprises tags corresponding to entities in the image, a plurality of relationships corresponding to the entities and information of a specific scene corresponding to the image.
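A global scene graph as described in claim 4 can be thought of as the initial scene graph plus a scene-level annotation. The dictionary layout below is an assumption for illustration; the claim does not fix any data format.

```python
# Illustration only: the data layout is assumed, not prescribed by the patent.
def add_global_information(initial_scene_graph, scene_info):
    """Attach scene-level context (e.g. "kitchen", "street") to the initial scene graph."""
    global_scene_graph = dict(initial_scene_graph)
    global_scene_graph["scene"] = scene_info
    return global_scene_graph
```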
5. The method of claim 1, wherein the trained scene graph generation model is trained by:
determining a first scene image data set, and carrying out limited labeling on entities and entity relations in the first scene image data set to obtain a limited image semantic data set of a first scene to be trained, wherein the entity relations at least comprise position relations and interaction relations between the entities;
inputting the limited image semantic data set of the first scene to be trained and a plurality of predetermined scene image semantic data sets containing entities in the first scene into a preset scene graph generation model for training to obtain a plurality of output results, wherein the plurality of output results are obtained by training the preset scene graph generation model a plurality of times;
and comparing the plurality of output results with the limited image semantic data set to obtain a plurality of comparison results, and adjusting model parameters of the preset scene graph generation model according to the plurality of comparison results to obtain the trained scene graph generation model.
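Read procedurally, claim 5 amounts to a supervised training loop over the limited-label data set plus the predetermined related data sets. The PyTorch-style optimiser interface, the epoch count and the names in the sketch below are assumptions; the claim itself only requires repeated training, comparison of the outputs with the limited annotations, and adjustment of model parameters.

```python
# Sketch with an assumed PyTorch-style interface; none of these names come
# from the patent disclosure.
def train_scene_graph_model(model, limited_dataset, extra_datasets,
                            loss_fn, optimiser, epochs=10):
    """Train on the limited-label set plus the predetermined related-scene sets."""
    samples = list(limited_dataset) + [s for d in extra_datasets for s in d]
    for _ in range(epochs):
        for image, annotations in samples:
            output = model(image)                # one "output result" per pass
            loss = loss_fn(output, annotations)  # compare against the limited labels
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
    return model
```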
6. The method of claim 5, wherein adjusting the model parameters of the preset scene graph generation model according to the plurality of comparison results to obtain the trained scene graph generation model comprises:
determining an overall loss function, wherein the overall loss function is obtained by performing weighted calculation on a first loss function determined by performing the entity detection processing on the image and a second loss function determined by performing the entity relationship prediction on the image;
after the preset scene graph generation model is trained, performing a convergence check on the trained preset scene graph generation model through the overall loss function;
and when the trained preset scene graph generation model is determined to be converged, obtaining the trained scene graph generation model.
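The overall loss of claim 6 is a weighted combination of the entity detection loss and the entity relationship prediction loss, roughly L_total = alpha * L_det + beta * L_rel; the weights and the convergence tolerance below are illustrative placeholders, since the claim gives no values. A convergence check could, for example, stop training once this quantity changes by less than a small tolerance between evaluations.

```python
# Illustrative weighting and convergence test; alpha, beta and the tolerance
# are assumptions, not values disclosed in the patent.
def overall_loss(detection_loss, relation_loss, alpha=1.0, beta=1.0):
    """Weighted combination of the detection and relation losses, as in claim 6."""
    return alpha * detection_loss + beta * relation_loss

def has_converged(previous_loss, current_loss, tolerance=1e-4):
    """Simple convergence check on successive evaluations of the overall loss."""
    return abs(previous_loss - current_loss) < tolerance
```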
7. An apparatus for constructing a scene graph based on limited tags, the apparatus comprising:
the acquisition unit is used for acquiring an image of a scene graph to be constructed;
the entity detection processing unit is used for carrying out entity detection processing on the image through the trained scene graph generation model so as to determine a bounding box and a label corresponding to each entity in the image, wherein the entities correspond to people and/or articles appearing in the image, and the label is used for representing identification information of the entity;
and the generating unit is used for performing entity relationship prediction processing on the bounding box and the label of each entity through the trained scene graph generation model to obtain an initial scene graph of each entity, wherein the initial scene graph comprises the label of each entity and a plurality of relationships corresponding to each entity, and the plurality of relationships corresponding to each entity are used for representing association relationships between the entity and other entities except the entity.
8. The apparatus of claim 7, wherein the entity pair relationship prediction unit is specifically configured to:
determining the position information of each entity according to the bounding box corresponding to each entity;
obtaining spatial feature vectors of the entity pairs corresponding to each subject entity among the entities according to position information between the subject entity and the object entities among the entities, wherein, when the spatial feature vectors of the entity pairs corresponding to a first entity are determined, the subject entity in each entity pair is used for representing the first entity in the image, and the object entity in each entity pair is used for representing an entity other than the first entity in the image;
clustering the spatial feature vectors of the entity pairs corresponding to all the subject entities to determine spatial diversity feature vectors of the entity-pair relationships;
performing word embedding processing on the labels of the entities to determine the category label of the subject entity and the category labels of the object entities among the entities, and performing unified vectorization processing on the category label of the subject entity and the category labels of the object entities to obtain category feature vectors of the entity-pair relationships, wherein the category labels are used for representing category attributes of the entities in the image;
and counting the number of entity pairs corresponding to each entity-pair relationship to determine category diversity feature vectors of the entity-pair relationships.
9. A computer device, characterized in that the computer device comprises:
a memory for storing program instructions;
a processor for calling program instructions stored in said memory and for executing the steps comprised in the method of any one of claims 1 to 6 in accordance with the obtained program instructions.
10. A storage medium storing computer-executable instructions for causing a computer to perform the steps comprised in the method of any one of claims 1 to 6.
CN202010206574.XA 2020-03-23 2020-03-23 Method and device for constructing scene graph based on limited labels and computer equipment Active CN111475661B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010206574.XA CN111475661B (en) 2020-03-23 2020-03-23 Method and device for constructing scene graph based on limited labels and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010206574.XA CN111475661B (en) 2020-03-23 2020-03-23 Method and device for constructing scene graph based on limited labels and computer equipment

Publications (2)

Publication Number Publication Date
CN111475661A true CN111475661A (en) 2020-07-31
CN111475661B CN111475661B (en) 2023-07-14

Family

ID=71750250

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010206574.XA Active CN111475661B (en) 2020-03-23 2020-03-23 Method and device for constructing scene graph based on limited labels and computer equipment

Country Status (1)

Country Link
CN (1) CN111475661B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528500A (en) * 2020-12-11 2021-03-19 深兰科技(上海)有限公司 Evaluation method and evaluation equipment for scene graph construction model
CN113554129A (en) * 2021-09-22 2021-10-26 航天宏康智能科技(北京)有限公司 Scene graph generation method and generation device
CN114511779A (en) * 2022-01-20 2022-05-17 电子科技大学 Training method of scene graph generation model, and scene graph generation method and device

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106154223A (en) * 2016-08-20 2016-11-23 西南大学 Indoor navigation method and indoor navigation system
US20170277363A1 (en) * 2015-07-15 2017-09-28 Fyusion, Inc. Automatic tagging of objects on a multi-view interactive digital media representation of a dynamic entity
CN107798136A (en) * 2017-11-23 2018-03-13 北京百度网讯科技有限公司 Entity relation extraction method, apparatus and server based on deep learning
CN108710847A (en) * 2018-05-15 2018-10-26 北京旷视科技有限公司 Scene recognition method, device and electronic equipment
CN109036545A (en) * 2018-05-31 2018-12-18 平安医疗科技有限公司 Medical information processing method, device, computer equipment and storage medium
CN109522407A (en) * 2018-10-26 2019-03-26 平安科技(深圳)有限公司 Business connection prediction technique, device, computer equipment and storage medium
CN109726718A (en) * 2019-01-03 2019-05-07 电子科技大学 A kind of visual scene figure generation system and method based on relationship regularization
CN110675469A (en) * 2019-09-06 2020-01-10 常州大学 Image description method for detecting spatial relationship between targets in construction scene

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170277363A1 (en) * 2015-07-15 2017-09-28 Fyusion, Inc. Automatic tagging of objects on a multi-view interactive digital media representation of a dynamic entity
CN106154223A (en) * 2016-08-20 2016-11-23 西南大学 Indoor navigation method and indoor navigation system
CN107798136A (en) * 2017-11-23 2018-03-13 北京百度网讯科技有限公司 Entity relation extraction method, apparatus and server based on deep learning
CN108710847A (en) * 2018-05-15 2018-10-26 北京旷视科技有限公司 Scene recognition method, device and electronic equipment
CN109036545A (en) * 2018-05-31 2018-12-18 平安医疗科技有限公司 Medical information processing method, device, computer equipment and storage medium
CN109522407A (en) * 2018-10-26 2019-03-26 平安科技(深圳)有限公司 Business connection prediction technique, device, computer equipment and storage medium
CN109726718A (en) * 2019-01-03 2019-05-07 电子科技大学 A kind of visual scene figure generation system and method based on relationship regularization
CN110675469A (en) * 2019-09-06 2020-01-10 常州大学 Image description method for detecting spatial relationship between targets in construction scene

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JI ZHANG et al.: "Graphical Contrastive Losses for Scene Graph Parsing", 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) *
LIU Xiaoxiao: "Research on Image Description Methods Based on Deep Learning", China Master's Theses Full-text Database *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528500A (en) * 2020-12-11 2021-03-19 深兰科技(上海)有限公司 Evaluation method and evaluation equipment for scene graph construction model
CN112528500B (en) * 2020-12-11 2023-08-29 深兰人工智能应用研究院(山东)有限公司 Evaluation method and evaluation equipment for scene graph construction model
CN113554129A (en) * 2021-09-22 2021-10-26 航天宏康智能科技(北京)有限公司 Scene graph generation method and generation device
CN114511779A (en) * 2022-01-20 2022-05-17 电子科技大学 Training method of scene graph generation model, and scene graph generation method and device

Also Published As

Publication number Publication date
CN111475661B (en) 2023-07-14

Similar Documents

Publication Publication Date Title
CN103299324B (en) Potential son is used to mark the mark learnt for video annotation
CN107608964B (en) Live broadcast content screening method, device, equipment and storage medium based on barrage
CN111475661B (en) Method and device for constructing scene graph based on limited labels and computer equipment
US10163036B2 (en) System and method of analyzing images using a hierarchical set of models
US11574123B2 (en) Content analysis utilizing general knowledge base
CN109858024B (en) Word2 vec-based room source word vector training method and device
WO2023024413A1 (en) Information matching method and apparatus, computer device and readable storage medium
US20210279618A1 (en) System and method for building and using learning machines to understand and explain learning machines
CN109388458A (en) Management method, terminal device and the computer readable storage medium of interface control
CN110363206B (en) Clustering of data objects, data processing and data identification method
CN110377733A (en) A kind of text based Emotion identification method, terminal device and medium
WO2022001233A1 (en) Pre-labeling method based on hierarchical transfer learning and related device
Burns et al. Interactive mobile app navigation with uncertain or under-specified natural language commands
WO2021154429A1 (en) Siamese neural networks for flagging training data in text-based machine learning
CN112446214A (en) Method, device and equipment for generating advertisement keywords and storage medium
CN110765917A (en) Active learning method, device, terminal and medium suitable for face recognition model training
CN109960745A (en) Visual classification processing method and processing device, storage medium and electronic equipment
CN113515701A (en) Information recommendation method and device
CN112732891A (en) Office course recommendation method and device, electronic equipment and medium
CN113822521A (en) Method and device for detecting quality of question library questions and storage medium
CN108921216A (en) A kind of processing method, device and the storage medium of image classification model
CN113111713B (en) Image detection method and device, electronic equipment and storage medium
KR102647904B1 (en) Method, system, and computer program for classify place review images based on deep learning
CN111898761B (en) Service model generation method, image processing method, device and electronic equipment
CN116977781A (en) Training set acquisition method, model training method, device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20240510

Address after: Room 6227, No. 999, Changning District, Shanghai 200050

Patentee after: Shenlan robot (Shanghai) Co.,Ltd.

Country or region after: China

Address before: Unit 1001, 369 Weining Road, Changning District, Shanghai, 200336 (9th floor of actual floor)

Patentee before: DEEPBLUE TECHNOLOGY (SHANGHAI) Co.,Ltd.

Country or region before: China