CN118172546A - Model generation method, detection device, electronic equipment, medium and product - Google Patents

Model generation method, detection device, electronic equipment, medium and product

Info

Publication number
CN118172546A
CN118172546A (application CN202410579543.7A)
Authority
CN
China
Prior art keywords
target
feature
picture
detected
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410579543.7A
Other languages
Chinese (zh)
Inventor
姚成辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Suzhou Software Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202410579543.7A priority Critical patent/CN118172546A/en
Publication of CN118172546A publication Critical patent/CN118172546A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a model generation method, a detection method, an apparatus, an electronic device, a medium and a product. The model generation method includes: acquiring sample data, where the sample data includes a sample picture, first description information and label information; extracting features of the sample picture based on an initial model to obtain picture features, and extracting features of the first description information based on the initial model to obtain text features; performing feature fusion on the picture features and the text features based on the initial model to obtain target features; predicting the position of a target to be detected in the sample picture by using the target features based on the initial model to obtain a prediction result; and performing parameter optimization on the initial model based on the prediction result and the label information to obtain a target model. Target detection based on a model generated by the technical scheme of the application is not limited to a preset category range, which helps realize detection of arbitrary targets.

Description

Model generation method, detection device, electronic equipment, medium and product
Technical Field
The application relates to the field of computer technology, and in particular to a model generation method, a detection method, an apparatus, an electronic device, a medium and a product.
Background
Object detection is a core technology in the computer field, with high research and application value in video monitoring, automatic driving, scene understanding and the like; it is used to detect whether targets exist in a given image and to determine information such as their semantic categories and positions. However, conventional target detection methods generally can only detect targets within a predefined category range and cannot detect targets outside that range, which limits the target detection range.
Disclosure of Invention
Embodiments of the present application aim to provide a model generation method, a detection method, an apparatus, an electronic device, a medium and a product, which can solve the problem in the related art that the target detection range of a target detection model is limited.
In a first aspect, an embodiment of the present application provides a method for generating a model, where the method includes:
Acquiring sample data, wherein the sample data comprises a sample picture, first description information and label information, the first description information is description information of an object to be detected in the sample picture, and the label information is used for indicating the position of the object to be detected in the sample picture;
extracting features of the sample picture based on an initial model to obtain picture features, and extracting features of the first description information based on the initial model to obtain text features;
performing feature fusion on the picture features and the text features based on the initial model to obtain target features;
predicting the position of the target to be detected in the sample picture by utilizing the target characteristics based on the initial model to obtain a prediction result;
and carrying out parameter optimization on the initial model based on the prediction result and the label information to obtain a target model.
In a second aspect, an embodiment of the present application provides a target detection method, where the method includes:
Acquiring data to be detected, wherein the data to be detected comprises a picture to be detected and second description information, and the second description information is description information of a target to be detected in the picture to be detected;
Extracting features of the picture to be detected based on a target model to obtain picture features, and extracting features of the second description information based on the target model to obtain text features;
performing feature fusion on the picture features and the text features based on the target model to obtain target features;
And predicting the position of the target to be detected in the picture to be detected by utilizing the target characteristics based on the target model to obtain a prediction result.
In a third aspect, an embodiment of the present application provides a model generating apparatus, including:
A first acquisition module, used for acquiring sample data, wherein the sample data comprises a sample picture, first description information and label information, the first description information is description information of a target to be detected in the sample picture, and the label information is used for indicating the position of the target to be detected in the sample picture;
The first feature extraction module is used for carrying out feature extraction on the sample picture based on an initial model to obtain picture features, and carrying out feature extraction on the first description information based on the initial model to obtain text features;
the first feature fusion module is used for carrying out feature fusion on the picture features and the text features based on the initial model to obtain target features;
The first prediction module is used for predicting the position of the target to be detected in the sample picture by utilizing the target characteristics based on the initial model to obtain a prediction result;
and the optimization module is used for carrying out parameter optimization on the initial model based on the prediction result and the label information to obtain a target model.
In a fourth aspect, an embodiment of the present application provides an object detection apparatus, including:
The second acquisition module is used for acquiring data to be detected, wherein the data to be detected comprises a picture to be detected and second description information, and the second description information is the description information of a target to be detected in the picture to be detected;
the second feature extraction module is used for carrying out feature extraction on the picture to be detected based on a target model to obtain picture features, and carrying out feature extraction on the second description information based on the target model to obtain text features;
the second feature fusion module is used for carrying out feature fusion on the picture features and the text features based on the target model to obtain target features;
And the second prediction module is used for predicting the position of the target to be detected in the picture to be detected by utilizing the target characteristics based on the target model to obtain a prediction result.
In a fifth aspect, an embodiment of the present application further provides an electronic device, where the electronic device includes a processor, a memory, and a program stored on the memory and executable on the processor, where the program is executed by the processor to implement the steps of the method according to the first or second aspect.
In a sixth aspect, embodiments of the present application also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method according to the first or second aspect.
In a seventh aspect, embodiments of the present application also provide a computer program product comprising computer instructions which, when executed by a processor, implement the steps of the method according to the first or second aspect.
In the embodiment of the application, the generated target model can identify the target to be detected in the received picture according to the description information input by the user, and the user can self-define the target to be detected through the description information, so that the target detection range of the target model is not limited by the preset category range, thereby being beneficial to realizing the detection of any target.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort to a person of ordinary skill in the art.
FIG. 1 is a schematic flow chart of a model generation method according to an embodiment of the present application;
FIG. 2 is a second schematic flow chart of a model generation method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a target detection method according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a model generation device according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a target detection device according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions of the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which are obtained by a person skilled in the art based on the embodiments of the present application, fall within the scope of protection of the present application.
The terms "first", "second" and the like in the description and claims are used to distinguish between similar objects and are not necessarily used to describe a particular sequence or chronological order. It should be understood that the terms so used may be interchanged where appropriate, so that embodiments of the present application can be implemented in orders other than those illustrated or described herein. Moreover, objects distinguished by "first", "second" and the like are generally of one type, and the number of objects is not limited; for example, the first object may be one or more than one. In addition, in the description and claims, "and/or" means at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
The method provided by the embodiment of the application is described in detail through specific embodiments and application scenes thereof with reference to the accompanying drawings.
As shown in fig. 1, a flow chart of a model generating method according to an embodiment of the present application is provided, where the model generating method includes the following steps:
Step 101, obtaining sample data, wherein the sample data comprises a sample picture, first description information and label information, the first description information is description information of a target to be detected in the sample picture, and the label information is used for indicating the position of the target to be detected in the sample picture;
Step 102, extracting features of the sample picture based on an initial model to obtain picture features, and extracting features of the first description information based on the initial model to obtain text features;
step 103, carrying out feature fusion on the picture features and the text features based on the initial model to obtain target features;
Step 104, predicting the position of the target to be detected in the sample picture by utilizing the target characteristics based on the initial model to obtain a prediction result;
Step 105, performing parameter optimization on the initial model based on the prediction result and the label information to obtain a target model.
The sample picture can be a picture obtained in various scenes, for example, a monitoring picture in a monitoring scene, or an off-vehicle real-time picture taken by a vehicle-mounted camera in the running process of the vehicle, etc.
The first description information may be text description information, customized by a user, of the target to be detected. For example, when the sample picture is taken by monitoring a construction site, the first description information may be "detect a person wearing a safety helmet", in which case the target to be detected is a person wearing a safety helmet in the sample picture; or the first description information may be "detect a person not wearing a safety helmet", in which case the target to be detected is a person not wearing a safety helmet in the sample picture. As another example, when the sample picture is a real-time picture outside a vehicle taken by a vehicle-mounted camera while the vehicle is running, the first description information may be "detect a vehicle", in which case the target to be detected is a vehicle in the sample picture.
The label information may include the position, in the sample picture, of the target to be detected described by the first description information. For example, when the sample picture is taken by monitoring a construction site and the first description information is "detect a person wearing a safety helmet", the label information may include the position of the person wearing a safety helmet in the sample picture, and may take the form (x, y, w, h), where x and y are the coordinates of the center point of the target and w and h are the width and height of the target frame corresponding to the target.
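Purely as an illustration (the field names below are assumptions of this description, not defined by the application), one piece of sample data in the (x, y, w, h) label form could be organized as follows:

```python
# Hypothetical layout of one piece of sample data; field names are illustrative only.
sample = {
    "sample_picture": "construction_site_0001.jpg",                  # monitoring-scene picture
    "first_description": "detect a person wearing a safety helmet",  # first description information
    "label": [
        {"x": 320.0, "y": 180.0, "w": 64.0, "h": 128.0},  # centre coordinates, frame width/height
        {"x": 512.0, "y": 200.0, "w": 60.0, "h": 120.0},
    ],
}
```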
Referring to fig. 2, in some embodiments of the present application, the initial model may include a feature extraction (backbone) module, an encoder module and a decoder module, where the backbone module is configured to perform step 102 above, the encoder module is configured to perform step 103 above, and the decoder module is configured to perform step 104 above.
The backbone may be composed of a visual encoder and a text encoder. The visual encoder is used to extract features from the sample picture to obtain the picture features; in order to accommodate targets of different sizes, the visual encoder may use a ResNet backbone to extract multi-scale features. The text encoder is used to extract features from the first description information to obtain the text features; the text encoder may use BERT to extract text features, where the text may include words, sentences and the like. An attention mask is introduced to mark the detection subject, so that targets can be distinguished when several detection subjects appear in the first description information. Specifically, the text encoder marks the target to be detected in the first description information through the attention mask, that is, the text features may include marking information of the target to be detected, so that the subsequent model can detect the position of the target to be detected. For example, when the first description information is "detect the pen on the desk", the first description information contains two detection subjects, the desk and the pen; the attention mask can mark the detection subject as the pen, i.e. the target to be detected is the pen.
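A minimal sketch of such a two-branch feature extractor is given below. It assumes PyTorch, torchvision and Hugging Face transformers; the class and variable names are illustrative, and only the standard padding attention mask is shown (the application additionally uses an attention mask to mark the detection subject, which is not reproduced here):

```python
import torch
from torch import nn
from torchvision.models import resnet50
from transformers import BertModel, BertTokenizer

class TwoBranchBackbone(nn.Module):
    """Illustrative visual + text feature extractor (names are assumptions)."""
    def __init__(self):
        super().__init__()
        resnet = resnet50(weights=None)
        # keep intermediate stages so multi-scale feature maps can be returned
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.layer1, self.layer2 = resnet.layer1, resnet.layer2
        self.layer3, self.layer4 = resnet.layer3, resnet.layer4
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")

    def forward(self, pictures, descriptions):
        x = self.stem(pictures)
        c2 = self.layer1(x)
        c3 = self.layer2(c2)
        c4 = self.layer3(c3)
        c5 = self.layer4(c4)   # three feature maps of decreasing scale
        tokens = self.tokenizer(descriptions, return_tensors="pt", padding=True)
        # tokens.attention_mask here only masks padding; marking the detection subject
        # with a task-specific attention mask would be added on top of this.
        text_features = self.text_encoder(**tokens).last_hidden_state
        return [c3, c4, c5], text_features
```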
The above feature fusion of the picture feature and the text feature may be performed by using various feature fusion methods, for example, a cross-attention mechanism "cross attention" may be used to fuse the picture feature and the text feature.
The prediction result may include a predicted position of the target to be detected in the sample picture, so that a loss of the initial model may be determined according to the predicted position and the real position indicated by the tag information, and parameter optimization may be performed on the initial model based on the loss of the initial model, to obtain the target model.
It will be appreciated that the above embodiment uses only one piece of sample data to illustrate the training process of the initial model. In practice, a large number of different pieces of sample data may be generated, and the initial model may be iteratively trained with each piece of sample data according to the method in the above embodiment until the model converges, yielding the target model. Whether the model has converged may be determined with a test set: for example, the model is considered converged when the accuracy of the trained model on the test data in the test set is higher than a target threshold, which may be a relatively high value such as 95% or 98%.
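A schematic training loop for this iterative procedure is sketched below; the helper names (compute_loss, evaluate), the threshold and the epoch limit are assumptions, not details fixed by the application:

```python
# Schematic iterative training loop; compute_loss and evaluate are assumed helpers.
def train_target_model(model, optimizer, train_loader, test_loader,
                       compute_loss, evaluate, target_threshold=0.95, max_epochs=100):
    for epoch in range(max_epochs):
        model.train()
        for pictures, descriptions, labels in train_loader:
            prediction = model(pictures, descriptions)   # predicted positions (step 104)
            loss = compute_loss(prediction, labels)      # compare with label information (step 105)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        accuracy = evaluate(model, test_loader)          # accuracy on the test set
        if accuracy > target_threshold:                  # e.g. 95% or 98%
            return model                                 # considered converged: the target model
    return model
```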
In this embodiment, the generated target model may identify the target to be detected in the received picture according to the description information input by the user, and since the user may describe the target to be detected through the description information, the range of the target model for detecting the target is not limited by the preset category range, thereby being beneficial to implementing the detection of any target.
Optionally, the picture feature includes M different-scale first feature vectors, where the M different-scale first feature vectors are: generating feature vectors based on M feature images with different scales corresponding to the sample image, wherein the M first feature vectors with different scales are in one-to-one correspondence with the M feature images with different scales, the text features comprise second feature vectors, and M is an integer larger than 1;
The feature fusion is carried out on the picture feature and the text feature based on the initial model to obtain a target feature, and the method comprises the following steps:
Respectively carrying out feature fusion on the second feature vector and each first feature vector based on a cross attention mechanism to obtain M third feature vectors;
and carrying out feature fusion on the M third feature vectors to obtain the target feature.
The M first feature vectors of different scales may be multi-scale features extracted by the visual encoder using a ResNet backbone. Specifically, the sample picture can be scaled to different resolutions, and feature extraction can be performed on the pictures at the different resolutions, so as to obtain picture features of different scales.
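One possible reading of this image-pyramid extraction is sketched below (a minimal PyTorch sketch; the scale factors and the visual_encoder interface are assumptions):

```python
import torch.nn.functional as F

def multi_scale_picture_features(picture, visual_encoder, scales=(1.0, 0.5, 0.25)):
    """Illustrative image-pyramid extraction; picture is an (N, C, H, W) tensor."""
    features = []
    for s in scales:
        # rescale the sample picture to a different resolution, then extract features from it
        resized = F.interpolate(picture, scale_factor=s, mode="bilinear", align_corners=False)
        features.append(visual_encoder(resized))  # one first feature vector per scale
    return features
```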
Performing feature fusion on the second feature vector and each first feature vector based on the cross-attention mechanism to obtain M third feature vectors specifically means fusing the second feature vector with each first feature vector separately to obtain the corresponding third feature vector, so that the M third feature vectors correspond one-to-one to the M first feature vectors. For ease of understanding, the process of fusing the second feature vector with each first feature vector based on the cross-attention mechanism is further explained below, taking the fusion of the second feature vector with the t-th first feature vector as an example:
The t-th first feature vector F_t is projected onto a query matrix to obtain Q_t = F_t W^Q, wherein the t-th first feature vector may be any one of the M first feature vectors of different scales;
the second feature vector T is projected onto a key matrix to obtain K = T W^K; at the same time, the second feature vector T is projected onto a value matrix to obtain V = T W^V, wherein W^Q, W^K and W^V are linear projection matrices obtained through learning. The main calculation of cross attention is as follows:

M = softmax(Q_t K^T / √d) V
Here the softmax term gives the attention weights assigned to the pixel feature vectors, M is the cross-attention representation of the picture features and the text features, and d is the projection dimension of the key and the query. The main idea of the feature fusion is to scale the text sequence and the image sequence to the same dimension and, for the image features of each scale, use cross attention to assign a weight to every element of the image feature sequence by computing the degrees of association among Q, K and V. Cross attention is mainly used to capture the associations between the text sequence and different positions in the visual sequence and to establish the dependency between the text and the visual features, thereby achieving feature enhancement and outputting multi-modal features; in this way, multi-modal features of different scales, namely the M third feature vectors, are obtained.
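A minimal PyTorch sketch of this per-scale cross-attention fusion is given below; it follows the Q/K/V projections above, while the residual addition and all names are assumptions rather than details fixed by the application:

```python
import math
import torch
from torch import nn

class TextGuidedCrossAttention(nn.Module):
    """Fuses one scale of picture features (queries) with the text features (keys/values)."""
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim)  # W^Q, applied to the t-th first feature vector
        self.w_k = nn.Linear(dim, dim)  # W^K, applied to the second (text) feature vector
        self.w_v = nn.Linear(dim, dim)  # W^V, applied to the second (text) feature vector
        self.dim = dim

    def forward(self, picture_feat, text_feat):
        # picture_feat: (num_pixels, dim); text_feat: (num_tokens, dim)
        q = self.w_q(picture_feat)
        k = self.w_k(text_feat)
        v = self.w_v(text_feat)
        weights = torch.softmax(q @ k.T / math.sqrt(self.dim), dim=-1)  # attention weights
        m = weights @ v                  # cross-attention representation M
        return m + picture_feat          # text-enhanced picture features (a third feature vector)

# Applying the same module to every scale yields the M third feature vectors:
# fuse = TextGuidedCrossAttention(dim=256)
# third_vectors = [fuse(f, text_feat) for f in first_vectors_per_scale]
```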
In the embodiment, in order to enable the text information to better guide target detection, feature enhancement is performed on the picture features with different scales, namely cross attention is adopted to perform cross-modal feature fusion on the picture features with different scales by utilizing the text features, so that the effect of feature fusion is improved.
Optionally, the feature fusion is performed on the M third feature vectors to obtain the target feature, including:
Performing M-1 iterative fusion on the M third feature vectors according to a preset sequence to obtain the target feature, wherein the preset sequence is a sequence of sequencing the M third feature vectors from high to low according to the scale of the corresponding feature map; the ith fusion of the M-1 iterative fusion comprises:
Downsampling an i-th vector to be fused to obtain an i-th sampling vector, wherein the i-th vector to be fused is the 1st third feature vector in the preset sequence in the case that i is equal to 1; in the case that i is greater than 1, the i-th vector to be fused is the (i-1)-th fusion vector; the feature dimension of the i-th sampling vector is the same as the feature dimension of the (i+1)-th third feature vector in the preset sequence, wherein i is a positive integer smaller than M;
And carrying out feature fusion on the i-th sampling vector and the (i+1)-th third feature vector to obtain an i-th fusion vector, wherein the target feature is the (M-1)-th fusion vector.
The scale of a feature map refers to its size, that is, the size of the image obtained after scaling the sample picture. For example, referring to fig. 2, after the visual encoder (vision) performs feature extraction on the sample picture, three feature maps of different scales are obtained, with the scale decreasing gradually from left to right. For ease of understanding, the feature fusion process between the picture features and the text features is explained below using the embodiment shown in fig. 2 as an example, where the value of M is 3 and the three feature maps from left to right in fig. 2 are denoted respectively as the 1st feature map, the 2nd feature map and the 3rd feature map; the 1st feature map has a higher scale than the 2nd feature map, and the 2nd feature map has a higher scale than the 3rd feature map. The process of feature fusion between the picture features and the text features comprises the following steps:
Based on cross attention, feature fusion is carried out on the second feature vector, obtained by the text encoder extracting features from the first description information, and the first feature vector corresponding to the 1st feature map to obtain the 1st third feature vector; based on cross attention, feature fusion is carried out on the second feature vector and the first feature vector corresponding to the 2nd feature map to obtain the 2nd third feature vector; based on cross attention, feature fusion is carried out on the second feature vector and the first feature vector corresponding to the 3rd feature map to obtain the 3rd third feature vector;
The 1st third feature vector is downsampled to obtain the 1st sampling vector, wherein the feature dimension of the 1st sampling vector is the same as the feature dimension of the 2nd third feature vector; then, the 1st sampling vector and the 2nd third feature vector are fused, for example by addition, to obtain the 1st fusion vector;
Then, the 1st fusion vector is downsampled to obtain the 2nd sampling vector, wherein the feature dimension of the 2nd sampling vector is the same as the feature dimension of the 3rd third feature vector; then, the 2nd sampling vector is fused with the 3rd third feature vector, for example by addition, to obtain the 2nd fusion vector; and the 2nd fusion vector is taken as the target feature.
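The downsample-and-add fusion of this example can be sketched as follows (a minimal sketch assuming PyTorch tensors of shape (C, H, W) sorted from the highest scale to the lowest; bilinear downsampling is an assumption, the application only requires matching feature dimensions):

```python
import torch
import torch.nn.functional as F

def iterative_scale_fusion(third_feature_maps):
    """Performs the M-1 iterative fusions on the per-scale fused features."""
    fused = third_feature_maps[0]                      # 1st third feature vector (highest scale)
    for nxt in third_feature_maps[1:]:
        # downsample the running result so its feature dimension matches the next map
        fused = F.interpolate(fused.unsqueeze(0), size=nxt.shape[-2:],
                              mode="bilinear", align_corners=False).squeeze(0)
        fused = fused + nxt                            # element-wise addition, as in the example
    return fused                                       # target feature: the (M-1)-th fusion vector
```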
In this embodiment, after the multi-modal features of different scales are obtained, feature fusion is performed by downsampling from the high scale to the low scale, that is, the attention results for targets of various scales are fused, which avoids neglecting shallow feature information. Fusing high-level and low-level features improves the target detection effect, the fused target features contain a large amount of local information, and the result of the encoder part is finally output through a feedforward neural network, thereby improving the effect of feature fusion.
Optionally, the prediction result includes a predicted position of the target to be detected in the sample picture, and the parameter optimization is performed on the initial model based on the prediction result and the tag information to obtain a target model, including:
Determining a first loss based on the predicted position and the label information, and determining a second loss based on the image feature corresponding to the predicted position and the text feature, wherein the image feature corresponding to the predicted position is a feature obtained by performing feature extraction on the predicted position in the sample picture;
and carrying out parameter optimization on the initial model based on the first loss and the second loss to obtain a target model.
The first loss may be a target localization loss, and the second loss may be a classification loss. In the process of calculating the second loss, the text features can be obtained through the backbone, and the similarity between the text features and the region-box features is then calculated to obtain a matching score, where the matching score is used to represent the second loss and the region-box features are features obtained by performing feature extraction on the predicted position in the sample picture.
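A possible concrete form of the two losses is sketched below; the L1 localization loss and the cosine-similarity matching score are assumptions, as the application only states that a localization loss and a similarity-based classification loss are used:

```python
import torch
import torch.nn.functional as F

def detection_loss(pred_boxes, gt_boxes, region_box_features, text_feature):
    """Illustrative first (localization) and second (classification) losses.

    pred_boxes, gt_boxes: (num_targets, 4) tensors in (x, y, w, h) form.
    region_box_features:  (num_targets, dim) features extracted at the predicted positions.
    text_feature:         (dim,) pooled text feature of the description.
    """
    first_loss = F.l1_loss(pred_boxes, gt_boxes)                      # localization loss
    scores = F.cosine_similarity(region_box_features,                 # matching scores between
                                 text_feature.unsqueeze(0), dim=-1)   # region boxes and text
    second_loss = (1.0 - scores).mean()                               # higher match -> lower loss
    return first_loss + second_loss
```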
The above-mentioned parameter optimization of the initial model may specifically include optimizing the learned projection matrices W^Q, W^K and W^V used in computing M.
In the embodiment, the initial model is optimized based on the first loss and the second loss, so that accuracy of recognition of the position and the category of the target by the target model obtained through training is improved.
Optionally, the predicting, based on the initial model, the position of the target to be detected in the sample picture by using the target feature, to obtain a prediction result, includes:
And predicting the position of the target to be detected in the sample picture based on the initial model by utilizing the target characteristics and the target query vector to obtain a prediction result, wherein the number of dimensions of the target query vector is larger than a preset threshold, and the preset threshold is the maximum number of targets to be detected included in the sample picture.
The target query vector may be represented as an object query, where the object query may be an initialized query vector, i.e., all values in the object query are 0.
Specifically, the target features output by the encoder and the object queries can be input into a decoder (Transformer decoder), and the Transformer decoder output is finally sent to a feedforward neural network to predict the target class and bounding box. The decoder part outputs N predictions, each containing the class and the absolute centre-coordinate information of one possible target; if the number of targets in the sample picture is less than N, the redundant object-query positions are left empty. Post-processing steps of traditional detection algorithms that rely on manual priors, such as NMS and anchor generation, are completely removed. Here N is the number of object queries (target query vectors).
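A minimal PyTorch sketch of such a query-driven decoder head is given below; the layer sizes, the zero initialisation of the queries and the two prediction heads are illustrative assumptions, with N chosen larger than the maximum number of targets expected in a picture:

```python
import torch
from torch import nn

class QueryDecoder(nn.Module):
    """Illustrative Transformer decoder driven by N object queries."""
    def __init__(self, dim=256, num_queries=100):
        super().__init__()
        self.object_queries = nn.Parameter(torch.zeros(num_queries, dim))  # all zeros initially
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.box_head = nn.Linear(dim, 4)    # centre coordinates plus width and height
        self.score_head = nn.Linear(dim, 1)  # confidence that a query slot contains a target

    def forward(self, target_features):
        # target_features: (batch, seq_len, dim) flattened fused features from the encoder
        queries = self.object_queries.unsqueeze(0).expand(target_features.size(0), -1, -1)
        decoded = self.decoder(queries, target_features)   # (batch, N, dim)
        boxes = self.box_head(decoded).sigmoid()           # N predicted boxes
        scores = self.score_head(decoded)                  # low-score slots are treated as empty
        return boxes, scores
```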
In this embodiment, the position of the target to be detected in the sample picture is predicted by using the target feature and the target query vector based on the initial model to obtain a prediction result, so that a detection result is obtained without relying on a post-processing step of manual priori, thereby being beneficial to simplifying the target detection process.
In summary, the method provided by the embodiment of the application has at least the following beneficial effects:
Text information is introduced into target detection, a relationship between vision and text is established, and the model is trained with image-text pairs, so that the trained target model can identify targets of any category according to a text prompt.
Deeper fusion is introduced between image features and text features: multi-scale deep cross-modal fusion is adopted in the encoding part, which strengthens the cross-modal capability of the model and improves the ability to detect targets of various sizes.
Referring to fig. 3, fig. 3 is a schematic diagram of a target detection method according to an embodiment of the present application, where the method includes the following steps:
Step 301, obtaining data to be detected, wherein the data to be detected comprises a picture to be detected and second description information, and the second description information is description information of a target to be detected in the picture to be detected;
Step 302, extracting features of the picture to be detected based on a target model to obtain picture features, and extracting features of the second description information based on the target model to obtain text features;
Step 303, carrying out feature fusion on the picture features and the text features based on the target model to obtain target features;
and step 304, predicting the position of the target to be detected in the picture to be detected by utilizing the target characteristics based on the target model to obtain a prediction result.
It can be understood that the target model is a model trained based on the model generation method in the above embodiment.
The picture to be detected may be a picture including various types of objects to be detected. The second description information may be text description information of the target to be detected, which is defined by the user, and the picture to be detected includes the target to be detected described by the second description information.
The implementation process of step 302 is similar to that of step 102 in the above embodiment, and is not repeated here. The implementation process of step 303 is similar to that of step 103 in the above embodiment, and is not repeated here. The implementation process of step 304 is similar to that of step 104 in the above embodiment, and is not repeated here.
The implementation is similar to the above embodiment in the specific implementation process, and has the same beneficial effects, so that repetition is avoided and no detailed description is given here.
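For intuition only, inference with a trained target model could look like the sketch below; the file name, input size and model interface are assumptions carried over from the earlier sketches, not details fixed by the application:

```python
import torch
from PIL import Image
from torchvision import transforms

def detect(target_model, picture_path, second_description):
    """Runs the trained target model on one picture to be detected and one description."""
    preprocess = transforms.Compose([transforms.Resize((640, 640)), transforms.ToTensor()])
    picture = preprocess(Image.open(picture_path).convert("RGB")).unsqueeze(0)
    target_model.eval()
    with torch.no_grad():
        boxes, scores = target_model(picture, second_description)  # predicted positions
    return boxes, scores

# e.g. detect(model, "factory_gate.jpg", "detect a person not wearing a safety helmet")
```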
Referring to fig. 4, fig. 4 is a schematic structural diagram of a model generating apparatus 400 according to an embodiment of the present application, where the apparatus includes:
A first obtaining module 401, configured to obtain sample data, where the sample data includes a sample picture, first description information and tag information, the first description information is description information of an object to be detected in the sample picture, and the tag information is used to indicate a position of the object to be detected in the sample picture;
A first feature extraction module 402, configured to perform feature extraction on the sample picture based on an initial model to obtain a picture feature, and perform feature extraction on the first description information based on the initial model to obtain a text feature;
A first feature fusion module 403, configured to perform feature fusion on the picture feature and the text feature based on the initial model, so as to obtain a target feature;
A first prediction module 404, configured to predict, based on the initial model, a position of the target to be detected in the sample picture by using the target feature, so as to obtain a prediction result;
and the optimizing module 405 is configured to perform parameter optimization on the initial model based on the prediction result and the tag information, so as to obtain a target model.
Optionally, the picture feature includes M different-scale first feature vectors, where the M different-scale first feature vectors are: generating feature vectors based on M feature images with different scales corresponding to the sample image, wherein the M first feature vectors with different scales are in one-to-one correspondence with the M feature images with different scales, the text features comprise second feature vectors, and M is an integer larger than 1; the first feature fusion module 403 is specifically configured to perform feature fusion on the second feature vector and each first feature vector based on a cross attention mechanism, so as to obtain M third feature vectors;
The first feature fusion module 403 is further configured to perform feature fusion on the M third feature vectors to obtain the target feature.
Optionally, the first feature fusion module 403 is specifically configured to perform M-1 iterative fusion on the M third feature vectors according to a preset sequence, so as to obtain the target feature, where the preset sequence is a sequence in which the M third feature vectors are ordered from high to low according to a scale of the corresponding feature map; the ith fusion of the M-1 iterative fusion comprises:
Downsampling an i-th vector to be fused to obtain an i-th sampling vector, wherein the i-th vector to be fused is the 1st third feature vector in the preset sequence in the case that i is equal to 1; in the case that i is greater than 1, the i-th vector to be fused is the (i-1)-th fusion vector; the feature dimension of the i-th sampling vector is the same as the feature dimension of the (i+1)-th third feature vector in the preset sequence, wherein i is a positive integer smaller than M;
And carrying out feature fusion on the i-th sampling vector and the (i+1)-th third feature vector to obtain an i-th fusion vector, wherein the target feature is the (M-1)-th fusion vector.
Optionally, the prediction result includes a predicted position of the target to be detected in the sample picture, and the optimizing module 405 includes:
The determining submodule is used for determining a first loss based on the predicted position and the label information, and determining a second loss based on the image feature corresponding to the predicted position and the text feature, wherein the image feature corresponding to the predicted position is a feature obtained by performing feature extraction on the predicted position in the sample picture;
And the optimization sub-module is used for carrying out parameter optimization on the initial model based on the first loss and the second loss to obtain a target model.
Optionally, the first prediction module 404 is specifically configured to predict, based on the initial model, a position of the target to be detected in the sample picture by using the target feature and a target query vector, to obtain a prediction result, where the number of dimensions of the target query vector is greater than a preset threshold, and the preset threshold is a maximum number of targets to be detected included in the sample picture.
It should be noted that, the model generating device 400 provided in the embodiment of the present application can implement all technical processes of the model generating method shown in the embodiment of fig. 1, and achieve the same technical effects, and in order to avoid repetition, a detailed description is omitted here.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an object detection device 500 according to an embodiment of the present application, where the device includes:
A second obtaining module 501, configured to obtain data to be detected, where the data to be detected includes a picture to be detected and second description information, where the second description information is description information of a target to be detected in the picture to be detected;
The second feature extraction module 502 is configured to perform feature extraction on the to-be-detected picture based on a target model to obtain a picture feature, and perform feature extraction on the second description information based on the target model to obtain a text feature;
a second feature fusion module 503, configured to perform feature fusion on the picture feature and the text feature based on the target model, so as to obtain a target feature;
and the second prediction module 504 is configured to predict, based on the target model, a position of the target to be detected in the picture to be detected by using the target feature, so as to obtain a prediction result.
It should be noted that, the object detection apparatus 500 provided in the embodiment of the present application can implement all the technical processes of the object detection method as shown in the embodiment of fig. 3, and achieve the same technical effects, and is not repeated herein.
An embodiment of the present application further provides an electronic device, which includes a processor, a memory, and a program stored on the memory and executable on the processor. When the program is executed by the processor, it implements the model generation method shown in fig. 1 or the processes of the target detection method embodiment shown in fig. 3, and the same technical effects can be achieved; to avoid repetition, details are not repeated here.
Specifically, referring to fig. 6, an embodiment of the present application further provides an electronic device, including a bus 601, a transceiver 602, an antenna 603, a bus interface 604, a processor 605, and a memory 606.
In this embodiment, the electronic device further includes: a computer program stored on the memory 606 and executable on the processor 605. The computer program may implement the model generating method as shown in the embodiment of fig. 1 or implement the respective processes of the target detection method shown in fig. 3 when executed by the processor 605, and may achieve the same technical effects, and for avoiding repetition, a detailed description is omitted here.
Fig. 6 shows a bus architecture (represented by bus 601). The bus 601 may include any number of interconnected buses and bridges, and links together various circuits, including one or more processors represented by processor 605 and memory represented by memory 606. The bus 601 may also link together various other circuits such as peripheral devices, voltage regulators and power management circuits, which are well known in the art and therefore are not described further herein. Bus interface 604 provides an interface between bus 601 and transceiver 602. The transceiver 602 may be one element or multiple elements, such as multiple receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. Data processed by the processor 605 is transmitted over a wireless medium via the antenna 603; the antenna 603 also receives data and transmits the data to the processor 605.
The processor 605 is responsible for managing the bus 601 and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And memory 606 may be used to store data used by processor 605 in performing operations.
Alternatively, the processor 605 may be a CPU, an ASIC, an FPGA or a CPLD.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, it implements each process of the method embodiment shown in fig. 1 or fig. 3 and achieves the same technical effects, which are not repeated here. The computer-readable storage medium is, for example, a ROM, a RAM, a magnetic disk or an optical disk.
The embodiment of the present application further provides a computer program product, which includes computer instructions, where the computer instructions, when executed by a processor, implement each process of the method embodiment shown in fig. 1 or fig. 3 and achieve the same technical effects, and are not repeated herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are to be protected by the present application.

Claims (11)

1. A method of generating a model, the method comprising:
Acquiring sample data, wherein the sample data comprises a sample picture, first description information and label information, the first description information is description information of an object to be detected in the sample picture, and the label information is used for indicating the position of the object to be detected in the sample picture;
extracting features of the sample picture based on an initial model to obtain picture features, and extracting features of the first description information based on the initial model to obtain text features;
performing feature fusion on the picture features and the text features based on the initial model to obtain target features;
predicting the position of the target to be detected in the sample picture by utilizing the target characteristics based on the initial model to obtain a prediction result;
and carrying out parameter optimization on the initial model based on the prediction result and the label information to obtain a target model.
2. The method of claim 1, wherein the picture feature comprises M different scale first feature vectors, the M different scale first feature vectors being: generating feature vectors based on M feature images with different scales corresponding to the sample image, wherein the M first feature vectors with different scales are in one-to-one correspondence with the M feature images with different scales, the text features comprise second feature vectors, and M is an integer larger than 1;
The feature fusion is carried out on the picture feature and the text feature based on the initial model to obtain a target feature, and the method comprises the following steps:
Respectively carrying out feature fusion on the second feature vector and each first feature vector based on a cross attention mechanism to obtain M third feature vectors;
and carrying out feature fusion on the M third feature vectors to obtain the target feature.
3. The method according to claim 2, wherein the feature fusion of the M third feature vectors to obtain the target feature includes:
Performing M-1 iterative fusion on the M third feature vectors according to a preset sequence to obtain the target feature, wherein the preset sequence is a sequence of sequencing the M third feature vectors from high to low according to the scale of the corresponding feature map; the ith fusion of the M-1 iterative fusion comprises:
Downsampling an i-th vector to be fused to obtain an i-th sampling vector, wherein the i-th vector to be fused is the 1st third feature vector in the preset sequence in the case that i is equal to 1; in the case that i is greater than 1, the i-th vector to be fused is the (i-1)-th fusion vector; the feature dimension of the i-th sampling vector is the same as the feature dimension of the (i+1)-th third feature vector in the preset sequence, wherein i is a positive integer smaller than M;
And carrying out feature fusion on the i-th sampling vector and the (i+1)-th third feature vector to obtain an i-th fusion vector, wherein the target feature is the (M-1)-th fusion vector.
4. A method according to claim 3, wherein the prediction result includes a predicted position of the object to be detected in the sample picture, and the performing parameter optimization on the initial model based on the prediction result and the tag information to obtain an object model includes:
Determining a first loss based on the predicted position and the label information, and determining a second loss based on the image feature corresponding to the predicted position and the text feature, wherein the image feature corresponding to the predicted position is a feature obtained by performing feature extraction on the predicted position in the sample picture;
and carrying out parameter optimization on the initial model based on the first loss and the second loss to obtain a target model.
5. The method according to claim 1, wherein predicting the position of the target to be detected in the sample picture based on the initial model by using the target feature to obtain a prediction result includes:
And predicting the position of the target to be detected in the sample picture based on the initial model by utilizing the target characteristics and the target query vector to obtain a prediction result, wherein the number of dimensions of the target query vector is larger than a preset threshold, and the preset threshold is the maximum number of targets to be detected included in the sample picture.
6. A method of target detection, the method comprising:
Acquiring data to be detected, wherein the data to be detected comprises a picture to be detected and second description information, and the second description information is description information of a target to be detected in the picture to be detected;
Extracting features of the picture to be detected based on a target model to obtain picture features, and extracting features of the second description information based on the target model to obtain text features;
performing feature fusion on the picture features and the text features based on the target model to obtain target features;
And predicting the position of the target to be detected in the picture to be detected by utilizing the target characteristics based on the target model to obtain a prediction result.
7. A model generation apparatus, characterized in that the apparatus comprises:
A first acquisition module, used for acquiring sample data, wherein the sample data comprises a sample picture, first description information and label information, the first description information is description information of a target to be detected in the sample picture, and the label information is used for indicating the position of the target to be detected in the sample picture;
The first feature extraction module is used for carrying out feature extraction on the sample picture based on an initial model to obtain picture features, and carrying out feature extraction on the first description information based on the initial model to obtain text features;
the first feature fusion module is used for carrying out feature fusion on the picture features and the text features based on the initial model to obtain target features;
The first prediction module is used for predicting the position of the target to be detected in the sample picture by utilizing the target characteristics based on the initial model to obtain a prediction result;
and the optimization module is used for carrying out parameter optimization on the initial model based on the prediction result and the label information to obtain a target model.
8. An object detection device, the device comprising:
The second acquisition module is used for acquiring data to be detected, wherein the data to be detected comprises a picture to be detected and second description information, and the second description information is the description information of a target to be detected in the picture to be detected;
the second feature extraction module is used for carrying out feature extraction on the picture to be detected based on a target model to obtain picture features, and carrying out feature extraction on the second description information based on the target model to obtain text features;
the second feature fusion module is used for carrying out feature fusion on the picture features and the text features based on the target model to obtain target features;
And the second prediction module is used for predicting the position of the target to be detected in the picture to be detected by utilizing the target characteristics based on the target model to obtain a prediction result.
9. An electronic device, comprising: a processor, a memory and a program stored on the memory and executable on the processor, which when executed by the processor, performs the steps of the method according to any one of claims 1 to 6.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
11. A computer program product comprising computer instructions which, when executed by a processor, implement the steps of the method of any of claims 1 to 6.
CN202410579543.7A 2024-05-11 2024-05-11 Model generation method, detection device, electronic equipment, medium and product Pending CN118172546A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410579543.7A CN118172546A (en) 2024-05-11 2024-05-11 Model generation method, detection device, electronic equipment, medium and product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410579543.7A CN118172546A (en) 2024-05-11 2024-05-11 Model generation method, detection device, electronic equipment, medium and product

Publications (1)

Publication Number Publication Date
CN118172546A true CN118172546A (en) 2024-06-11

Family

ID=91358949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410579543.7A Pending CN118172546A (en) 2024-05-11 2024-05-11 Model generation method, detection device, electronic equipment, medium and product

Country Status (1)

Country Link
CN (1) CN118172546A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023134073A1 (en) * 2022-01-11 2023-07-20 平安科技(深圳)有限公司 Artificial intelligence-based image description generation method and apparatus, device, and medium
CN115239765A (en) * 2022-08-02 2022-10-25 合肥工业大学 Infrared image target tracking system and method based on multi-scale deformable attention
CN115797706A (en) * 2023-01-30 2023-03-14 粤港澳大湾区数字经济研究院(福田) Target detection method, target detection model training method and related device
CN116993963A (en) * 2023-09-21 2023-11-03 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium
CN117829243A (en) * 2023-11-20 2024-04-05 科大讯飞华南人工智能研究院(广州)有限公司 Model training method, target detection device, electronic equipment and medium
CN117974971A (en) * 2023-12-22 2024-05-03 中电金信软件有限公司 Target detection method and device, electronic equipment and storage medium
CN117972118A (en) * 2024-02-02 2024-05-03 深圳须弥云图空间科技有限公司 Target retrieval method, target retrieval device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination