CN116608866A - Picture navigation method, device and medium based on multi-scale fine granularity feature fusion - Google Patents

Picture navigation method, device and medium based on multi-scale fine granularity feature fusion

Info

Publication number
CN116608866A
CN116608866A CN202310890318.0A
Authority
CN
China
Prior art keywords
layer
scale fine
fine granularity
visual
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310890318.0A
Other languages
Chinese (zh)
Other versions
CN116608866B (en)
Inventor
谭明奎
孙鑫宇
陈沛豪
樊琚岗
杜卿
陈健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202310890318.0A priority Critical patent/CN116608866B/en
Publication of CN116608866A publication Critical patent/CN116608866A/en
Application granted granted Critical
Publication of CN116608866B publication Critical patent/CN116608866B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20 Instruments for performing navigational calculations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Remote Sensing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Automation & Control Theory (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a picture navigation method, device and medium based on multi-scale fine granularity feature fusion, belonging to the technical field of intelligent navigation. The method comprises the following steps: acquiring a target image of a navigation target position; acquiring the visual observation of an agent at the current moment in the environment; inputting the target image and the visual observation into a multi-scale fine granularity feature fusion module to perform multi-scale fine-grained feature fusion, and outputting the fused visual state feature; and predicting the state of the agent at the next moment according to the visual state feature, so that the agent performs actions according to the state until it reaches the navigation target position. The application utilizes the fine-grained object features contained in the high-resolution hidden-layer activation maps of a deep neural network and uses these features as prompts to guide the visual observation model to pay attention to regions of the current environment that are correlated with the target image in terms of both low-level attributes and high-level semantics, thereby improving the agent's ability to infer and find the target position during the exploration stage.

Description

Picture navigation method, device and medium based on multi-scale fine granularity feature fusion
Technical Field
The application relates to the technical field of intelligent navigation, in particular to a picture navigation method, device and medium based on multi-scale fine granularity feature fusion.
Background
Embodied agents provide an important technical route for improving the cognitive ability of current artificial intelligence and moving toward general intelligence. By interacting with the environment, an agent can obtain real feedback from a real physical space or a virtual digital space and thereby continue to learn and improve. Picture navigation aims to enable an agent to navigate autonomously to a target position specified by a picture. It has gradually received wide attention in recent years, has become one of the research hotspots in embodied intelligence, and has huge potential application value in areas such as autonomous driving and home service robots.
Since the agent needs to actively infer from the target picture the semantic category of the object as well as the spatial relationship and semantic relation between the target position and the surrounding environment, picture navigation is a more challenging task than object navigation and language navigation.
At present, existing methods adopt a map-based modular approach to realize visual language navigation and predict the navigation trajectory by constructing an obstacle map or a topological graph of the environment. In addition, some end-to-end methods use reinforcement learning to train the navigation policy model. However, these picture navigation methods have two main problems: 1) the map-based modular approaches rely heavily on accurate GPS coordinate data and expensive depth image information, which limits their performance in complex unknown environments; 2) the existing end-to-end methods model the target picture and the visual observation separately and ignore the fine-grained description of the target position contained in the target picture, making it difficult for the agent to find the target position during the exploration stage at the beginning of navigation. Therefore, how to guide the agent's visual observation and understanding of the environment using only the detailed attribute information of the target picture, in the absence of additional accurate sensor data, is one of the research hotspots and difficulties of the current picture navigation task.
Disclosure of Invention
In order to solve at least one of the technical problems existing in the prior art to a certain extent, the application aims to provide a picture navigation method, a device and a medium based on multi-scale fine granularity feature fusion.
The technical scheme adopted by the application is as follows:
A picture navigation method based on multi-scale fine granularity feature fusion comprises the following steps:
acquiring a target image of a navigation target position;
acquiring the visual observation of an agent at the current moment in the environment;
inputting the target image and the visual observation into a multi-scale fine granularity feature fusion module to perform multi-scale fine-grained feature fusion, and outputting the fused visual state feature;
predicting the state of the agent at the next moment according to the visual state feature, so that the agent performs actions according to the state until it reaches the navigation target position.
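Purely as an illustration of how these steps fit together, the following Python sketch shows one possible control loop; the names env, fusion_module and policy, and all of their methods, are hypothetical placeholders assumed for the example and are not part of the disclosed implementation.

```python
def navigate(env, fusion_module, policy, max_steps=500):
    """Run one picture-navigation episode (illustrative sketch; all interfaces are assumed)."""
    target_image = env.get_target_image()        # target image of the navigation target position
    hidden = policy.initial_hidden()             # recurrent hidden variable modelling the history
    for _ in range(max_steps):
        observation = env.get_observation()      # visual observation at the current moment
        state_feat = fusion_module(target_image, observation)  # fused visual state feature
        action, hidden = policy.act(state_feat, hidden)        # next-moment state and chosen action
        if action == "stop":                     # the agent believes it has reached the target
            break
        env.step(action)
```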
Further, inputting the target image and the visual observation into the multi-scale fine granularity feature fusion module to perform multi-scale fine-grained feature fusion and outputting the fused visual state feature comprises:
processing the target image using a first multi-layer convolutional neural network to obtain intermediate-layer activation maps of the target image output by different convolutional layers of the network;
processing the visual observation using a second multi-layer convolutional neural network, wherein the first multi-layer convolutional neural network and the second multi-layer convolutional neural network have the same number of layers and the same network structure;
constructing a feature affine transformer that maps the intermediate-layer activation map of the target image into affine transformation coefficients pixel by pixel; obtaining the output of the corresponding layer of the second multi-layer convolutional neural network and applying an affine transformation based on the affine transformation coefficients to obtain the fused intermediate-layer features; taking the fused intermediate-layer features as the input of the next layer of the second multi-layer convolutional neural network;
fusing the output of the last layer of the first multi-layer convolutional neural network with the output of the last layer of the second multi-layer convolutional neural network to obtain the visual state feature.
Further, the expression of the fused intermediate-layer feature is:
F̂^(l)_c = γ^(l)_c · F^(l)_c + β^(l)_c, with γ^(l) = f(A^(l)) and β^(l) = h(A^(l))
wherein F̂^(l) is the fused intermediate-layer feature output by the l-th layer, F^(l) is the output of the l-th layer of the second multi-layer convolutional neural network, A^(l) is the intermediate-layer activation map of the target image from the corresponding layer of the first multi-layer convolutional neural network, γ^(l) and β^(l) are the affine transformation coefficients of the l-th layer, f and h are the functions that map the condition variable to the affine factors, and the subscript c denotes the feature in the c-th dimension.
Further, the expression of the visual state feature s_t is:
s_t = Φ(o_t | I_g)
wherein Φ(· | I_g) denotes the observation model at the current time step conditioned on the target image I_g, and o_t is the visual observation of the agent at the current moment.
Further, the multi-scale fine granularity feature fusion module adopts a conditional transformation convolution layer (FiLM) as the basic unit of the fusion operation.
Further, predicting the state of the agent at the next moment according to the visual state feature comprises:
inputting the fused visual state feature into a recurrent neural network, and obtaining the state feature of the next moment by combining it with the state predictions of historical moments.
Further, the agent performing actions according to the state until reaching the navigation target position comprises:
passing the state through a fully connected layer to predict a distribution over executable actions and selecting the optimal action from the predicted distribution; the agent performs this action and eventually reaches the position indicated by the target image.
Another technical scheme adopted by the application is as follows:
A picture navigation system based on multi-scale fine granularity feature fusion, comprising:
a target acquisition module for acquiring a target image of the navigation target position;
an observation acquisition module for acquiring the visual observation of the agent at the current moment in the environment;
a feature fusion module for inputting the target image and the visual observation into a multi-scale fine granularity feature fusion module to perform multi-scale fine-grained feature fusion and outputting the fused visual state feature;
a state prediction module for predicting the state of the agent at the next moment according to the visual state feature, so that the agent performs actions according to the state until it reaches the navigation target position.
Another technical scheme adopted by the application is as follows:
A picture navigation device based on multi-scale fine granularity feature fusion, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method as described above.
Another technical scheme adopted by the application is as follows:
A computer-readable storage medium in which a processor-executable program is stored, the processor-executable program, when executed by a processor, being adapted to carry out the method as described above.
The beneficial effects of the application are as follows: the application utilizes the fine-grained object features contained in the high-resolution hidden-layer activation maps of a deep neural network and uses these features as prompts to guide the visual observation model to pay attention to regions of the current environment that are correlated with the target image in terms of both low-level attributes and high-level semantics, thereby improving the agent's ability to infer and find the target position during the exploration stage.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the accompanying drawings of the embodiments of the present application or of the related prior art are briefly described below. It should be understood that the drawings in the following description are only intended to describe some embodiments of the technical solutions of the present application conveniently and clearly, and that other drawings can be obtained from these drawings by those skilled in the art without inventive effort.
FIG. 1 is a flow chart of steps of a picture navigation method based on multi-scale fine granularity feature fusion in an embodiment of the application;
FIG. 2 is a schematic diagram of a multi-scale fine-grained feature fusion module in an embodiment of the application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the application. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
In the description of the present application, it should be understood that references to orientation descriptions such as upper, lower, front, rear, left, right, etc. are based on the orientation or positional relationship shown in the drawings, are merely for convenience of description of the present application and to simplify the description, and do not indicate or imply that the apparatus or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus should not be construed as limiting the present application.
In the description of the present application, "several" means one or more and "a plurality of" means two or more; "greater than", "less than", "exceeding", etc. are understood to exclude the stated number, while "above", "below", "within", etc. are understood to include the stated number. The terms "first" and "second" are used only to distinguish technical features and should not be construed as indicating or implying relative importance, implicitly indicating the number of the indicated technical features, or implicitly indicating the precedence of the indicated technical features.
Furthermore, in the description of the present application, unless otherwise indicated, "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate that A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects are in an "or" relationship.
In the description of the present application, unless explicitly defined otherwise, terms such as arrangement, installation, connection, etc. should be construed broadly and the specific meaning of the terms in the present application can be reasonably determined by a person skilled in the art in combination with the specific contents of the technical scheme.
As shown in fig. 1, the present embodiment provides a picture navigation method based on multi-scale fine granularity feature fusion, which includes the following steps:
s1, acquiring a target image of a navigation target position
S2, obtaining visual observation of the intelligent agent at the current moment in the environment
The visual images observed by the agent in the simulation environment are acquired, including the RGB image taken at the target position and the RGB images observed at each moment. This embodiment uses the public simulator Habitat-Sim and the public dataset ImageNav as training and testing data.
S3, inputting the target image and the visual observation into the multi-scale fine granularity feature fusion module to perform multi-scale fine-grained feature fusion, and outputting the fused visual state feature.
Referring to fig. 2, the proposed multi-scale fine-grained feature fusion module takes as input the picture taken at the target position and the visual observation at the current moment, and outputs the fused visual state feature. The idea is to inject the fine-grained, high-resolution activation maps of the target-picture model into the visual observation model as prompts, guiding the visual observation model to pay attention to the regions of the visual observation that are related to the target position according to the semantic information and the detailed attribute information in the feature maps.
Interpretability studies of neural networks show that the features of different hidden layers capture different types of information about the objects in an image: shallow features generally extract local details of objects, while deep features generally extract global semantic information. Therefore, when multi-scale fine-grained feature fusion is performed, the activation maps of multiple network layers are considered instead of only the final semantic vector output by the model. This allows the model to fuse shallow and deep features simultaneously, lets each network layer of the visual observation model receive a condition input from the corresponding layer to perform the fusion operation, and reduces the ambiguity of the fusion process.
In order to perform fine-grained fusion, this embodiment employs the conditional transformation convolution layer (feature-wise linear modulation, FiLM) as the basic unit of the fusion operation. Unlike fusion methods that use a semantic embedding vector as the transformation condition of every layer, the multi-scale fine-grained activation maps of the target-picture model are used here as the transformation conditions, which makes good use of the fine-grained information in the high-resolution intermediate-layer activation maps. Specifically, an affine transformation is applied inside each residual network block (ResBlock) of the observation model, where the affine transformation factors γ^(l) and β^(l) of the l-th block are conditioned on the activation map A^(l) of matching shape from the corresponding block of the target encoder. This process can be expressed as:
F̂^(l)_c = γ^(l)_c(A^(l)) · F^(l)_c + β^(l)_c(A^(l))
wherein F^(l) denotes the transition activation map in the l-th block and the subscript c denotes the feature in the c-th dimension. The functions f and h learn how to map the condition variable to the affine factors, γ^(l) = f(A^(l)) and β^(l) = h(A^(l)); in practice they are realized as learnable convolution layers whose outputs keep the same resolution as the input and target activation maps, thereby preserving the fine-grained information in the activation maps. This multi-scale fine-grained feature fusion can ultimately be expressed as:
s_t = Φ(o_t | I_g)
wherein Φ(· | I_g) denotes the observation model at the current time step conditioned on the target image I_g, o_t is the visual observation at the current moment, and s_t is the fused visual state feature.
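The sketch below illustrates one possible way to wire this conditional affine transformation into every residual stage of the observation encoder. The torchvision ResNet-18 backbone, the four-stage layout, the 3x3 coefficient convolutions and the final pooling step are assumptions made for the example, not requirements of the application.

```python
import torch
import torch.nn as nn
import torchvision

class MultiScaleFusionEncoder(nn.Module):
    """Two-stream encoder with FiLM-style injection after each residual stage (illustrative sketch)."""

    def __init__(self):
        super().__init__()
        # Two CNNs with identical depth and structure: one for the target image, one for observations.
        self.target_net = torchvision.models.resnet18(weights=None)
        self.obs_net = torchvision.models.resnet18(weights=None)
        stage_channels = [64, 128, 256, 512]     # channel widths of the assumed ResNet-18 stages
        # One pair of learnable convolutions per stage, mapping the target activation map
        # to per-pixel affine coefficients gamma and beta at the same resolution.
        self.to_gamma = nn.ModuleList([nn.Conv2d(c, c, 3, padding=1) for c in stage_channels])
        self.to_beta = nn.ModuleList([nn.Conv2d(c, c, 3, padding=1) for c in stage_channels])

    @staticmethod
    def _stem(net, x):
        x = net.conv1(x)
        x = net.bn1(x)
        x = net.relu(x)
        return net.maxpool(x)

    def forward(self, target_img: torch.Tensor, obs_img: torch.Tensor) -> torch.Tensor:
        t = self._stem(self.target_net, target_img)
        o = self._stem(self.obs_net, obs_img)
        target_stages = [self.target_net.layer1, self.target_net.layer2,
                         self.target_net.layer3, self.target_net.layer4]
        obs_stages = [self.obs_net.layer1, self.obs_net.layer2,
                      self.obs_net.layer3, self.obs_net.layer4]
        for i, (t_stage, o_stage) in enumerate(zip(target_stages, obs_stages)):
            t = t_stage(t)                                    # multi-scale activation map of the target image
            o = o_stage(o)                                    # observation features at the same scale
            o = self.to_gamma[i](t) * o + self.to_beta[i](t)  # conditional affine transformation (FiLM)
        # Fuse the last-layer outputs of both streams into the visual state feature (simple pool + concat here).
        return torch.cat([o.mean(dim=(2, 3)), t.mean(dim=(2, 3))], dim=-1)
```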
S4, predicting the state of the agent at the next moment according to the visual state feature, so that the agent performs actions according to the state until it reaches the navigation target position.
The visual state feature of the agent is passed through a recurrent neural network and combined with the states at historical moments to predict the state feature of the next moment. The state feature is then passed through a fully connected layer to predict a distribution over executable actions, from which the optimal action is selected.
As an alternative embodiment, the agent's action decision predicts, at each moment, the state feature of the next moment from the historical state features and uses this feature to predict the distribution over executable actions from which the optimal action is sampled. Specifically, the fused visual state feature is input into a recurrent neural network (GRU), which combines it with the history state modeled by the network's hidden variable to predict the state at the next moment. A distribution over the 4 executable actions (move forward, turn left, turn right, and stop) is then predicted by a fully connected layer, and the optimal action is sampled according to this distribution. The agent then performs the action, eventually reaching the position indicated by the target picture.
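As a concrete illustration of this decision step, the following PyTorch sketch combines the fused visual state feature with a GRU hidden variable and predicts a distribution over the four executable actions through a fully connected layer; the feature and hidden dimensions and the categorical sampling scheme are assumptions of the example, not values specified by the application.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """GRU state prediction plus a fully connected action head (illustrative sketch)."""

    ACTIONS = ("move_forward", "turn_left", "turn_right", "stop")

    def __init__(self, feat_dim: int = 1024, hidden_dim: int = 512):
        super().__init__()
        self.gru = nn.GRUCell(feat_dim, hidden_dim)          # combines s_t with the modelled history
        self.action_head = nn.Linear(hidden_dim, len(self.ACTIONS))

    def forward(self, state_feat: torch.Tensor, hidden: torch.Tensor):
        # state_feat: fused visual state feature at the current moment, shape (B, feat_dim)
        # hidden:     recurrent hidden variable modelling the historical states, shape (B, hidden_dim)
        next_state = self.gru(state_feat, hidden)            # state of the agent at the next moment
        logits = self.action_head(next_state)                # scores over the 4 executable actions
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()                               # sample the action to execute
        return action, next_state
```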
In summary, the method of this embodiment effectively utilizes the fine-grained object features contained in the high-resolution hidden-layer activation maps of the deep neural network and uses these features as prompts to guide the visual observation model to pay attention to regions of the current environment that are correlated with the target image in terms of both low-level attributes and high-level semantics, thereby improving the agent's ability to infer and find the target position during the exploration stage. Tables 1 and 2 compare the method with the best existing methods on two commonly used picture navigation test splits (ImageNav split A and split B), respectively. With the proposed scheme, both the navigation success rate and the path efficiency are significantly improved on the two picture navigation datasets.
TABLE 1
TABLE 2
The embodiment also provides a picture navigation system based on multi-scale fine granularity feature fusion, which comprises:
a target acquisition module for acquiring a target image of the navigation target position;
an observation acquisition module for acquiring the visual observation of the agent at the current moment in the environment;
a feature fusion module for inputting the target image and the visual observation into the multi-scale fine granularity feature fusion module to perform multi-scale fine-grained feature fusion and outputting the fused visual state feature;
a state prediction module for predicting the state of the agent at the next moment according to the visual state feature, so that the agent performs actions according to the state until it reaches the navigation target position.
The picture navigation system based on multi-scale fine granularity feature fusion of this embodiment can execute the picture navigation method based on multi-scale fine granularity feature fusion provided by the method embodiment, can execute any combination of the implementation steps of the method embodiment, and has the corresponding functions and beneficial effects of the method.
The embodiment also provides a picture navigation device based on multi-scale fine granularity feature fusion, which comprises:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method illustrated in fig. 1.
The picture navigation device based on multi-scale fine granularity feature fusion of this embodiment can execute the picture navigation method based on multi-scale fine granularity feature fusion provided by the method embodiment, can execute any combination of the implementation steps of the method embodiment, and has the corresponding functions and beneficial effects of the method.
Embodiments of the present application also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the method shown in fig. 1.
The embodiment also provides a storage medium storing instructions or a program for executing the picture navigation method based on multi-scale fine granularity feature fusion; when the instructions or the program are run, any combination of the implementation steps of the method embodiment can be executed, with the corresponding functions and beneficial effects of the method.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present application are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the application is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the described functions and/or features may be integrated in a single physical device and/or software module or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present application. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the application as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the application, which is to be defined in the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
In the description of the present specification, reference to the terms "one embodiment/example", "another embodiment/example", "certain embodiments/examples", and the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present application have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the application, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present application has been described in detail, the present application is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present application, and these equivalent modifications and substitutions are intended to be included in the scope of the present application as defined in the appended claims.

Claims (10)

1. A picture navigation method based on multi-scale fine granularity feature fusion is characterized by comprising the following steps:
acquiring a target image of a navigation target position;
acquiring the visual observation of an agent at the current moment in the environment;
inputting the target image and the visual observation into a multi-scale fine granularity feature fusion module to perform multi-scale fine-grained feature fusion, and outputting the fused visual state feature;
predicting the state of the agent at the next moment according to the visual state feature, so that the agent performs actions according to the state until it reaches the navigation target position.
2. The picture navigation method based on multi-scale fine granularity feature fusion according to claim 1, characterized in that inputting the target image and the visual observation into the multi-scale fine granularity feature fusion module to perform multi-scale fine-grained feature fusion and outputting the fused visual state feature comprises:
processing the target image using a first multi-layer convolutional neural network to obtain intermediate-layer activation maps of the target image output by different convolutional layers of the network;
processing the visual observation using a second multi-layer convolutional neural network, wherein the first multi-layer convolutional neural network and the second multi-layer convolutional neural network have the same number of layers and the same network structure;
constructing a feature affine transformer that maps the intermediate-layer activation map of the target image into affine transformation coefficients pixel by pixel; obtaining the output of the corresponding layer of the second multi-layer convolutional neural network and applying an affine transformation based on the affine transformation coefficients to obtain the fused intermediate-layer features; taking the fused intermediate-layer features as the input of the next layer of the second multi-layer convolutional neural network;
fusing the output of the last layer of the first multi-layer convolutional neural network with the output of the last layer of the second multi-layer convolutional neural network to obtain the visual state feature.
3. The picture navigation method based on multi-scale fine granularity feature fusion according to claim 2, wherein the expression of the fused intermediate-layer feature is:
F̂^(l)_c = γ^(l)_c · F^(l)_c + β^(l)_c, with γ^(l) = f(A^(l)) and β^(l) = h(A^(l))
wherein F̂^(l) is the fused intermediate-layer feature output by the l-th layer, F^(l) is the output of the l-th layer of the second multi-layer convolutional neural network, A^(l) is the intermediate-layer activation map of the target image from the corresponding layer of the first multi-layer convolutional neural network, γ^(l) and β^(l) are the affine transformation coefficients of the l-th layer, f and h are the functions that map the condition variable to the affine factors, and the subscript c denotes the feature in the c-th dimension.
4. The picture navigation method based on multi-scale fine granularity feature fusion according to claim 1, wherein the expression of the visual state feature s_t is:
s_t = Φ(o_t | I_g)
wherein Φ(· | I_g) denotes the observation model at the current time step conditioned on the target image I_g, and o_t is the visual observation of the agent at the current moment.
5. The picture navigation method based on multi-scale fine-grained feature fusion according to claim 1, wherein the multi-scale fine-grained feature fusion module adopts a conditional transformation convolution layer as a basic unit of fusion operation.
6. The picture navigation method based on multi-scale fine granularity feature fusion according to claim 1, wherein predicting the state of the agent at the next moment according to the visual state feature comprises:
inputting the fused visual state feature into a recurrent neural network, and obtaining the state feature of the next moment by combining it with the state predictions of historical moments.
7. The picture navigation method based on multi-scale fine granularity feature fusion according to claim 1, wherein the agent performing actions according to the state until reaching the navigation target position comprises:
passing the state through a fully connected layer to predict a distribution over executable actions and selecting the optimal action from the predicted distribution; the agent performs this action and eventually reaches the position indicated by the target image.
8. A picture navigation device based on multi-scale fine granularity feature fusion, comprising:
a target acquisition module for acquiring a target image of the navigation target position;
an observation acquisition module for acquiring the visual observation of the agent at the current moment in the environment;
a feature fusion module for inputting the target image and the visual observation into a multi-scale fine granularity feature fusion module to perform multi-scale fine-grained feature fusion and outputting the fused visual state feature;
a state prediction module for predicting the state of the agent at the next moment according to the visual state feature, so that the agent performs actions according to the state until it reaches the navigation target position.
9. A picture navigation device based on multi-scale fine granularity feature fusion, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any one of claims 1-7.
10. A computer readable storage medium, in which a processor executable program is stored, characterized in that the processor executable program is for performing the method according to any of claims 1-7 when being executed by a processor.
CN202310890318.0A 2023-07-20 2023-07-20 Picture navigation method, device and medium based on multi-scale fine granularity feature fusion Active CN116608866B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310890318.0A CN116608866B (en) 2023-07-20 2023-07-20 Picture navigation method, device and medium based on multi-scale fine granularity feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310890318.0A CN116608866B (en) 2023-07-20 2023-07-20 Picture navigation method, device and medium based on multi-scale fine granularity feature fusion

Publications (2)

Publication Number Publication Date
CN116608866A true CN116608866A (en) 2023-08-18
CN116608866B CN116608866B (en) 2023-09-26

Family

ID=87680434

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310890318.0A Active CN116608866B (en) 2023-07-20 2023-07-20 Picture navigation method, device and medium based on multi-scale fine granularity feature fusion

Country Status (1)

Country Link
CN (1) CN116608866B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120316784A1 (en) * 2011-06-09 2012-12-13 Microsoft Corporation Hybrid-approach for localizaton of an agent
US11023730B1 (en) * 2020-01-02 2021-06-01 International Business Machines Corporation Fine-grained visual recognition in mobile augmented reality
CN113393474A (en) * 2021-06-10 2021-09-14 北京邮电大学 Feature fusion based three-dimensional point cloud classification and segmentation method
CN114692750A (en) * 2022-03-29 2022-07-01 华南师范大学 Fine-grained image classification method and device, electronic equipment and storage medium
CN115830402A (en) * 2023-02-21 2023-03-21 华东交通大学 Fine-grained image recognition classification model training method, device and equipment
CN116263335A (en) * 2023-02-07 2023-06-16 浙江大学 Indoor navigation method based on vision and radar information fusion and reinforcement learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120316784A1 (en) * 2011-06-09 2012-12-13 Microsoft Corporation Hybrid-approach for localizaton of an agent
US11023730B1 (en) * 2020-01-02 2021-06-01 International Business Machines Corporation Fine-grained visual recognition in mobile augmented reality
CN113393474A (en) * 2021-06-10 2021-09-14 北京邮电大学 Feature fusion based three-dimensional point cloud classification and segmentation method
CN114692750A (en) * 2022-03-29 2022-07-01 华南师范大学 Fine-grained image classification method and device, electronic equipment and storage medium
CN116263335A (en) * 2023-02-07 2023-06-16 浙江大学 Indoor navigation method based on vision and radar information fusion and reinforcement learning
CN115830402A (en) * 2023-02-21 2023-03-21 华东交通大学 Fine-grained image recognition classification model training method, device and equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
谭明奎 et al.: "深度对抗视觉生成综述" [A survey of deep adversarial visual generation], 《中国图象图形学报》 [Journal of Image and Graphics], no. 12, pages 2751 - 2766 *

Also Published As

Publication number Publication date
CN116608866B (en) 2023-09-26

Similar Documents

Publication Publication Date Title
Ramakrishnan et al. Occupancy anticipation for efficient exploration and navigation
Chaplot et al. Learning to explore using active neural slam
Duan et al. A survey of embodied ai: From simulators to research tasks
Ravichandran et al. Hierarchical representations and explicit memory: Learning effective navigation policies on 3d scene graphs using graph neural networks
Liang et al. Sscnav: Confidence-aware semantic scene completion for visual semantic navigation
CN105144196B (en) Method and apparatus for calculating camera or object gesture
Zheng et al. Active scene understanding via online semantic reconstruction
Georgakis et al. Uncertainty-driven planner for exploration and navigation
CN114460943B (en) Self-adaptive target navigation method and system for service robot
Nilsson et al. Embodied visual active learning for semantic segmentation
CN116343012B (en) Panoramic image glance path prediction method based on depth Markov model
US20230330846A1 (en) Cross-domain imitation learning using goal conditioned policies
Ye et al. From seeing to moving: A survey on learning for visual indoor navigation (vin)
EP4172861A1 (en) Semi-supervised keypoint based models
Wu et al. Vision-language navigation: a survey and taxonomy
Schmid et al. Explore, approach, and terminate: Evaluating subtasks in active visual object search based on deep reinforcement learning
Zhang et al. A survey of visual navigation: From geometry to embodied AI
Jia et al. Learning to act with affordance-aware multimodal neural slam
CN116499471B (en) Visual language navigation method, device and medium based on open scene map
Niwa et al. Spatio-temporal graph localization networks for image-based navigation
Ehsani et al. Object manipulation via visual target localization
CN116608866B (en) Picture navigation method, device and medium based on multi-scale fine granularity feature fusion
CN116576861A (en) Visual language navigation method, system, device and storage medium
Jaunet et al. Sim2realviz: Visualizing the sim2real gap in robot ego-pose estimation
CN115993783A (en) Method executed by intelligent device and related device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant