CN113409470A - Scene navigation method based on AR, AR glasses, electronic device and storage medium - Google Patents

Scene navigation method based on AR, AR glasses, electronic device and storage medium

Info

Publication number
CN113409470A
Authority
CN
China
Prior art keywords
image
exhibit
feature
network
feature point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110678511.9A
Other languages
Chinese (zh)
Inventor
施展
王森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Companion Technology Co ltd
Original Assignee
Hangzhou Companion Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Companion Technology Co ltd filed Critical Hangzhou Companion Technology Co ltd
Priority to CN202110678511.9A priority Critical patent/CN113409470A/en
Publication of CN113409470A publication Critical patent/CN113409470A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 19/00 Manipulating 3D models or images for computer graphics
    • G06T 19/006 Mixed reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 19/00 Manipulating 3D models or images for computer graphics
    • G06T 19/003 Navigation within 3D models or images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Computer Graphics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to an AR-based scene navigation method, AR glasses, an electronic device and a storage medium. A real-time image of the spatial region reached by the user's visual field is acquired; an exhibit image is extracted from the real-time image; a global feature descriptor and feature point descriptors are extracted from the exhibit image; at least one similar image matching the exhibit image is obtained from the images to be retrieved by preliminary matching according to the global feature descriptors; a first similar image most similar to the exhibit image is obtained by accurate matching according to the feature point descriptors; the spatial pose of the user relative to the exhibit image is determined according to the position information, on the exhibit image, of the feature points in the matched feature point pairs; and virtual information corresponding to the first similar image is obtained according to the spatial pose and displayed, superimposed, in the spatial region reached by the user's visual field. This solves the problems of poor AR-based scene navigation effect and high computing resource overhead, improves the AR-based scene navigation effect, and reduces the computing resource overhead.

Description

Scene navigation method based on AR, AR glasses, electronic device and storage medium
Technical Field
The present application relates to the field of AR technologies, and in particular, to an AR-based scene navigation method, AR glasses, an electronic device, and a storage medium.
Background
The related art scene navigation method based on AR (Augmented Reality) includes the following key steps:
Step 1, shooting a scene image with the AR glasses and extracting feature points of the image; manually designed features are used to extract the feature points.
Step 2, establishing an offline bag-of-words model; the model is obtained by applying a descriptor-based feature coding algorithm, such as BoW or VLAD, to the feature points extracted in step 1 on a given image dataset.
Step 3, matching the feature points; optimized matching point pairs are obtained with a random sample consensus (RANSAC) algorithm.
Step 4, calculating the pose; according to the matching point pairs obtained in step 3, a corresponding transformation matrix is calculated through geometric constraints to obtain the position of the scene image.
Step 5, displaying the position of the scene image for navigation.
The above scheme has the following defects:
(1) Images shot in the same scene still differ from one another, because ambient illumination is complex and users wear the AR glasses at different angles; locating the scene directly by matching feature points of the scene images therefore yields inaccurate positioning and takes a long time.
(2) Manually designed feature points model certain special local areas of the image on the basis of geometric and mathematical insights; their drawbacks are that they attend only to local characteristics of the image, are sensitive to illumination and viewing-angle changes, generalize weakly, and run at unstable speed. The offline bag-of-words model is a codebook obtained by clustering these manually designed local feature points, so it cannot describe the global features of the image, which greatly limits its feature expression capability.
(3) AR glasses are portable devices with low computing power and limited battery capacity, and hence short endurance; the related-art AR-based scene navigation method cannot balance computation against navigation effect.
No effective solution has yet been proposed for the problems of poor AR-based scene navigation effect and high computing resource overhead in the related art.
Disclosure of Invention
The embodiment provides an AR-based scene navigation method, AR glasses, an electronic device and a storage medium, so as to solve the problems that the AR-based scene navigation effect is poor and the computing resource overhead is large in the related art.
In a first aspect, in this embodiment, an AR-based scene navigation method is provided, including:
acquiring a real-time image of a space region where a user visual field reaches, and performing target recognition processing on the real-time image to obtain an exhibit image in the real-time image;
carrying out feature extraction processing on the exhibit image to obtain a global feature descriptor and a feature point descriptor of the exhibit image;
acquiring a global feature descriptor of an image to be retrieved, and determining at least one similar image matched with the image of the exhibit from the image to be retrieved according to the image of the exhibit and the global feature descriptor of the image to be retrieved; acquiring feature point descriptors of the similar images, acquiring matched feature point pairs according to the feature point descriptors of the exhibit images and the images to be retrieved, and determining a first similar image with the largest number of matched feature point pairs from the similar images;
acquiring position information of the feature points in the matched feature point pair on the exhibit image, and determining the spatial pose of the user relative to the exhibit image according to the position information; and acquiring virtual information corresponding to the first similar image according to the space pose, and displaying the virtual information in a spatial region where the visual field of the user reaches in an overlapping manner.
In some of these embodiments, the virtual information includes the coordinates, in the exhibition hall, of the current exhibit reached by the user's view.
In some of these embodiments, the virtual information includes preset navigation information pointing from a current exhibit to a next exhibit.
In some embodiments, the performing feature extraction processing on the exhibit image to obtain a global feature descriptor and a feature point descriptor of the exhibit image includes:
performing feature extraction processing on the exhibit image by adopting a pre-trained convolutional neural network, wherein the pre-trained convolutional neural network comprises a first network and a second network, and a hidden layer of the first network is connected with an input end of the second network;
the method for extracting the features of the exhibit image by adopting the pre-trained convolutional neural network comprises the following steps:
inputting the exhibit image into the first network for feature extraction processing, and outputting shallow features and the global feature descriptor;
and inputting the shallow feature into the second network for feature extraction processing, and outputting the feature point descriptor.
In some embodiments, inputting the shallow features into the second network for feature extraction and outputting the feature point descriptors comprises:
inputting the shallow features into the second network for feature extraction processing to obtain the feature map output by the last convolutional layer in the second network;
carrying out normalization processing on the feature map to obtain a feature point response score map;
and determining the feature points whose scores are larger than a preset threshold in the feature point response score map, and outputting the feature point descriptors of these feature points.
In some of these embodiments, the shallow features include line information and/or edge information of the exhibit image; the global feature descriptor comprises the overall structure information of the exhibit image; the feature point descriptor includes local structure information of the exhibit image.
In some of these embodiments, a method of training the convolutional neural network comprises:
training the first network to obtain the network weight of the first network;
training the second network according to the network weight of the first network.
In a second aspect, this embodiment provides AR glasses, comprising a camera, a display screen and a processing unit, wherein the processing unit is connected to the camera and the display screen; wherein,
the camera is used for shooting a real-time image;
the processing unit is configured to execute the AR-based scene navigation method according to the first aspect;
the display screen is used for playing the virtual information generated by the processing unit.
In a third aspect, in this embodiment, there is provided an electronic apparatus, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the processor implements the AR-based scene navigation method according to the first aspect.
In a fourth aspect, there is provided in the present embodiment a storage medium having stored thereon a computer program which, when executed by a processor, implements the AR-based scene navigation method of the first aspect described above.
Compared with the related art, the scene navigation method based on the AR, the AR glasses, the electronic device and the storage medium provided in the embodiment solve the problems that the scene navigation effect based on the AR is poor and the computing resource cost is large in the related art, improve the scene navigation effect based on the AR and reduce the computing resource cost.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a block diagram of a hardware structure of a terminal of the AR-based scene navigation method according to the present embodiment;
fig. 2 is a flowchart of the AR-based scene navigation method of the present embodiment;
FIG. 3 is a schematic structural diagram of a pre-trained convolutional neural network of the present embodiment;
FIG. 4 is a schematic structural diagram of a pre-trained convolutional neural network of the present preferred embodiment;
fig. 5 is a schematic structural view of the AR eyeglasses of the present embodiment;
fig. 6 is a flowchart of the operation of the AR eyeglasses of the present embodiment.
Detailed Description
For a clearer understanding of the objects, aspects and advantages of the present application, reference is made to the following description and accompanying drawings.
Unless defined otherwise, technical or scientific terms used herein shall have the same general meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The use of the terms "a" and "an" and "the" and similar referents in the context of this application do not denote a limitation of quantity, either in the singular or the plural. The terms "comprises," "comprising," "has," "having," and any variations thereof, as referred to in this application, are intended to cover non-exclusive inclusions; for example, a process, method, and system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or modules, but may include other steps or modules (elements) not listed or inherent to such process, method, article, or apparatus. Reference throughout this application to "connected," "coupled," and the like is not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. Reference to "a plurality" in this application means two or more. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. In general, the character "/" indicates a relationship in which the objects associated before and after are an "or". The terms "first," "second," "third," and the like in this application are used for distinguishing between similar items and not necessarily for describing a particular sequential or chronological order.
The method embodiments provided in the present embodiment may be executed in a terminal, a computer, or a similar computing device. For example, the method is executed on a terminal, and fig. 1 is a block diagram of a hardware structure of the terminal of the AR-based scene navigation method according to the embodiment. As shown in fig. 1, the terminal may include one or more processors 102 (only one shown in fig. 1) and a memory 104 for storing data, wherein the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA. The terminal may also include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those of ordinary skill in the art that the structure shown in fig. 1 is merely an illustration and is not intended to limit the structure of the terminal described above. For example, the terminal may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to the AR-based scene navigation method in the present embodiment, and the processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, so as to implement the above-mentioned method. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. The network described above includes a wireless network provided by a communication provider of the terminal. In one example, the transmission device 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
In this embodiment, an AR-based scene navigation method is provided, and fig. 2 is a flowchart of the AR-based scene navigation method of this embodiment, as shown in fig. 2, the flowchart includes the following steps:
step S201, acquiring a real-time image of a spatial region where a user view is located, and performing target recognition processing on the real-time image to obtain an exhibit image in the real-time image.
The method of this embodiment can be applied to AR glasses. After a user puts on the AR glasses, the surrounding environment can be shot directly through the AR glasses to obtain a real-time image. The real-time image is preprocessed, candidate areas are selected from it, features are extracted from the candidate areas, the extracted features are classified by a trained classifier, and the target in the real-time image is determined from the classification result.
Wherein the spatial area comprises a public place such as a museum, an exhibition hall or a library. Taking a museum as an example, when a user uses AR glasses to shoot a scene, the obtained real-time image may be mixed with any one or more of exhibits, backgrounds and visitors, and the present embodiment can distinguish the exhibits from numerous other things through this step.
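By way of illustration only, a minimal Python sketch of this candidate-region and classification flow is given below, with an off-the-shelf pretrained detector standing in for the trained classifier described above; the model choice, score threshold and single-box cropping are assumptions, not the embodiment's own detector.

```python
import torch
import torchvision

# A COCO-pretrained Faster R-CNN used purely as a stand-in for the exhibit classifier.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def extract_exhibit(frame, score_thresh=0.7):
    """frame: (3, H, W) float tensor in [0, 1] captured by the AR glasses camera."""
    with torch.no_grad():
        out = detector([frame])[0]           # candidate regions with classification scores
    keep = out["scores"] > score_thresh
    if not keep.any():
        return None                          # no exhibit found in this frame
    x1, y1, x2, y2 = out["boxes"][keep][0].round().int().tolist()
    return frame[:, y1:y2, x1:x2]            # cropped exhibit image for later feature extraction
```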
Step S202, feature extraction processing is carried out on the exhibit image, and a global feature descriptor and a feature point descriptor of the exhibit image are obtained.
If feature point matching were carried out directly on the real-time image, additional interference factors would be introduced, lowering the matching accuracy and lengthening the processing time. Compared with other objects in the scene, the features of an exhibit are relatively stable, so feature extraction on the exhibit runs at a relatively stable speed; using the exhibit image for feature point matching removes the interference factors and saves computing resources for the subsequent feature extraction and feature matching on the exhibit image.
Step S203, acquiring a global feature descriptor of the image to be retrieved, and determining at least one similar image matched with the image of the exhibit from the image to be retrieved according to the image of the exhibit and the global feature descriptor of the image to be retrieved; and obtaining the feature point descriptors of the similar images, obtaining matched feature point pairs according to the feature point descriptors of the exhibit image and the image to be retrieved, and determining the first similar image with the largest number of matched feature point pairs from the similar images.
The images to be retrieved are prestored exhibit images. The global feature descriptor represents the overall structure of an exhibit image, and matching global feature descriptors takes less time than matching feature points; matching the images to be retrieved against the exhibit image by global feature descriptors therefore preliminarily screens out similar images, narrowing the matching range at a small matching cost.
Once the similar images have been preliminarily screened and the matching range narrowed, the best-matching similar image can be identified accurately through feature point descriptor matching without a long matching time.
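By way of illustration, a minimal sketch of this two-stage matching in Python follows; it assumes L2-normalized descriptors, a shortlist size of 6 as in the preferred embodiment below, and a simple mutual-nearest-neighbour rule for counting matched feature point pairs.

```python
import numpy as np

def retrieve(query_global, query_desc, gallery, shortlist_size=6):
    """gallery: list of dicts with 'global' (D,) and 'desc' (K, d) arrays per stored image."""
    # Stage 1: preliminary matching with global feature descriptors (cosine similarity).
    sims = np.array([g["global"] @ query_global for g in gallery])
    shortlist = np.argsort(-sims)[:shortlist_size]

    # Stage 2: accurate matching with feature point descriptors (mutual nearest neighbours).
    def count_matches(desc_a, desc_b):
        dist = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
        nn_ab, nn_ba = dist.argmin(axis=1), dist.argmin(axis=0)
        return int(np.sum(nn_ba[nn_ab] == np.arange(len(desc_a))))

    counts = [count_matches(query_desc, gallery[i]["desc"]) for i in shortlist]
    best = int(shortlist[int(np.argmax(counts))])   # first similar image: most matched pairs
    return best, counts
```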
Step S204, acquiring the position information of the feature points in the matched feature point pairs on the exhibit image, and determining the spatial pose of the user relative to the exhibit image according to the position information; and acquiring virtual information corresponding to the first similar image according to the spatial pose, and displaying the virtual information in a spatial region where the visual field of the user reaches in an overlapping manner.
In the feature extraction of step S202, in addition to the feature descriptors, the position of each feature point in the image may also be extracted. In this embodiment, the Essential Matrix corresponding to the matched feature points can be calculated according to epipolar geometry and decomposed by Singular Value Decomposition (SVD) to obtain the spatial pose relationship between the camera of the AR glasses and the target exhibit, so that the AR glasses present 3D content for different spatial poses as the observation angle changes. For example, when the exhibit is a pottery pot, different textures are observed from its left side, front and right side; within a certain range of movement most feature points of the pot remain the same, but the essential matrix computed from them changes with the observation angle, so the spatial pose of the user relative to the exhibit image can be calculated and virtual information about the pot at different angles (such as enlarged images of different parts of the pot) can be provided, enhancing the scene navigation effect and the user experience.
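By way of illustration, a minimal OpenCV sketch of this pose computation from the matched feature point positions is given below; the camera intrinsics K and the RANSAC parameters are assumptions.

```python
import cv2
import numpy as np

def relative_pose(pts_query, pts_gallery, K):
    """pts_*: (N, 2) pixel coordinates of the matched feature point pairs; K: 3x3 intrinsics."""
    pts1 = np.asarray(pts_query, dtype=np.float64)
    pts2 = np.asarray(pts_gallery, dtype=np.float64)
    E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                prob=0.999, threshold=1.0)
    # recoverPose decomposes the essential matrix (via SVD internally) into R and a unit-scale t.
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
    return R, t   # rotation and translation of the AR-glasses camera relative to the matched view
```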
In some embodiments, virtual information refers to content that may be displayed on an electronic display screen, including but not limited to: text, picture, video, audio.
In steps S201 to S204, the real-time image preprocessing stage removes interference factors by extracting the exhibit image from the real-time image and performing feature matching on that exhibit image, which saves computing resources for the subsequent feature extraction and feature matching. Furthermore, matching the global feature descriptors first and the feature point descriptors second strengthens the image characterization capability of the features, avoids the time complexity of directly matching a unique similar image against all images to be retrieved, and avoids the space complexity of loading every image into memory. When the virtual information of the exhibit image is displayed, 3D content for different spatial poses is presented as the observation angle changes, enhancing the scene navigation effect. Through these steps, the problems of poor AR-based scene navigation effect and high computing resource overhead are solved, the AR-based scene navigation effect is improved, and the computing resource overhead is reduced.
In some embodiments, the virtual information may mark the coordinates, in the exhibition hall, of the current exhibit reached by the user's view; it may mark preset guide information pointing from the current exhibit to the next exhibit; or it may mark both at the same time.
For example, the virtual information of the first similar image in a certain spatial pose may indicate the position of an exhibition hall's exit, and the preset guide information may lead from that exit to the entrance of the next exhibition hall.
In some embodiments, the AR glasses back-end maintains a table that represents virtual information for different exhibits, which may be updated as the exhibit location moves.
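As a simple illustration, such a back-end table could be kept as a mapping from exhibit identifiers to their virtual information; the identifiers, field names and values below are hypothetical.

```python
# Hypothetical back-end table of virtual information, keyed by exhibit identifier.
virtual_info = {
    "exhibit_A": {"hall_coords": (12.5, 3.0), "next": "exhibit_B",
                  "media": ["intro.txt", "detail_left.png", "detail_right.png"]},
    "exhibit_B": {"hall_coords": (18.0, 7.5), "next": "exhibit_C",
                  "media": ["intro.txt"]},
}

def move_exhibit(exhibit_id, new_coords):
    """Update the table when an exhibit is relocated within the exhibition hall."""
    virtual_info[exhibit_id]["hall_coords"] = new_coords
```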
In the related art, feature points are extracted with manually designed features; they are sensitive to illumination and viewing-angle changes, generalize weakly, run at unstable speed, and greatly limit the feature expression capability. To address these problems, in some embodiments a pre-trained convolutional neural network performs the feature extraction on the exhibit image, so that the resulting features have strong generalization and characterization capability, overcoming the drawbacks of manually designed feature points and improving the accuracy of feature matching.
Considering the low computing power and limited battery capacity of the AR glasses, and in order to balance the amount of computation against the navigation effect, fig. 3 shows a schematic structural diagram of the pre-trained convolutional neural network of this embodiment. As shown in fig. 3, the pre-trained convolutional neural network comprises a first network and a second network, and a hidden layer of the first network is connected to an input end of the second network.
Inputting the exhibit image to an input end of a first network, performing feature extraction processing via the first network, outputting shallow features from a hidden layer of the first network, and outputting a global feature descriptor from an output layer of the first network.
The shallow layer features are input to an input end of a second network, feature extraction processing is performed through the second network, and feature point descriptors are output from an output layer of the second network.
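By way of illustration, a minimal PyTorch sketch of this two-part network is given below; the layer counts, channel widths and descriptor dimensions are assumptions chosen for readability and do not reproduce the configuration of fig. 4.

```python
import torch
import torch.nn as nn

class FirstNetwork(nn.Module):
    """Hidden layers emit shallow features; the head emits a 512-dimensional global descriptor."""
    def __init__(self, desc_dim=512):
        super().__init__()
        self.shallow = nn.Sequential(                        # hidden layers shared with the second network
            nn.Conv2d(3, 16, 3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.deep = nn.Sequential(
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, desc_dim)

    def forward(self, x):
        shallow = self.shallow(x)                            # line / edge level information
        global_desc = self.fc(self.deep(shallow).flatten(1)) # global feature descriptor
        return shallow, global_desc

class SecondNetwork(nn.Module):
    """Consumes the shallow features; outputs dense descriptors and a response score map."""
    def __init__(self, desc_dim=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.desc_head = nn.Conv2d(64, desc_dim, 1)          # per-pixel feature point descriptors
        self.score_head = nn.Conv2d(64, 1, 1)                # last convolutional layer for the response map

    def forward(self, shallow):
        f = self.body(shallow)
        descriptors = self.desc_head(f)
        scores = torch.sigmoid(self.score_head(f))           # normalized to [0, 1] response scores
        return descriptors, scores
```

In such an arrangement the exhibit image passes through the first network once, and the second network only reuses the already-computed shallow features, which is where the computation saving described below comes from.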
In this embodiment, the convolutional neural network that extracts the global feature descriptor and the convolutional neural network that extracts the feature point descriptors are integrated into a single network, which simplifies the deployment of the convolutional neural network in the AR glasses, reduces its amount of computation, and noticeably speeds up feature matching.
In this embodiment, the shallow features include line information and/or edge information of the exhibit image; the global feature descriptor includes the overall structure information of the exhibit image; and the feature point descriptors include local structure information of the exhibit image.
Further, in some embodiments, inputting the shallow features into the second network for feature extraction and outputting the feature point descriptors is implemented as follows:
inputting the shallow features into the second network for feature extraction processing to obtain the feature map output by the last convolutional layer in the second network; carrying out normalization processing on the feature map to obtain a feature point response score map; and determining the feature points whose scores are larger than a preset threshold in the feature point response score map, and outputting the feature point descriptors of these feature points.
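By way of illustration, a minimal sketch of this thresholding step follows, assuming the score map and dense descriptor map produced by a network like the sketch above; the threshold value is an assumption.

```python
import torch
import torch.nn.functional as F

def select_keypoints(scores, descriptors, threshold=0.5):
    """scores: (1, 1, H, W) response score map in [0, 1]; descriptors: (1, D, H, W)."""
    mask = scores[0, 0] > threshold                      # keep responses above the preset threshold
    ys, xs = torch.nonzero(mask, as_tuple=True)
    keypoints = torch.stack([xs, ys], dim=1)             # (K, 2) pixel positions of the feature points
    desc = descriptors[0, :, ys, xs].t()                 # (K, D) descriptors at those positions
    desc = F.normalize(desc, dim=1)                      # unit length for later descriptor matching
    return keypoints, desc
```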
In this embodiment, the feature point response score map contains a plurality of candidate feature points; the pixel value at each candidate feature point represents the probability that this pixel position is a true feature point, and ranges from 0 to 1. The label for the feature point response map may be generated as follows:
when the second network is trained with supervision, a preset feature point response map with the same resolution as the exhibit image is used as the label; its response value at each feature point center is 1, and the response value of pixels near a feature point center is determined by a Gaussian distribution with a preset variance.
In this embodiment, the feature point labels thus carry their information in the form of a heatmap. Compared with directly regressing the feature point coordinates, label information generated in the heatmap manner makes the training process converge better and the feature point positions more stable; it also makes it possible to filter out feature point descriptors that do not meet the preset condition, reducing the descriptor matching time.
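A minimal sketch of generating such a Gaussian label heatmap is given below; taking the maximum over overlapping Gaussians is an assumption about how nearby feature points are combined.

```python
import numpy as np

def gaussian_label_heatmap(height, width, keypoints, sigma=2.0):
    """Label heatmap: value 1 at each feature point center, Gaussian falloff nearby."""
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float32)
    for (x, y) in keypoints:
        g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)   # center response 1, neighbours set by the preset variance
    return heatmap
```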
In some of these embodiments, a method of training a convolutional neural network comprises: training a first network to obtain the network weight of the first network; and training the second network according to the network weight of the first network.
For example, when training the first network, the initial learning rate is set to 0.1, the learning rate is decayed to 0.1 times its previous value every 10 training epochs, and training ends when the learning rate reaches 0.0001.
After the training of the first network is finished, its network weights are frozen and the second network is trained jointly on top of it. When training the second network, the initial learning rate is set to 0.01, the learning rate is decayed to 0.1 times its previous value every 5 training epochs, and training ends at 0.0001. The weight of the hidden-layer branch that produces the feature point descriptors may be set to 0.8, and the weight of the branch that produces the feature point response map to 0.2.
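By way of illustration, a minimal PyTorch sketch of this two-stage schedule follows; the placeholder modules, the choice of SGD, and the reading of the 0.8 / 0.2 ratio as loss weights are assumptions.

```python
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

first_net = nn.Conv2d(3, 32, 3, padding=1)    # placeholder standing in for the first network
second_net = nn.Conv2d(32, 64, 3, padding=1)  # placeholder standing in for the second network

# Stage 1: train the first network alone (initial lr 0.1, x0.1 every 10 epochs, stop at 1e-4).
opt1 = SGD(first_net.parameters(), lr=0.1)
sched1 = StepLR(opt1, step_size=10, gamma=0.1)

# Stage 2: freeze the first network's weights, then train the second network jointly.
for p in first_net.parameters():
    p.requires_grad = False
opt2 = SGD(second_net.parameters(), lr=0.01)
sched2 = StepLR(opt2, step_size=5, gamma=0.1)  # initial lr 0.01, x0.1 every 5 epochs, stop at 1e-4

def second_stage_loss(descriptor_loss, response_map_loss):
    # 0.8 / 0.2 weighting of the two branches, read here as a weighted sum of their losses
    return 0.8 * descriptor_loss + 0.2 * response_map_loss
```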
Fig. 4 is a schematic structural diagram of the pre-trained convolutional neural network of the preferred embodiment, and as shown in fig. 4, the pre-trained convolutional neural network includes:
the device comprises a first network and a second network, wherein a hidden layer of the first network is connected with an input end of the second network.
The labels shown in the figure represent the properties of each convolution: Conv and MBConv both denote convolution blocks, MBConv denoting a depthwise separable (mobile inverted bottleneck) convolution; Channels represents the number of channels; KernelSize represents the convolution kernel size; and Stride represents the step size of the convolution kernel. For example, a convolution labelled "16, <3 × 3>, 1" has 16 channels, a <3 × 3> kernel and a stride of 1.
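By way of illustration, the label "16, <3 × 3>, 1" corresponds to the following PyTorch layers; the input channel count (taken here as 3) comes from the preceding layer and is an assumption.

```python
import torch.nn as nn

# "16, <3 x 3>, 1": 16 output channels, a 3x3 kernel, stride 1.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)

# An MBConv-style depthwise separable equivalent carrying the same label:
mbconv = nn.Sequential(
    nn.Conv2d(3, 3, kernel_size=3, stride=1, padding=1, groups=3),  # depthwise 3x3
    nn.Conv2d(3, 16, kernel_size=1),                                # pointwise 1x1
)
```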
In the preferred embodiment, the global feature descriptor is a one-dimensional vector of length 512. The first network may be trained on a large-scale landmark dataset, such as Google Landmarks Dataset v2. The classification loss function L1 employed by the first network is as follows:
$$L_1=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\,s\left(\cos\left(m_1\theta_{y_i}+m_2\right)-m_3\right)}}{e^{\,s\left(\cos\left(m_1\theta_{y_i}+m_2\right)-m_3\right)}+\sum_{j\neq y_i}e^{\,s\cos\theta_j}}$$
where N represents the number of samples in a batch (BatchSize) during the training process.
i represents the index of the current sample within the batch.
s represents a characteristic scale factor.
m1 represents a multiplicative angular distance factor, m2 an additive angular distance factor, and m3 an additive cosine distance factor.
yi represents the class to which the sample numbered i belongs.
θ represents the angle between the feature vector of the current sample i and the class-boundary weight.
j represents the index of the summation over all training classes.
The classification loss function applies feature normalization and weight normalization and constrains the feature boundary through m1, m2 and m3, so that the learned global feature descriptor has stronger expression capability.
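By way of illustration, a minimal PyTorch sketch of a combined-margin classification loss of this form is given below; the default scale and margin values are assumptions, not the values used in the embodiment.

```python
import torch
import torch.nn.functional as F

def combined_margin_loss(cos_theta, labels, s=30.0, m1=1.0, m2=0.3, m3=0.2):
    """cos_theta: (N, C) cosines between L2-normalized features and class weights."""
    theta = torch.acos(cos_theta.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
    target_theta = theta.gather(1, labels.view(-1, 1))               # angle to the true class y_i
    target_logit = torch.cos(m1 * target_theta + m2) - m3            # apply the three margins
    logits = cos_theta.scatter(1, labels.view(-1, 1), target_logit)  # replace only the target entry
    return F.cross_entropy(s * logits, labels)                       # softmax over scaled logits
```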
In the second network, the classification loss function L2 adopted for outputting the feature point response score map is as follows:
$$\hat{y}_k(i,j)=G\big((i,j);(x_k,y_k),\sigma\big)=\exp\!\left(-\frac{(i-x_k)^{2}+(j-y_k)^{2}}{2\sigma^{2}}\right)$$

$$L_2=\sum_{k}\sum_{i,j}\big\lVert y_k(i,j)-\hat{y}_k(i,j)\big\rVert_2^{2}$$
wherein G represents a gaussian distribution.
i represents the x-direction coordinate on the feature map.
j represents the y-direction coordinate on the feature map.
k represents a feature point number.
σ represents the variance.
y represents information of each feature point in the image.
ŷ represents the true value of the training sample.
‖·‖ represents taking the two-norm of the feature distance.
In the second network, the classification loss function L3 used for outputting feature point descriptors is as follows:
$$L_3=\frac{1}{N}\sum_{i=1}^{N}\Big[S_i\,\big\lVert F(x_i)-F(x'_i)\big\rVert_2^{2}+(1-S_i)\,\max\!\big(0,\,1-\big\lVert F(x_i)-F(x'_i)\big\rVert_2\big)^{2}\Big]$$
where N represents the number of samples in a batch (BatchSize) during the training process.
x represents the position coordinates of the sampling point of the first image, and x' represents the position coordinates of the sampling point of the second image.
i represents a sample number.
S labels a training sample pair, where S = 1 denotes a positive training sample pair and S = 0 a negative training sample pair.
F represents the local feature at x.
‖·‖ represents taking the two-norm of the feature distance.
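By way of illustration, a contrastive-style descriptor loss over such positive and negative sampling-point pairs can be sketched as follows; the margin value is an assumption.

```python
import torch
import torch.nn.functional as F

def descriptor_pair_loss(feat_a, feat_b, labels, margin=1.0):
    """feat_a, feat_b: (N, D) local features F at corresponding sampling points x and x';
    labels: (N,) with 1 for positive pairs (S = 1) and 0 for negative pairs (S = 0)."""
    dist = torch.norm(feat_a - feat_b, dim=1)              # two-norm of the feature distance
    pos = labels * dist.pow(2)                             # pull matched descriptors together
    neg = (1 - labels) * F.relu(margin - dist).pow(2)      # push unmatched descriptors apart
    return (pos + neg).mean()
```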
In this embodiment, AR glasses are further provided. Fig. 5 is a schematic structural diagram of the AR glasses of this embodiment. As shown in fig. 5, the AR glasses comprise a camera 51, a display screen 52 and a processing unit 53, the processing unit 53 being connected to the camera 51 and the display screen 52; the camera 51 is used for shooting a real-time image; the processing unit 53 is configured to execute the AR-based scene navigation method of the above embodiment; and the display screen 52 is used to play the virtual information generated by the processing unit 53.
The present embodiment is described and illustrated below by means of preferred embodiments.
For example, for an exhibit A in a museum, 10 photos of exhibit A are stored in advance. Features of each photo are extracted with the pre-trained convolutional neural network of this application to obtain its global feature descriptor, feature point descriptors and feature point response score map, which are packaged and added to the image library to be retrieved.
When a user wears the AR glasses to visit exhibit A on site, the AR glasses implement AR-based scene navigation through the process shown in fig. 6, which includes the following steps:
Step S61, shooting exhibit A to obtain a real-time image of exhibit A.
Step S62, extracting exhibit A from the real-time image to obtain an exhibit image.
Step S63, performing feature extraction on the exhibit image with the pre-trained convolutional neural network to obtain the global feature descriptor, feature point descriptors and feature point response score map of the exhibit image.
Step S64, performing preliminary matching with the global feature descriptor, and querying from the image library to be retrieved the 6 similar images that best match the exhibit image.
Step S65, setting a threshold T, screening out the points in the feature point response score map whose response scores reach the threshold T as feature points, and obtaining the feature point descriptors of these feature points.
Step S66, performing accurate matching with the feature point descriptors of the screened feature points, and selecting from the 6 similar images the first similar image with the largest number of matched feature point pairs.
Step S67, acquiring the position information of the feature points of the matched feature point pairs on the exhibit image, and determining the spatial pose of the user relative to the exhibit image according to the position information; acquiring virtual information corresponding to the first similar image according to the spatial pose, and displaying the virtual information superimposed on the AR glasses.
There is also provided in this embodiment an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
and S1, acquiring a real-time image of a space region where the visual field of the user reaches, and performing target recognition processing on the real-time image to obtain an exhibit image in the real-time image.
And S2, performing feature extraction processing on the exhibit image to obtain a global feature descriptor and a feature point descriptor of the exhibit image.
S3, acquiring the global feature descriptor of the image to be retrieved, and determining at least one similar image matched with the image of the exhibit from the image to be retrieved according to the image of the exhibit and the global feature descriptor of the image to be retrieved; and obtaining the feature point descriptors of the similar images, obtaining matched feature point pairs according to the feature point descriptors of the exhibit image and the image to be retrieved, and determining the first similar image with the largest number of matched feature point pairs from the similar images.
S4, acquiring the position information of the feature points in the matched feature point pair on the exhibit image, and determining the spatial pose of the user relative to the exhibit image according to the position information; and acquiring virtual information corresponding to the first similar image according to the spatial pose, and displaying the virtual information in a spatial region where the visual field of the user reaches in an overlapping manner.
It should be noted that, for specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments and optional implementations, and details are not described again in this embodiment.
In addition, in combination with the AR-based scene navigation method provided in the above embodiment, a storage medium may also be provided to implement the method in this embodiment. The storage medium having stored thereon a computer program; the computer program, when executed by a processor, implements any of the AR-based scene navigation methods in the above embodiments.
It should be understood that the specific embodiments described herein are merely illustrative of this application and are not intended to be limiting. All other embodiments, which can be derived by a person skilled in the art from the examples provided herein without any inventive step, shall fall within the scope of protection of the present application.
It is obvious that the drawings are only examples or embodiments of the present application, and it is obvious to those skilled in the art that the present application can be applied to other similar cases according to the drawings without creative efforts. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
The term "embodiment" is used herein to mean that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is to be expressly or implicitly understood by one of ordinary skill in the art that the embodiments described in this application may be combined with other embodiments without conflict.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the patent protection. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. An AR-based scene navigation method, comprising:
acquiring a real-time image of a space region where a user visual field reaches, and performing target recognition processing on the real-time image to obtain an exhibit image in the real-time image;
carrying out feature extraction processing on the exhibit image to obtain a global feature descriptor and a feature point descriptor of the exhibit image;
acquiring a global feature descriptor of an image to be retrieved, and determining at least one similar image matched with the image of the exhibit from the image to be retrieved according to the image of the exhibit and the global feature descriptor of the image to be retrieved; acquiring feature point descriptors of the similar images, acquiring matched feature point pairs according to the feature point descriptors of the exhibit images and the images to be retrieved, and determining a first similar image with the largest number of matched feature point pairs from the similar images;
acquiring position information of the feature points in the matched feature point pair on the exhibit image, and determining the spatial pose of the user relative to the exhibit image according to the position information; and acquiring virtual information corresponding to the first similar image according to the space pose, and displaying the virtual information in a spatial region where the visual field of the user reaches in an overlapping manner.
2. The AR-based scene navigation method of claim 1, wherein the virtual information includes coordinates of a current exhibit reached by a user's view in an exhibition hall.
3. The AR-based scene guide method of claim 1 or 2, wherein the virtual information comprises preset guide information directed from a current exhibit to a next exhibit.
4. The AR-based scene navigation method according to claim 1, wherein the performing the feature extraction process on the exhibit image to obtain the global feature descriptor and the feature descriptor of the exhibit image comprises:
performing feature extraction processing on the exhibit image by adopting a pre-trained convolutional neural network, wherein the pre-trained convolutional neural network comprises a first network and a second network, and a hidden layer of the first network is connected with an input end of the second network;
the method for extracting the features of the exhibit image by adopting the pre-trained convolutional neural network comprises the following steps:
inputting the exhibit image into the first network for feature extraction processing, and outputting shallow features and the global feature descriptor;
and inputting the shallow feature into the second network for feature extraction processing, and outputting the feature point descriptor.
5. The AR-based scene navigation method according to claim 4, wherein the shallow feature is input to the second network for feature extraction, and the outputting the feature point descriptor comprises:
inputting the shallow layer features into the second network for feature extraction processing to obtain a feature map output by the last convolutional layer in the second network;
carrying out normalization processing on the feature map to obtain a feature point response score map;
and determining the characteristic points which are larger than a preset threshold value in the characteristic point response score map, and outputting the characteristic point descriptors of the characteristic points.
6. The AR-based scene navigation method according to claim 4, wherein the shallow feature comprises line information and/or edge information of the showpiece image; the global feature descriptor comprises the overall structure information of the exhibit image; the feature point descriptor includes local structure information of the exhibit image.
7. The AR-based scene navigation method of claim 4, wherein the method of training the convolutional neural network comprises:
training the first network to obtain the network weight of the first network;
training the second network according to the network weight of the first network.
8. AR eyewear, comprising: the system comprises a camera, a display screen and a processing unit, wherein the processing unit is connected with the camera and the display screen; wherein,
the camera is used for shooting a real-time image;
the processing unit is configured to perform the AR based scene navigation method of any of claims 1 to 7;
the display screen is used for playing the virtual information generated by the processing unit.
9. An electronic apparatus comprising a memory and a processor, wherein the memory has stored therein a computer program, and the processor is configured to execute the computer program to perform the AR-based scene navigation method according to any one of claims 1 to 7.
10. A computer readable storage medium having a computer program stored thereon, wherein the computer program, when being executed by a processor, performs the steps of the AR based scene navigation method according to any one of claims 1 to 7.
CN202110678511.9A 2021-06-18 2021-06-18 Scene navigation method based on AR, AR glasses, electronic device and storage medium Pending CN113409470A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110678511.9A CN113409470A (en) 2021-06-18 2021-06-18 Scene navigation method based on AR, AR glasses, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110678511.9A CN113409470A (en) 2021-06-18 2021-06-18 Scene navigation method based on AR, AR glasses, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN113409470A true CN113409470A (en) 2021-09-17

Family

ID=77681361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110678511.9A Pending CN113409470A (en) 2021-06-18 2021-06-18 Scene navigation method based on AR, AR glasses, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN113409470A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530881A (en) * 2013-10-16 2014-01-22 北京理工大学 Outdoor augmented reality mark-point-free tracking registration method applicable to mobile terminal
CN111046125A (en) * 2019-12-16 2020-04-21 视辰信息科技(上海)有限公司 Visual positioning method, system and computer readable storage medium
CN111627065A (en) * 2020-05-15 2020-09-04 Oppo广东移动通信有限公司 Visual positioning method and device and storage medium
CN111638793A (en) * 2020-06-04 2020-09-08 浙江商汤科技开发有限公司 Aircraft display method and device, electronic equipment and storage medium
CN112215964A (en) * 2020-09-28 2021-01-12 杭州灵伴科技有限公司 Scene navigation method and device based on AR

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294488A (en) * 2022-10-10 2022-11-04 江西财经大学 AR rapid object matching display method

Similar Documents

Publication Publication Date Title
EP3786892B1 (en) Method, device and apparatus for repositioning in camera orientation tracking process, and storage medium
US11734844B2 (en) 3D hand shape and pose estimation
CN107993191B (en) Image processing method and device
US20200074205A1 (en) Methods and apparatuses for vehicle appearance feature recognition, methods and apparatuses for vehicle retrieval, storage medium, and electronic devices
CN111046125A (en) Visual positioning method, system and computer readable storage medium
US8880563B2 (en) Image search by query object segmentation
US11386625B2 (en) 3D graphic interaction based on scan
US11704357B2 (en) Shape-based graphics search
US11341728B2 (en) Online transaction based on currency scan
CN110866469B (en) Facial five sense organs identification method, device, equipment and medium
KR102668172B1 (en) Identification of physical products for augmented reality experiences in messaging systems
US11620829B2 (en) Visual matching with a messaging application
CN110766081B (en) Interface image detection method, model training method and related device
US11604963B2 (en) Feedback adversarial learning
CN106407978B (en) Method for detecting salient object in unconstrained video by combining similarity degree
CN112308977B (en) Video processing method, video processing device, and storage medium
CN113822427A (en) Model training method, image matching device and storage medium
CN108597034B (en) Method and apparatus for generating information
CN113409470A (en) Scene navigation method based on AR, AR glasses, electronic device and storage medium
CN111832483B (en) Point-of-interest validity identification method, device, equipment and storage medium
CN111968030B (en) Information generation method, apparatus, electronic device and computer readable medium
CN112306243A (en) Data processing method, device, equipment and storage medium
CN113723168A (en) Artificial intelligence-based subject identification method, related device and storage medium
CN113486717A (en) Behavior recognition method and device
Plósz et al. Practical aspects of visual recognition for indoor mobile positioning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination