CN116643648B - Three-dimensional scene matching interaction method, device, equipment and storage medium - Google Patents

Three-dimensional scene matching interaction method, device, equipment and storage medium Download PDF

Info

Publication number
CN116643648B
CN116643648B
Authority
CN
China
Prior art keywords
dimensional
image
target
information
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310391596.1A
Other languages
Chinese (zh)
Other versions
CN116643648A (en)
Inventor
戴健
吴锐
刘歆浏
祝本明
任珍文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China South Industries Group Automation Research Institute
Original Assignee
China South Industries Group Automation Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China South Industries Group Automation Research Institute filed Critical China South Industries Group Automation Research Institute
Priority to CN202310391596.1A priority Critical patent/CN116643648B/en
Publication of CN116643648A publication Critical patent/CN116643648A/en
Application granted granted Critical
Publication of CN116643648B publication Critical patent/CN116643648B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/012Head tracking input arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a three-dimensional scene matching interaction method, apparatus, device and storage medium that rely on image acquisition, three-dimensional retrieval and human-computer interaction technologies to acquire two-dimensional VR images within the user's field of view in a real application scene. A scene-based model matching algorithm is provided: a model matching search algorithm retrieves models similar to the scene objects from a three-dimensional model library, and the three-dimensional model in the user's virtual view is actively adjusted to follow the user's posture, thereby achieving an immersive human-computer interaction effect.

Description

Three-dimensional scene matching interaction method, device, equipment and storage medium
Technical Field
The present invention relates to the field of virtual reality technologies, and in particular, to a three-dimensional scene matching interaction method, apparatus, device, and storage medium.
Background
Virtual Reality (VR) technology is a computer simulation technology that creates and lets users experience a virtual world by using a computer to generate a simulated environment. Through VR technology, a three-dimensional dynamic view and entity behavior simulation system with multi-source information fusion and seamless human-computer interaction can be constructed for the user.
The virtual-real combined three-dimensional scene matching technology is a new display matching technology whose application demand has emerged with the development of computer software and hardware. The virtual-real combination technology blends the virtual environment into the real scene around the user, providing an intuitive and enhanced user experience, while the three-dimensional matching technology offers a higher degree of operational freedom in three-dimensional space, producing a more intuitive and realistic feeling.
In prior-art VR systems, binocular stereoscopic vision plays a major role. A binocular stereoscopic vision system consists of left and right optical waveguide display modules and a vision sensor. The acquisition of digital images is the source of stereoscopic information. A common stereoscopic image pair is generally obtained by shooting the same scene with two cameras at different positions, or by moving or rotating one camera. There are various ways to acquire the images, determined mainly by the specific application and purpose.
However, in binocular stereoscopic systems the different images seen by the user's two eyes are generated separately and usually have to be presented on different displays. Some systems use a single display: after the user puts on special glasses, one eye sees only the odd frames and the other eye sees only the even frames, and the difference between the odd and even frames, i.e. the parallax, produces the stereoscopic effect. It can be seen that, in the prior art, providing stereoscopic images to a user with a binocular stereoscopic vision system requires either two displays or special glasses, which increases the manufacturing cost of the device.
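For context on the parallax mechanism described above, standard pinhole stereo geometry gives depth as Z = f·B/d, where f is the focal length in pixels, B the baseline between the two viewpoints and d the parallax (disparity) in pixels. The following minimal sketch is generic stereo geometry with illustrative numbers, not something specified in this patent.

```python
# Standard stereo geometry (not specific to this patent): depth from parallax.
# Z = f * B / d, where f is the focal length in pixels, B the baseline between
# the two viewpoints in metres, and d the disparity (parallax) in pixels.

def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# Example: f = 1000 px, B = 6.5 cm (a typical interpupillary distance), d = 20 px
print(depth_from_disparity(1000.0, 0.065, 20.0))  # -> 3.25 m
```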
Meanwhile, in the prior art, a plain visual simultaneous localization and mapping (SLAM) algorithm is generally adopted to track the user's head. However, during rapid camera motion, plain visual SLAM suffers from motion blur and a small inter-frame overlap area, so feature matching is difficult, robustness is poor, and positioning accuracy is low.
Disclosure of Invention
In view of the foregoing, the present invention provides a three-dimensional scene matching interaction method, apparatus, device and storage medium for overcoming or at least partially solving the foregoing problems.
The invention provides the following scheme:
a three-dimensional scene matching interaction method, comprising:
acquiring a two-dimensional image and visual angle information, focal length information and depth information in a user visual field at the current moment; the two-dimensional image includes a target object;
matching the two-dimensional image with a three-dimensional model in a three-dimensional model library by utilizing the view angle information, the focal length information and the depth information to obtain a target three-dimensional model similar to the target object;
acquiring the posture data of the head of a user at the current moment, and determining the relative motion relation between the head posture of the user and the target three-dimensional model by utilizing the posture data;
determining the inclination angle of the image selection frame and the corresponding position relation between the central point of the image selection frame and the target object according to the posture data;
and adjusting the target three-dimensional model to a target pose and placing it into the image selection frame by combining the inclination angle of the image selection frame, the corresponding position relation between the center point of the image selection frame and the target object, and the relative motion relation.
Preferably: acquiring a two-dimensional image and visual angle information, focal length information and depth information in a user visual field at the current moment; comprising the following steps:
collecting pixel components through an image collecting module, wherein the image collecting module comprises a plurality of light sensing assemblies;
detecting light rays in the visual field, and forming a mosaic color filter array on the grid of the respective light-sensing assemblies through a three-color filter array;
performing interpolation processing on the obtained color information to obtain a green component, a red component and a blue component of each pixel point;
and making a judgment according to the user's body posture and motion, and feeding the signals into a multi-view model to obtain a two-dimensional image and visual angle information, focal length information and depth information in the user's visual field at the current moment.
Preferably: matching the two-dimensional image with a three-dimensional model in a three-dimensional model library by utilizing the view angle information, the focal length information and the depth information to obtain a target three-dimensional model similar to the target object; comprising the following steps:
acquiring a geometric distance between the user and the target object through the two-dimensional image;
selecting, in the three-dimensional model library, a plurality of alternative three-dimensional models with similar corresponding contours by using the geometric distance, the visual angle information, the focal length information and the depth of field information in combination with a two-dimensional-to-three-dimensional contour matching model;
and obtaining the target three-dimensional model from the plurality of alternative three-dimensional models through a feature matching method.
Preferably: the feature matching method comprises the following steps:
acquiring image features contained in the two-dimensional image and a plurality of model features of a plurality of alternative three-dimensional models;
mapping the image features and a plurality of model features in a unified feature space by using a feature mapping model, and obtaining a plurality of similarity values according to the distances between the image features and the model features;
and taking the candidate three-dimensional model corresponding to the model feature with the highest similarity value as the target three-dimensional model.
Preferably: the posture data of the head of the user comprise the position of the head, the rotation angle of the head and the rotation direction of the head.
Preferably: acquiring the posture data of the head of the user at the current moment comprises the following steps:
acquiring the roll angle of the current head collected by the gyroscope; and meanwhile, tracking the information of the photon sensor on the head-mounted display in combination with the fixed positioner to obtain the posture data of the user's head at the current moment.
A three-dimensional scene matching interaction device, the device comprising:
the two-dimensional image acquisition unit is used for acquiring a two-dimensional image, visual angle information, focal length information and depth of field information in the visual field of the user at the current moment; the two-dimensional image includes a target object;
the target three-dimensional model acquisition unit is used for matching the two-dimensional image with a three-dimensional model in a three-dimensional model library by utilizing the visual angle information, the focal length information and the depth information to obtain a target three-dimensional model similar to the target object;
the relative motion relation determining unit is used for acquiring the posture data of the head of the user at the current moment and determining the relative motion relation between the head pose of the user and the target three-dimensional model by utilizing the posture data;
the corresponding position relation determining unit is used for determining the inclination angle of the image selection frame and the corresponding position relation between the central point of the image selection frame and the target object according to the posture data;
and the target three-dimensional model adjusting unit is used for adjusting the target three-dimensional model to a target pose and placing the target three-dimensional model into the image selection frame by combining the inclination angle of the image selection frame, the corresponding position relation between the center point of the image selection frame and the target object and the relative motion relation.
A three-dimensional scene matching interaction device, the device comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is used for executing the three-dimensional scene matching interaction method according to the instructions in the program codes.
A computer readable storage medium for storing program code for performing the three-dimensional scene matching interaction method described above.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
according to the three-dimensional scene matching interaction method, device, equipment and storage medium, the two-dimensional VR images in the user field of view are collected in a real application scene by means of an image collection technology, a three-dimensional retrieval technology and a man-machine interaction technology. The scene-based model matching algorithm is provided, and the immersive human-computer interaction effect is realized through a model matching search algorithm similar to a scene object in a three-dimensional model library and a technology of actively adjusting a three-dimensional model in a virtual view of a user along with the gesture of the user.
Of course, it is not necessary for any product practicing the invention to achieve all of the above advantages at the same time.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments will be briefly described below. It is evident that the drawings in the following description are only some embodiments of the present invention and that other drawings may be obtained from these drawings by those of ordinary skill in the art without inventive effort.
FIG. 1 is a flow chart of a three-dimensional scene matching interaction method provided by an embodiment of the invention;
FIG. 2 is a block diagram of scene-based model matching provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of a man-machine interaction scheme provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of a two-dimensional image-based three-dimensional model retrieval provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of a three-dimensional scene matching interaction device according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a three-dimensional scene matching interaction device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which are derived by a person skilled in the art based on the embodiments of the invention, fall within the scope of protection of the invention.
Referring to fig. 1, a three-dimensional scene matching interaction method provided by an embodiment of the present invention may include:
s101: acquiring a two-dimensional image and visual angle information, focal length information and depth information in a user visual field at the current moment; the two-dimensional image includes a target object; specifically, the pixel components are collected through an image collection module, wherein the image collection module comprises a plurality of light sensing assemblies;
detecting light rays in the visual field, and forming a mosaic color filter array on the grid of the respective light-sensing assemblies through a three-color filter array;
performing interpolation processing on the obtained color information to obtain a green component, a red component and a blue component of each pixel point;
and making a judgment according to the user's body posture and motion, and feeding the signals into a multi-view model to obtain a two-dimensional image, visual angle information, focal length information and depth information in the user's visual field at the current moment.
S102: matching the two-dimensional image with a three-dimensional model in a three-dimensional model library by utilizing the view angle information, the focal length information and the depth information to obtain a target three-dimensional model similar to the target object; specifically, the geometric distance between the user and the target object is obtained through the two-dimensional image;
selecting, in the three-dimensional model library, a plurality of alternative three-dimensional models with similar corresponding contours by using the geometric distance, the visual angle information, the focal length information and the depth of field information in combination with the two-dimensional-to-three-dimensional contour matching model;
and obtaining the target three-dimensional model from the plurality of alternative three-dimensional models through a feature matching method.
Acquiring image features contained in the two-dimensional image and a plurality of model features of a plurality of alternative three-dimensional models;
mapping the image features and a plurality of model features in a unified feature space by using a feature mapping model, and obtaining a plurality of similarity values according to the distances between the image features and the model features;
and taking the candidate three-dimensional model corresponding to the model feature with the highest similarity value as the target three-dimensional model.
S103: acquiring the posture data of the user's head at the current moment, and determining the relative motion relation between the user's head posture and the target three-dimensional model by utilizing the posture data. Specifically, the posture data of the user's head comprises the position of the head, the head rotation angle and the head rotation direction. The roll angle of the current head is acquired by the gyroscope; meanwhile, the information of the photon sensor on the head-mounted display is tracked in combination with the fixed positioner to obtain the posture data of the user's head at the current moment.
S104: determining the inclination angle of the image selection frame and the corresponding position relation between the central point of the image selection frame and the target object according to the posture data;
S105: adjusting the target three-dimensional model to a target pose and placing it into the image selection frame by combining the inclination angle of the image selection frame, the corresponding position relation between the center point of the image selection frame and the target object, and the relative motion relation.
According to the three-dimensional scene matching interaction method provided by the embodiment of the present application, two-dimensional VR images within the user's field of view are acquired in a real application scene by means of image acquisition, three-dimensional retrieval and human-computer interaction technologies. A scene-based model matching algorithm is provided: a model matching search algorithm retrieves models similar to the scene objects from a three-dimensional model library, and the three-dimensional model in the user's virtual view is actively adjusted to follow the user's posture, thereby achieving an immersive human-computer interaction effect.
The following describes the three-dimensional scene matching interaction method provided by the embodiment of the application in detail.
Virtual Reality (VR) technology is a computer simulation technology that creates and lets users experience a virtual world by using a computer to generate a simulated environment. Through VR technology, a three-dimensional dynamic view and entity behavior simulation system with multi-source information fusion and seamless human-computer interaction can be constructed for the user. In view of this, the embodiment of the present application relies on image acquisition, three-dimensional retrieval and human-computer interaction technologies, and acquires two-dimensional VR images within the user's field of view in a real application scene. The scene-based model matching algorithm mainly comprises a model matching search algorithm for models similar to the scene object in a three-dimensional model library and a technique for actively adjusting the three-dimensional model in the user's virtual view along with the user's posture, thereby realizing an immersive human-computer interaction effect.
As shown in fig. 2 and fig. 3, the method provided in the embodiment of the present application introduces a contour matching model and a feature matching model, retrieves the three-dimensional model most similar to an object in the scene from a three-dimensional model library, and then uses the pose information of the VR glasses to align the retrieved model with the corresponding object in the input scene. This realizes three-dimensional model matching for two-dimensional images in real application scenes and, at the same time, improves the interactive experience between the user and the virtual model.
The method can be divided into three technical stages: sensing, retrieval and interaction. The specific implementation flow is as follows:
(1) Sensing: acquire two-dimensional multi-view images of the object with the support of multiple sensors.
VR image acquisition relies on multi-sensor technology to acquire images within the user's field of view. The image acquisition module collects the pixel components: light rays in the field of view, such as natural light and artificial light, are detected through wide-angle, standard and telephoto lenses; a mosaic color filter array is formed on the grid of the respective light-sensing assemblies through a three-color filter array; and the green, red and blue components of each pixel point are finally obtained by interpolating the collected color information. Meanwhile, a judgment is made according to the user's body posture and motion, and the signals are fed into the designed multi-view model to obtain the two-dimensional image in the user's field of view at the current moment together with its visual angle, focal length, depth of field and other information.
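The disclosure does not spell out the interpolation scheme used to recover the three color components. The sketch below is a minimal, non-normative illustration that assumes a conventional RGGB Bayer mosaic and bilinear demosaicing, purely to show how a full green, red and blue value can be obtained for every pixel from a color-filter-array image; the function names are hypothetical.

```python
import numpy as np

def conv2d_same(img: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """'Same'-size 2D filtering with zero padding (the kernels used here are symmetric,
    so correlation and convolution coincide)."""
    ph, pw = kernel.shape[0] // 2, kernel.shape[1] // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros(img.shape, dtype=float)
    for dy in range(kernel.shape[0]):
        for dx in range(kernel.shape[1]):
            out += kernel[dy, dx] * padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out

def demosaic_rggb(mosaic: np.ndarray) -> np.ndarray:
    """Bilinear demosaicing of an assumed RGGB Bayer mosaic: each sensor pixel records a
    single colour behind the colour filter array, and the two missing components are
    interpolated from neighbouring samples, giving R, G and B planes for every pixel."""
    h, w = mosaic.shape
    r_mask = np.zeros((h, w)); r_mask[0::2, 0::2] = 1        # red sample positions
    b_mask = np.zeros((h, w)); b_mask[1::2, 1::2] = 1        # blue sample positions
    g_mask = 1 - r_mask - b_mask                             # green sample positions
    k_g = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]]) / 4.0    # green interpolation kernel
    k_rb = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 4.0   # red/blue interpolation kernel
    r = conv2d_same(mosaic * r_mask, k_rb)
    g = conv2d_same(mosaic * g_mask, k_g)
    b = conv2d_same(mosaic * b_mask, k_rb)
    return np.dstack([r, g, b])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    raw = rng.random((8, 8))              # stand-in for raw colour-filter-array readings
    rgb = demosaic_rggb(raw)
    print(rgb.shape)                      # (8, 8, 3): green, red and blue per pixel
```

Other filter-array layouts or edge-aware interpolation schemes could be substituted without changing the overall acquisition flow.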
(2) Retrieval: taking the two-dimensional image as input, retrieve the corresponding three-dimensional model.
The present application starts from establishing the association between the image and the three-dimensional model: using the image acquired by a single VR helmet, a three-dimensional model similar to the image is retrieved, thereby realizing three-dimensional retrieval based on a two-dimensional image.
At the macro level, information such as the geometric distance, visual angle, focal length and depth of field of the two-dimensional image is analyzed, and corresponding models are selected from the three-dimensional model library by means of the designed two-dimensional-to-three-dimensional contour matching model.
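The internals of the two-dimensional-to-three-dimensional contour matching model are not detailed in the disclosure. One plausible sketch, under the assumptions that each library model stores a pre-rendered silhouette and that the pinhole ratio of focal length to geometric distance predicts the on-image size, is to scale each stored silhouette accordingly and rank candidates by contour overlap (IoU); all names and parameter values below are hypothetical.

```python
import numpy as np

def scale_silhouette(mask: np.ndarray, scale: float) -> np.ndarray:
    """Nearest-neighbour rescale of a binary silhouette (kept dependency-free)."""
    h, w = mask.shape
    nh, nw = max(1, int(round(h * scale))), max(1, int(round(w * scale)))
    ys = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    xs = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    return mask[np.ix_(ys, xs)]

def iou_centered(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two binary masks after placing them on a common, centred canvas."""
    H, W = max(a.shape[0], b.shape[0]), max(a.shape[1], b.shape[1])
    def centre(m):
        canvas = np.zeros((H, W), dtype=bool)
        y0, x0 = (H - m.shape[0]) // 2, (W - m.shape[1]) // 2
        canvas[y0:y0 + m.shape[0], x0:x0 + m.shape[1]] = m > 0
        return canvas
    A, B = centre(a), centre(b)
    union = np.logical_or(A, B).sum()
    return float(np.logical_and(A, B).sum() / union) if union else 0.0

def select_candidates(target_mask, library_masks, focal_length, distance, k=3):
    """Keep the k library models whose stored silhouettes, scaled by the pinhole ratio
    focal_length / distance, best overlap the target object's contour."""
    scale = focal_length / distance
    scores = {name: iou_centered(target_mask, scale_silhouette(m, scale))
              for name, m in library_masks.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

if __name__ == "__main__":
    target = np.zeros((60, 60)); target[15:45, 20:40] = 1      # observed object contour
    library = {"box":   np.pad(np.ones((30, 20)), 15),
               "plate": np.pad(np.ones((6, 40)),  15),
               "post":  np.pad(np.ones((40, 6)),  15)}
    print(select_candidates(target, library, focal_length=1.0, distance=1.0))
```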
At the micro level, the feature matching relation between the image features and the three-dimensional models is analyzed. The designed feature mapping model maps the input image and the models in the three-dimensional model library into a unified feature space, and the degree of similarity between the image and each three-dimensional model is judged from the distance between their features, so that the association between the image and the three-dimensional models is fully established. Finally, a target three-dimensional model similar to the target object is output; the implementation process is shown in fig. 4.
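The feature mapping model itself is not specified in the disclosure. The sketch below assumes it has already produced fixed-length embeddings for the image and for each alternative model in the shared feature space, and uses cosine similarity as the distance-derived similarity value, returning the highest-scoring candidate as the target three-dimensional model; the names are placeholders.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def pick_target_model(image_feature: np.ndarray,
                      model_features: dict[str, np.ndarray]) -> tuple[str, float]:
    """Rank the alternative models by the similarity between the image feature and each
    model feature (both assumed already mapped into the same feature space by the
    feature mapping model) and return the most similar candidate."""
    scored = {name: cosine_similarity(image_feature, f)
              for name, f in model_features.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    img_f = rng.normal(size=128)                           # embedding of the 2D image
    lib = {f"model_{i}": rng.normal(size=128) for i in range(5)}
    lib["model_3"] = img_f + 0.05 * rng.normal(size=128)   # one near-duplicate model
    print(pick_target_model(img_f, lib))                   # -> ('model_3', ~1.0)
```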
(3) Interaction: adjust the pose of the three-dimensional model based on the user's body posture and motion.
In order to ensure that, when the head pose changes, the corresponding geometric transformation of the three-dimensional model is realized and the close interaction between the user and the model is improved, the embodiment of the present application obtains the roll angle of the current head through a gyroscope; meanwhile, the information of the photon sensor on the head-mounted display is tracked in combination with the fixed positioner, so that the head posture data at the current moment is recorded.
When the head posture data is acquired and the head is positioned, the head posture data can also be obtained in a complementary manner by binocular SLAM and an IMU. Mixed-reality augmented display glasses are generally used in environments with complex conditions, so realizing autonomous spatial perception of the glasses is important, and the simultaneous localization and mapping (SLAM) algorithm is the core technology of spatial perception and positioning. During rapid camera motion, plain visual SLAM suffers from motion blur and a small inter-frame overlap area, so feature matching is difficult, robustness is poor and positioning accuracy is low. The IMU provides a better estimate of rapid motion and camera rotation attitude over short time spans, while the camera information can effectively suppress the static drift of the IMU. Monocular visual SLAM has drawbacks such as an undetermined initialization scale and scale drift during tracking, so binocular SLAM is selected to complement the IMU, enabling the camera, i.e. the head, to be tracked and positioned rapidly.
The coupling methods between vision and IMU inertial measurement include loose coupling and tight coupling. In loose coupling the two are mutually independent: each is computed separately and the system attitude data are estimated by filtering the two results, whereas in tight coupling the visual image features and the IMU-integrated position, orientation and velocity are fused together to output an optimized position, orientation and velocity. Tight coupling requires more computation than loose coupling, places higher real-time demands on the system, and cannot proceed when part of the information is undetermined. In order to improve real-time performance and raise the frame rate and robustness of the computation, the loose coupling mode is selected here.
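The embodiment states only that the loose coupling mode is chosen; the concrete estimator is not given. As an illustration of what loose coupling means, independent estimates fused after the fact, here is a minimal complementary-filter sketch on the yaw angle in which the gyroscope handles fast motion and the low-rate SLAM observation removes the gyroscope's drift; the rates, bias and blending factor are invented for the demo.

```python
import numpy as np

def loosely_coupled_yaw(gyro_rate_dps, dt, slam_yaw_deg, slam_period, alpha=0.98):
    """Loose coupling sketch: the IMU (gyro) and the visual SLAM front end each produce
    their own estimate; a complementary filter blends them, letting the gyro handle fast
    motion while SLAM corrects the gyro's slow drift.
    gyro_rate_dps: per-step angular rates (deg/s); slam_yaw_deg: absolute yaw
    observations available once every `slam_period` steps."""
    yaw, fused = 0.0, []
    for k, rate in enumerate(gyro_rate_dps):
        yaw += rate * dt                                       # high-rate IMU propagation
        if k % slam_period == 0 and k // slam_period < len(slam_yaw_deg):
            yaw = alpha * yaw + (1 - alpha) * slam_yaw_deg[k // slam_period]
        fused.append(yaw)
    return np.array(fused)

if __name__ == "__main__":
    dt, n = 0.01, 500
    true_rate = np.full(n, 10.0)                  # constant 10 deg/s head turn
    gyro = true_rate + 0.5                        # gyro with a constant bias
    slam = (np.arange(0, n, 25) * dt) * 10.0      # drift-free but low-rate SLAM yaw
    est = loosely_coupled_yaw(gyro, dt, slam, slam_period=25)
    print(round(float(est[-1]), 2), "vs true", round(n * dt * 10.0, 2))
```

A Kalman-filter-based fusion would follow the same loosely coupled structure, only with a principled weighting in place of the fixed blending factor.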
The visual data and inertial data are fused in an optimization manner: a visual-inertial odometer is established, and the inertial measurements and the visual observations are jointly optimized, thereby achieving the goals of refinement, optimization and fusion. After the user's head is tracked and positioned, the user's gesture interaction space can be located according to parameters such as the intrinsic and extrinsic parameters of the camera that collects the hand information.
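How the gesture interaction space is located from the camera parameters is not elaborated in the disclosure; the underlying computation is the standard pinhole projection x ~ K(RX + t), shown below with made-up intrinsic and extrinsic values to indicate where a hand point would appear in the image.

```python
import numpy as np

def project_point(K: np.ndarray, R: np.ndarray, t: np.ndarray, X_world: np.ndarray):
    """Project a 3D point into the image with intrinsics K and extrinsics [R | t]:
    x ~ K (R X + t). Used here to indicate where the user's hand would appear, i.e.
    to delimit the gesture interaction space in the camera image."""
    X_cam = R @ X_world + t
    if X_cam[2] <= 0:
        return None                      # behind the camera: outside the interaction space
    u, v, w = K @ X_cam
    return np.array([u / w, v / w])

if __name__ == "__main__":
    K = np.array([[800.0, 0.0, 320.0],   # fx, 0, cx
                  [0.0, 800.0, 240.0],   # 0, fy, cy
                  [0.0,   0.0,   1.0]])
    R, t = np.eye(3), np.zeros(3)        # camera at the world origin, looking along +z
    hand = np.array([0.1, -0.05, 0.5])   # a hand point half a metre in front
    print(project_point(K, R, t, hand))  # -> approx [480., 160.]
```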
By integrating the data of the gyroscope and the fixed positioner and combining them with the designed head pose model, the obtained data such as the position, rotation angle and rotation direction of the head are used to approximately determine the relative motion relation between the head pose and the three-dimensional model, as well as the inclination angle of the image selection frame and the corresponding position relation between the center point of the image selection frame and the object. In this way, the three-dimensional object is accurately placed into the image selection frame when the head rotates, and the viewing angle at which the target three-dimensional model is presented to the user rotates correspondingly with the head. The inclination angle of the image selection frame can be adjusted for the three-dimensional object according to the difference between the preset angle of the image selection frame and the head rotation angle, so that three-dimensional matching between the motion and the model is realized when the head pose changes.
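A minimal sketch of the adjustment rule described above, assuming that the model rotates together with the head (a Rodrigues rotation about the head's rotation direction) and that the selection-frame tilt is simply the difference between its preset angle and the head rotation angle; the exact composition used by the embodiment is not disclosed, so these choices and all names below are illustrative.

```python
import numpy as np

def rotation_about(axis: np.ndarray, angle_deg: float) -> np.ndarray:
    """Rodrigues' rotation matrix for a unit axis and an angle in degrees."""
    a = np.radians(angle_deg)
    k = axis / np.linalg.norm(axis)
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(a) * K + (1 - np.cos(a)) * (K @ K)

def adjust_model_for_head(model_pose: np.ndarray,
                          head_axis: np.ndarray,
                          head_angle_deg: float,
                          frame_preset_deg: float):
    """Rotate the target model so the view presented to the user follows the head, and
    tilt the image selection frame by the difference between its preset angle and the
    head rotation angle (the rule stated in the description)."""
    head_R = rotation_about(head_axis, head_angle_deg)
    new_pose = head_R @ model_pose             # relative motion: model follows the head
    frame_tilt = frame_preset_deg - head_angle_deg
    return new_pose, frame_tilt

if __name__ == "__main__":
    pose0 = np.eye(3)                           # retrieved model in its canonical pose
    new_pose, tilt = adjust_model_for_head(pose0,
                                            head_axis=np.array([0.0, 1.0, 0.0]),
                                            head_angle_deg=20.0,
                                            frame_preset_deg=5.0)
    print(round(tilt, 1))                       # -> -15.0 (frame tilts against the turn)
    print(np.round(new_pose, 3))
```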
Referring to fig. 5, the embodiment of the present application may further provide a three-dimensional scene matching interaction device, as shown in fig. 5, where the device may include:
a two-dimensional image acquiring unit 501, configured to acquire a two-dimensional image, view angle information, focal length information, and depth of field information in a user's view at a current moment; the two-dimensional image includes a target object;
a target three-dimensional model obtaining unit 502, configured to match the two-dimensional image with a three-dimensional model in a three-dimensional model library by using the view angle information, the focal length information, and the depth information, to obtain a target three-dimensional model similar to the target object;
a relative motion relationship determining unit 503, configured to obtain pose data of a head of a user at a current moment, and determine a relative motion relationship between a pose of the head of the user and the target three-dimensional model using the pose data;
a corresponding position relation determining unit 504, configured to determine, according to the posture data, the inclination angle of the image selection frame and the corresponding position relation between the center point of the image selection frame and the target object;
the target three-dimensional model adjusting unit 505 is configured to adjust the target three-dimensional model to a target pose and place the target three-dimensional model into the image selection frame in combination with an inclination angle of the image selection frame, a corresponding positional relationship between a center point of the image selection frame and the target object, and the relative motion relationship.
The embodiment of the application can also provide three-dimensional scene matching interaction equipment, which comprises a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is used for executing the steps of the three-dimensional scene matching interaction method according to the instructions in the program codes.
As shown in fig. 6, a three-dimensional scene matching interaction device provided in an embodiment of the present application may include: a processor 10, a memory 11, a communication interface 12 and a communication bus 13. The processor 10, the memory 11 and the communication interface 12 all complete communication with each other through a communication bus 13.
In the present embodiment, the processor 10 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a field-programmable gate array (FPGA), or another programmable logic device, etc.
The processor 10 may call a program stored in the memory 11, and in particular, the processor 10 may perform operations in an embodiment of the three-dimensional scene matching interaction method.
The memory 11 is used for storing one or more programs, and the programs may include program codes, where the program codes include computer operation instructions, and in this embodiment, at least the programs for implementing the following functions are stored in the memory 11:
acquiring a two-dimensional image and visual angle information, focal length information and depth information in a user visual field at the current moment; the two-dimensional image includes a target object;
matching the two-dimensional image with a three-dimensional model in a three-dimensional model library by utilizing the view angle information, the focal length information and the depth information to obtain a target three-dimensional model similar to the target object;
acquiring the posture data of the head of a user at the current moment, and determining the relative motion relation between the head posture of the user and the target three-dimensional model by utilizing the posture data;
determining the inclination angle of the image selection frame and the corresponding position relation between the central point of the image selection frame and the target object according to the posture data;
and adjusting the target three-dimensional model to a target pose and placing it into the image selection frame by combining the inclination angle of the image selection frame, the corresponding position relation between the center point of the image selection frame and the target object, and the relative motion relation.
In one possible implementation, the memory 11 may include a program storage area and a data storage area, where the program storage area may store an operating system and application programs required for at least one function (such as a file creation function or a data read-write function), and the data storage area may store data created during use, such as initialization data.
In addition, the memory 11 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device or other non-volatile solid-state storage device.
The communication interface 12 may be an interface of a communication module for interfacing with other devices or systems.
Of course, it should be noted that the structure shown in fig. 6 does not limit the three-dimensional scene matching interaction device in the embodiment of the present application, and in practical application, the three-dimensional scene matching interaction device may include more or fewer components than those shown in fig. 6, or some components may be combined.
Embodiments of the present application may also provide a computer readable storage medium for storing program code for performing the steps of the three-dimensional scene matching interaction method described above.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
From the description of the embodiments above, it will be apparent to those skilled in the art that the present application may be implemented in software plus the necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the methods described in the embodiments or some parts of the embodiments of the present application.
In this specification, each embodiment is described in a progressive manner; identical and similar parts of the embodiments may refer to each other, and each embodiment focuses on its differences from the others. In particular, for the system and system embodiments, since they are substantially similar to the method embodiments, the description is relatively brief, and reference may be made to the corresponding parts of the method embodiments. The systems and system embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement the present invention without undue burden.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (9)

1. A three-dimensional scene matching interaction method, comprising:
acquiring a two-dimensional image and visual angle information, focal length information and depth information in a user visual field at the current moment; the two-dimensional image includes a target object;
matching the two-dimensional image with a three-dimensional model in a three-dimensional model library by utilizing the view angle information, the focal length information and the depth information to obtain a target three-dimensional model similar to the target object;
acquiring the posture data of the head of the user at the current moment, and determining the relative motion relation between the head posture of the user and the target three-dimensional model by utilizing the posture data;
determining the inclination angle of the image selection frame and the corresponding position relation between the central point of the image selection frame and the target object according to the posture data;
and adjusting the target three-dimensional model to a target pose and placing it into the image selection frame by combining the inclination angle of the image selection frame, the corresponding position relation between the center point of the image selection frame and the target object, and the relative motion relation.
2. The three-dimensional scene matching interaction method according to claim 1, wherein two-dimensional images and view angle information, focal length information and depth information in the user's view at the current moment are acquired; comprising the following steps:
collecting pixel components through an image collecting module, wherein the image collecting module comprises a plurality of light sensing assemblies;
detecting light rays in the visual field, and forming a mosaic color filter array on the grid of the respective light-sensing assemblies through a three-color filter array;
performing interpolation processing on the obtained color information to obtain a green component, a red component and a blue component of each pixel point;
and making a judgment according to the user's body posture and motion, and feeding the signals into the multi-view model to obtain a two-dimensional image and visual angle information, focal length information and depth information in the user's visual field at the current moment.
3. The three-dimensional scene matching interaction method according to claim 1, wherein the two-dimensional image is matched with a three-dimensional model in a three-dimensional model library by utilizing the view angle information, the focal length information and the depth information to obtain a target three-dimensional model similar to the target object; comprising the following steps:
acquiring a geometric distance between the user and the target object through the two-dimensional image;
selecting a plurality of alternative three-dimensional models with similar corresponding contours in the three-dimensional model library by utilizing the geometric distance, the visual angle information, the focal length information and the depth of field information in combination with the two-dimensional-to-three-dimensional contour matching model;
and obtaining the target three-dimensional model from the plurality of alternative three-dimensional models through a feature matching method.
4. A three-dimensional scene matching interaction method according to claim 3, characterized in that the feature matching method comprises:
acquiring image features contained in the two-dimensional image and a plurality of model features of a plurality of alternative three-dimensional models;
mapping the image features and a plurality of model features in a unified feature space by using a feature mapping model, and obtaining a plurality of similarity values according to the distances between the image features and the model features;
and taking the candidate three-dimensional model corresponding to the model feature with the highest similarity value as the target three-dimensional model.
5. The three-dimensional scene matching interaction method according to claim 1, wherein the posture data of the user's head includes a position where the head is located, a head rotation angle, and a head rotation direction.
6. The three-dimensional scene matching interaction method according to claim 5, wherein acquiring the pose data of the user's head at the current time comprises:
acquiring the roll angle of the current head collected by the gyroscope; and meanwhile, tracking the information of the photon sensor on the head-mounted display in combination with the fixed positioner to obtain the posture data of the user's head at the current moment.
7. A three-dimensional scene matching interactive apparatus, the apparatus comprising:
the two-dimensional image acquisition unit is used for acquiring a two-dimensional image, visual angle information, focal length information and depth of field information in the visual field of the user at the current moment; the two-dimensional image includes a target object;
the target three-dimensional model acquisition unit is used for matching the two-dimensional image with a three-dimensional model in a three-dimensional model library by utilizing the visual angle information, the focal length information and the depth information to obtain a target three-dimensional model similar to the target object;
the relative motion relation determining unit is used for acquiring the posture data of the head of the user at the current moment and determining the relative motion relation between the head pose of the user and the target three-dimensional model by utilizing the posture data;
the corresponding position relation determining unit is used for determining the inclination angle of the image selection frame and the corresponding position relation between the central point of the image selection frame and the target object according to the posture data;
and the target three-dimensional model adjusting unit is used for adjusting the target three-dimensional model to a target pose and placing the target three-dimensional model into the image selection frame by combining the inclination angle of the image selection frame, the corresponding position relation between the center point of the image selection frame and the target object and the relative motion relation.
8. A three-dimensional scene matching interactive device, the device comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the three-dimensional scene matching interaction method of any of claims 1-6 according to instructions in the program code.
9. A computer readable storage medium, characterized in that the computer readable storage medium is for storing a program code for performing the three-dimensional scene matching interaction method of any of claims 1-6.
CN202310391596.1A 2023-04-13 2023-04-13 Three-dimensional scene matching interaction method, device, equipment and storage medium Active CN116643648B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310391596.1A CN116643648B (en) 2023-04-13 2023-04-13 Three-dimensional scene matching interaction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310391596.1A CN116643648B (en) 2023-04-13 2023-04-13 Three-dimensional scene matching interaction method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116643648A CN116643648A (en) 2023-08-25
CN116643648B true CN116643648B (en) 2023-12-19

Family

ID=87619386

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310391596.1A Active CN116643648B (en) 2023-04-13 2023-04-13 Three-dimensional scene matching interaction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116643648B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110147460A (en) * 2019-04-23 2019-08-20 湖北大学 Method for searching three-dimension model and device based on convolutional neural networks Yu multi-angle of view figure
EP3621036A1 (en) * 2018-09-07 2020-03-11 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for generating three-dimensional data, device, and storage medium
CN112287928A (en) * 2020-10-20 2021-01-29 深圳市慧鲤科技有限公司 Prompting method and device, electronic equipment and storage medium
CN113269729A (en) * 2021-05-10 2021-08-17 青岛理工大学 Assembly body multi-view detection method and system based on depth image contrast
KR102292478B1 (en) * 2020-11-25 2021-08-24 한국건설기술연구원 the CPTED authentication system based on virtual reality and the CPTED authentication method using the same
WO2022100379A1 (en) * 2020-11-16 2022-05-19 华南理工大学 Object attitude estimation method and system based on image and three-dimensional model, and medium
CN115393386A (en) * 2022-10-25 2022-11-25 杭州华橙软件技术有限公司 Three-dimensional scene graph generation method, device and equipment and readable storage medium
CN115423916A (en) * 2022-07-29 2022-12-02 深圳职业技术学院 XR (extended reality) technology-based immersive interactive live broadcast construction method, system and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7331088B2 (en) * 2018-09-24 2023-08-22 マジック リープ, インコーポレイテッド Method and system for 3D model sharing

Also Published As

Publication number Publication date
CN116643648A (en) 2023-08-25

Similar Documents

Publication Publication Date Title
US10674142B2 (en) Optimized object scanning using sensor fusion
CN112771539B (en) Employing three-dimensional data predicted from two-dimensional images using neural networks for 3D modeling applications
US10818029B2 (en) Multi-directional structured image array capture on a 2D graph
US10719939B2 (en) Real-time mobile device capture and generation of AR/VR content
US20210097717A1 (en) Method for detecting three-dimensional human pose information detection, electronic device and storage medium
CN102959616B (en) Interactive reality augmentation for natural interaction
Klein Visual tracking for augmented reality
CN115427758A (en) Cross reality system with accurate shared map
CN115461787A (en) Cross reality system with quick positioning
CN115380264A (en) Cross reality system for large-scale environments
KR101609486B1 (en) Using motion parallax to create 3d perception from 2d images
KR20150093831A (en) Direct interaction system for mixed reality environments
JP2023515669A (en) Systems and Methods for Depth Estimation by Learning Sparse Point Triangulation and Densification for Multiview Stereo
CN102591449A (en) Low-latency fusing of virtual and real content
CN108885342A (en) Wide Baseline Stereo for low latency rendering
CN110213491B (en) Focusing method, device and storage medium
CN107015655A (en) Museum virtual scene AR experiences eyeglass device and its implementation
CN113544748A (en) Cross reality system
CN114882106A (en) Pose determination method and device, equipment and medium
WO2019213392A1 (en) System and method for generating combined embedded multi-view interactive digital media representations
Ramirez et al. Booster: a benchmark for depth from images of specular and transparent surfaces
Maugey et al. Ftv360: a multiview 360° video dataset with calibration parameters
CN115836324A (en) Dual camera HMD with remote camera alignment
US20230267632A1 (en) Stereo matching method and image processing device performing same
CN116643648B (en) Three-dimensional scene matching interaction method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant