CN111768729A - VR scene automatic explanation method, system and storage medium - Google Patents

VR scene automatic explanation method, system and storage medium

Info

Publication number
CN111768729A
CN111768729A (application CN201910263029.1A)
Authority
CN
China
Prior art keywords
target object
picture
scene
target
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910263029.1A
Other languages
Chinese (zh)
Inventor
邓涛
周鹏
冀德
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Chuansong Technology Co ltd
Original Assignee
Beijing Chuansong Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Chuansong Technology Co ltd
Priority to CN201910263029.1A
Publication of CN111768729A
Legal status: Pending

Classifications

    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09F DISPLAYING; ADVERTISING; SIGNS; LABELS OR NAME-PLATES; SEALS
    • G09F25/00 Audible advertising
    • G09F27/00 Combined visual and audible advertising or displaying, e.g. for public address
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a VR scene automatic explanation method, system and storage medium, comprising: acquiring three-dimensional data of target objects in the VR scene to be explained and the categories corresponding to the three-dimensional data, and training a scene content identification module on the target features of the three-dimensional data; acquiring in real time the VR picture viewed by the user and inputting it to the scene content identification module to judge whether a target object exists in the VR picture; if a target object exists, playing the commentary content corresponding to the category of that target object to the user; if no target object exists, continuing to acquire in real time the VR picture viewed by the user, inputting it to the scene content identification module, and judging whether a target object exists in it. The method imposes no requirements on how the original VR scene is made and extracts VR pictures without modifying the original VR application. The invention realizes scene classification by means such as feature recognition or machine learning.

Description

VR scene automatic explanation method, system and storage medium
Technical Field
The invention relates to the fields of Virtual Reality (VR) and artificial intelligence, and in particular to a method, system and storage medium for automatically explaining VR scenes.
Background
Self-guided commentary systems are already on the market and are commonly used in large scenic spots or exhibition halls. Such systems usually acquire the user's coordinate position through a GPS positioning module or radio frequency identification technology and then play pre-recorded voice content. For example, patent 01274768.8 discloses an electronic tour guide device that locates the user's position through radio or infrared coding, and patent 200310110653.7 uses GPS positioning technology to locate the tourist and then plays the corresponding audio material. These self-guided commentary systems are mostly built for the real, physical world; whether based on radio frequency positioning or GPS positioning, the techniques are limited to the real world and are ineffective for VR scenes.
In recent years, more and more VR applications have come into public view, such as VR games, VR exhibition halls, VR education, VR scenic spots, and VR experiences in home-decoration scenarios. In such VR applications, text prompts are often placed in the VR picture to help users along, but because the prompt information is hard-coded into the program at design time, every user gets the same experience and personalization is difficult to achieve. In other VR experiences, a member of staff stands beside the user and verbally explains some of the content in the current VR picture during the experience. This kind of arrangement greatly reduces the immersion of VR.
For example, for an existing VR scene of a show home in the home-decoration field, the traditional solution is that while a client experiences the VR scene, a member of staff gives a targeted explanation to the client based on the picture mirrored to a screen. If 100 clients visit in a day, the staff member has to give the explanation 100 times, which greatly increases communication cost. An alternative is for the staff member to record the commentary in advance and control playback of the corresponding recording while the user experiences the VR scene; however, since the client may not follow a fixed sequence during the VR experience, it is difficult to control the voice playback order well.
Disclosure of Invention
To solve the above technical problems, the invention aims to provide an external self-guided commentary system applied to VR scenes, which can automatically identify targets from the real-time VR picture and play the corresponding voice content without interfering with the operation of the original VR scene. Different staff members can record their own commentary schemes for different styles or different objects in the scene.
Specifically, the invention discloses a VR scene automatic explanation method, comprising the following steps:
step 1, acquiring three-dimensional data of the target objects in the VR scene to be explained and their corresponding categories, and training a scene content identification module on the target features of the three-dimensional data;
step 2, acquiring in real time the VR picture viewed by the user, inputting the VR picture to the scene content identification module, and judging whether a target object exists in the VR picture; if so, executing step 3, otherwise continuing to execute step 2;
step 3, playing the commentary content corresponding to the category of the target object to the user, and executing step 2 after playback of the commentary content finishes.
In the VR scene automatic explanation method, the training process of the scene content identification module in step 1 specifically comprises:
for the three-dimensional data of each target object, capturing pictures of the target object from multiple angles, cropping in each picture the target image by the maximum bounding box of the target object to serve as a template of that object, obtaining the SURF feature points and their feature description vectors for each template and storing them in a dictionary, recording in the dictionary all target objects and the category corresponding to each target object, and saving the dictionary as the judgment basis of the scene content identification module.
In the VR scene automatic explanation method, the process by which the scene content identification module judges whether a target object exists in the VR picture in step 2 specifically comprises:
cropping the image in the visual center area of the VR picture as the target image, inputting all SURF feature points and feature description vectors of the target image to the scene content identification module, calculating the number of feature matches between the feature description vectors of the target image and the features of each category in the dictionary, and judging whether a target object exists in the VR picture by comparing this number with a preset value.
In the VR scene automatic explanation method, the training process of the scene content identification module in step 1 may alternatively comprise:
for each target object, capturing multiple images of the target object from multiple angles and distances to form a sample set, enlarging the sample set by rotating the sample images and/or randomly resizing them and/or dithering their colors and/or changing their brightness and contrast, dividing the sample set into a training sample set and a test sample set at a preset ratio, and generating annotation data text;
training a convolutional neural network model on the training sample set, with the number of nodes of its last fully connected layer changed to the number of target categories, verifying the model with the test sample set during training, and saving the convolutional neural network model as the scene content identification module once it passes verification;
in this case, the process by which the scene content identification module judges whether a target object exists in the VR picture in step 2 specifically comprises:
cropping the image in the visual center area of the VR picture as the target image and sending it to the scene content identification module for recognition, so as to judge whether a target object exists in the VR picture.
The invention also discloses a VR scene automatic explanation system, comprising:
module 1, which acquires three-dimensional data of the target objects in the VR scene to be explained and their corresponding categories, and trains a scene content identification module on the target features of the three-dimensional data;
module 2, which acquires in real time the VR picture viewed by the user, inputs the VR picture to the scene content identification module, and judges whether a target object exists in the VR picture; if so, module 3 is executed, otherwise module 2 continues to execute;
module 3, which plays the commentary content corresponding to the category of the target object to the user, and executes module 2 after playback of the commentary content finishes.
In the VR scene automatic explanation system, the training process of the scene content identification module in module 1 specifically comprises:
for the three-dimensional data of each target object, capturing pictures of the target object from multiple angles, cropping in each picture the target image by the maximum bounding box of the target object to serve as a template of that object, obtaining the SURF feature points and their feature description vectors for each template and storing them in a dictionary, recording in the dictionary all target objects and the category corresponding to each target object, and saving the dictionary as the judgment basis of the scene content identification module.
In the VR scene automatic explanation system, the process by which the scene content identification module judges whether a target object exists in the VR picture in module 2 specifically comprises:
cropping the image in the visual center area of the VR picture as the target image, inputting all SURF feature points and feature description vectors of the target image to the scene content identification module, calculating the number of feature matches between the feature description vectors of the target image and the features of each category in the dictionary, and judging whether a target object exists in the VR picture by comparing this number with a preset value.
In the VR scene automatic explanation system, the training process of the scene content identification module in module 1 may alternatively comprise:
for each target object, capturing multiple images of the target object from multiple angles and distances to form a sample set, enlarging the sample set by rotating the sample images and/or randomly resizing them and/or dithering their colors and/or changing their brightness and contrast, dividing the sample set into a training sample set and a test sample set at a preset ratio, and generating annotation data text;
training a convolutional neural network model on the training sample set, with the number of nodes of its last fully connected layer changed to the number of target categories, verifying the model with the test sample set during training, and saving the convolutional neural network model as the scene content identification module once it passes verification;
in this case, the process by which the scene content identification module judges whether a target object exists in the VR picture in module 2 specifically comprises:
cropping the image in the visual center area of the VR picture as the target image and sending it to the scene content identification module for recognition, so as to judge whether a target object exists in the VR picture.
The invention also discloses an implementation method for the above VR scene automatic explanation system.
The invention also discloses a storage medium storing a program for executing the above VR scene automatic explanation method.
The invention can be applied to all existing VR scenes and has the following technical advantages:
(1) External picture extraction. The method imposes no requirements on how the original VR scene is made, and extracts VR pictures without modifying the original VR application.
(2) Intelligent VR scene recognition. The invention realizes scene classification by means such as feature recognition or machine learning.
Drawings
Fig. 1 is a framework diagram of the external VR self-guided explanation system.
Detailed Description
The invention realizes a real-time VR intelligent commentary system, which uses computer vision techniques to classify and identify objects in the real-time VR scene picture and then plays the corresponding commentary recording according to the recognition result. The system comprises three modules, and the general workflow is as follows. (1) VR picture extraction module. The VR picture is captured by a suitable method; for example, the rendered VR picture can be captured by means of the OpenVR SDK or an OpenGL/DirectX rendering engine, or the picture displayed on the screen or glasses by the VR program can be collected with a video capture card. (2) VR scene identification module. Let N be the number of scenes requiring automatic explanation in the VR scene; a scene content identification module is trained by extracting and integrating the features of each target. While the system runs, the VR picture captured in real time is input to the scene classification model for scene recognition and classification. (3) Voice playing module. A corresponding recording is made for each scene in advance, and during system operation the recording matching the scene classification is retrieved and played.
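As a minimal illustration of how these three modules could be wired together (a sketch only, not part of the patent text), one possible driver loop in Python is shown below; the callables capture_frame, classify_view and play_commentary are placeholder names assumed here, and concrete sketches for each stage follow in the detailed description.

```python
# Hypothetical driver loop tying together the three modules described above:
# (1) VR picture extraction, (2) scene content identification, (3) voice playback.
# The three callables are placeholders standing in for the concrete stages.
import time


def commentary_loop(capture_frame, classify_view, play_commentary, interval_s=2.0):
    while True:
        frame = capture_frame()              # (1) grab the current VR picture
        category = classify_view(frame)      # (2) identify the object at the visual centre
        if category is not None:
            play_commentary(category)        # (3) play the recording for that category
        time.sleep(interval_s)               # frames sampled every 1-3 s, as in the text
```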
In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
The external VR scene self-guided explanation system mainly comprises three modules: a VR picture extraction module, a VR scene content recognition module and a voice playing module (see Fig. 1). The detailed workflow of the system is as follows:
1. VR picture extraction module. Most existing VR applications run on a computer; after the VR scene content is rendered on the computer, the picture is transmitted to the VR headset for display over HDMI or wirelessly. Without interfering with the original VR application, there are various ways to extract the VR picture. Common means include: a. using an HDMI splitter to obtain the picture at the current moment and sending it back to the computer through a video capture card; b. programming against the HDMI driver and capturing the VR picture directly from the computer's HDMI signal; c. for VR applications with a desktop mirroring function, taking a screenshot of the current screen to obtain the VR picture; d. for VR applications built on common frameworks (such as OpenVR), obtaining the VR picture through the corresponding SDK. The acquired VR picture is passed to the scene content identification module. Considering that in practical use a user will not switch views rapidly and frequently, the VR picture is captured at an interval of 1 to 3 seconds.
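For illustration, a minimal sketch of option c (screen capture of a desktop-mirrored VR window) is given below; the use of the mss and OpenCV libraries and the process_frame callback are assumptions made for this sketch rather than part of the patent.

```python
# Sketch of option c: periodically grab the desktop-mirrored VR picture and pass it on.
# Assumes the mss and opencv-python packages; process_frame stands in for the
# scene content identification stage.
import time

import cv2
import numpy as np
from mss import mss


def capture_loop(process_frame, interval_s=2.0):
    """Capture the primary monitor every interval_s seconds (1-3 s per the text)."""
    with mss() as grabber:
        monitor = grabber.monitors[1]                      # primary monitor
        while True:
            shot = grabber.grab(monitor)                   # raw BGRA screenshot
            frame = cv2.cvtColor(np.array(shot), cv2.COLOR_BGRA2BGR)
            process_frame(frame)                           # hand off to scene identification
            time.sleep(interval_s)
```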
2. VR scene content identification module. This module analyzes the scene content and identifies and classifies the objects in the picture, for example the objects within the visual center region. Object classification and recognition from image content is in most cases based on machine vision methods. The invention provides two solutions; suppose a given VR scene contains M target objects that need to be recognized and explained.
2.1. Object recognition scheme based on image features. The specific implementation process is as follows:
For each target object, one picture is captured from each of 8 directions, and in each picture the target image is cropped by the maximum bounding box of the target object to serve as a template of that object. SURF (Speeded-Up Robust Features) feature points and their feature description vectors are obtained for each target template. SURF is a robust local feature point detection and description algorithm; it is an improvement on the Scale-Invariant Feature Transform (SIFT) algorithm, its main advantages being higher speed and better suitability for real-time feature detection. Feature points are points of interest in the image, and feature description vectors describe the neighborhoods of the feature points. Given two feature vectors V1 and V2 taken from two images I1 and I2 respectively, if the vectors are close under the metric function, their corresponding feature points are considered a successful match.
Thus each target object corresponds to 8 feature description templates fi1, fi2, ..., fi8, which are stored in a dictionary D. The dictionary D has M elements; the key of each element is the classification Ci of an object, and the value is an array consisting of the feature description vectors of all feature points in the 8 template images under that classification:
D = {Ci : Fi}
Fi = [fi1, fi2, ..., fi8]
Here a dictionary is a key-value data structure whose basic elements are used for retrieval: the key is an index and the value is data. D is a dictionary whose key-value pairs are {Ci : Fi}, where Ci is the key and Fi is the value, with i ∈ [0, M) denoting the i-th object; Ci is the category of the i-th object, Fi is its array of feature vectors, and fi1 through fi8 are the 8 sets of feature description vectors. With M target objects in total, D contains M key-value pairs, one key-value pair {Ci : Fi} per object. Duplicate categories may occur among the M objects, but set notation is used here only to indicate which elements belong to a set, without regard to the data structure hierarchy.
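A minimal sketch of how the dictionary D = {Ci : Fi} could be built with OpenCV is given below; SURF is provided by the opencv-contrib xfeatures2d module (it may be unavailable in some builds), and the template folder layout is an assumption made for this sketch.

```python
# Sketch of building D = {Ci: Fi} from the 8 template images of each target object.
# Assumes opencv-contrib-python (for cv2.xfeatures2d.SURF_create) and the folder
# layout templates/<category>/<object_id>/*.png, both illustrative choices.
import glob

import cv2


def build_template_dictionary(template_root="templates"):
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    D = {}
    for path in sorted(glob.glob(f"{template_root}/*/*/*.png")):
        category = path.split("/")[-3]                       # Ci taken from the folder name
        image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, descriptors = surf.detectAndCompute(image, None)  # SURF feature description vectors
        if descriptors is not None:
            D.setdefault(category, []).append(descriptors)   # Fi: one array per template image
    return D
```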
For the VR picture extracted by the VR picture extraction module, the image in the visual center area is first cropped as the target image, and then all of its SURF feature points and feature description vectors Fcurr are obtained.
Using a brute-force matching algorithm, the number of feature points of Fcurr that match the features in the template images of each classification in D is calculated, and the classifications are sorted by this number in descending order. If the largest number of matched points exceeds N (N is generally 10 to 30), the classification corresponding to that image is taken as the target recognition result; otherwise, the current scene is considered to contain no target object. The time complexity of this scheme grows linearly with the number of target classes in the scene, so it is better suited to VR scenes with few target classes.
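A sketch of this brute-force matching step is shown below; the threshold N of 10 to 30 matches comes from the text, while the cv2.BFMatcher usage and the descriptor distance cutoff are illustrative assumptions.

```python
# Sketch of the brute-force matching step: count descriptor matches between the current
# view (F_curr) and every category's templates in D, then apply the threshold N (10-30).
# The L2 distance cutoff of 0.25 is an assumed value, not taken from the patent.
import cv2


def classify_view(descriptors_curr, D, match_threshold_n=20, max_distance=0.25):
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    best_category, best_count = None, 0
    for category, template_descriptor_sets in D.items():
        count = 0
        for template_descriptors in template_descriptor_sets:
            matches = matcher.match(descriptors_curr, template_descriptors)
            count += sum(1 for m in matches if m.distance < max_distance)
        if count > best_count:
            best_category, best_count = category, count
    # Return the best classification only if it clears N; otherwise no target is present.
    return best_category if best_count >= match_threshold_n else None
```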
2.2. Target recognition scheme based on machine learning.
(1) Preparation of training sample data. For each target object, no fewer than L sub-images (L is generally greater than 10,000) are cropped from different angles and different distances to serve as sample images. The sample set is then expanded to 2L by rotation, random resizing, color dithering and brightness/contrast transformation. The sample set is divided into a training sample set and a test sample set at a ratio of 5:1, and annotation data text is generated.
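A possible realization of this augmentation and 5:1 split is sketched below; the torchvision-based pipeline and the samples/<category>/ folder layout are assumptions made for the sketch.

```python
# Sketch of sample-set preparation: augment the cropped images (rotation, random resize,
# color dithering, brightness/contrast) and split them 5:1 into training and test sets.
# The torchvision pipeline and samples/<category>/*.jpg layout are illustrative assumptions.
import torch
from torchvision import datasets, transforms

augment = transforms.Compose([
    transforms.RandomRotation(15),
    transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
    transforms.ToTensor(),
])

dataset = datasets.ImageFolder("samples", transform=augment)  # labels from folder names
n_test = len(dataset) // 6                                    # 5:1 train/test ratio
n_train = len(dataset) - n_test
train_set, test_set = torch.utils.data.random_split(dataset, [n_train, n_test])
```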
(2) Training the target classification model. The training sample set is used to train a classic convolutional neural network model (a ResNet-style CNN), with the number of nodes of the last fully connected layer changed to M (consistent with the number of target categories). During training, the network parameters are adjusted to reach the best classification accuracy. The trained model is verified on the test sample set; when the target classification accuracy reaches 90% or more, the trained model is considered usable, otherwise the network structure is further adjusted and optimized.
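The sketch below illustrates the key step of resizing the final fully connected layer to M outputs, using a ResNet from torchvision as a stand-in; the particular depth (resnet18), optimizer and learning rate are assumptions, not prescribed by the patent.

```python
# Sketch of the target classification model: a ResNet whose last fully connected layer
# is replaced so it has M output nodes (one per target category), then trained as usual.
# resnet18, Adam and the learning rate are illustrative choices made for this sketch.
import torch
import torch.nn as nn
from torchvision import models


def build_classifier(num_categories_m, learning_rate=1e-3):
    model = models.resnet18()                                      # no pretrained weights
    model.fc = nn.Linear(model.fc.in_features, num_categories_m)   # last FC layer -> M nodes
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    criterion = nn.CrossEntropyLoss()
    return model, optimizer, criterion


def train_one_epoch(model, loader, optimizer, criterion):
    model.train()
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```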
(3) Target classification and recognition. Each captured VR frame is fed into the trained target classification model for recognition, and the classification is output.
Compared with the object recognition scheme based on image features, this scheme supports scenes with more classifications, and its computational complexity does not grow as the number of classifications increases.
3. Voice playing module.
For each category of object, the user may record a piece of speech material as the commentary voice. The voices of all categories are stored in a specified directory, and playback by the voice module is triggered when a target object in the VR scene stays in the user's visual center for more than 2 seconds.
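A small sketch of this dwell-time trigger is shown below; the playsound library and the voices/<category>.mp3 naming are assumptions made for illustration.

```python
# Sketch of the playback trigger: play the recording for a category only after it has
# stayed at the user's visual centre for more than 2 seconds, and avoid re-triggering.
# The playsound package and the voices/<category>.mp3 layout are illustrative assumptions.
import time

from playsound import playsound


class CommentaryPlayer:
    def __init__(self, voice_dir="voices", dwell_seconds=2.0):
        self.voice_dir = voice_dir
        self.dwell_seconds = dwell_seconds
        self.current_category = None
        self.seen_since = None

    def update(self, category):
        """Call once per recognised frame with the detected category (or None)."""
        now = time.monotonic()
        if category != self.current_category:
            self.current_category, self.seen_since = category, now
            return
        if category is not None and now - self.seen_since >= self.dwell_seconds:
            playsound(f"{self.voice_dir}/{category}.mp3")   # blocking playback
            self.seen_since = float("inf")                  # play once per dwell
```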
For an existing VR application, without interfering with the operation of the original application, the VR picture can be conveniently extracted, content recognition can be performed on the picture, and personalized voice content can then be played according to the recognition result. The scheme can be applied to virtual home decoration or a virtual exhibition hall: staff can record different voice content for different user groups, and when a user experiences the VR scene, the system plays personalized commentary content according to information such as the user's age, gender and identity, achieving a personalized marketing effect.
The following are system embodiments corresponding to the above method embodiments, and they can be implemented in cooperation with the above embodiments. The relevant technical details mentioned in the above embodiments remain valid in these embodiments and are not repeated here in order to reduce duplication. Correspondingly, the relevant technical details mentioned in these embodiments can also be applied to the above embodiments.
The invention also discloses a VR scene automatic explanation system, comprising:
module 1, which acquires three-dimensional data of the target objects in the VR scene to be explained and their corresponding categories, and trains a scene content identification module on the target features of the three-dimensional data;
module 2, which acquires in real time the VR picture viewed by the user, inputs the VR picture to the scene content identification module, and judges whether a target object exists in the VR picture; if so, module 3 is executed, otherwise module 2 continues to execute;
module 3, which plays the commentary content corresponding to the category of the target object to the user, and executes module 2 after playback of the commentary content finishes.
In the VR scene automatic explanation system, the training process of the scene content identification module in module 1 specifically comprises:
for the three-dimensional data of each target object, capturing pictures of the target object from multiple angles, cropping in each picture the target image by the maximum bounding box of the target object to serve as a template of that object, obtaining the SURF feature points and their feature description vectors for each template and storing them in a dictionary, recording in the dictionary all target objects and the category corresponding to each target object, and saving the dictionary as the judgment basis of the scene content identification module.
In the VR scene automatic explanation system, the process by which the scene content identification module judges whether a target object exists in the VR picture in module 2 specifically comprises:
cropping the image in the visual center area of the VR picture as the target image, inputting all SURF feature points and feature description vectors of the target image to the scene content identification module, calculating the number of feature matches between the feature description vectors of the target image and the features of each category in the dictionary, and judging whether a target object exists in the VR picture by comparing this number with a preset value.
In the VR scene automatic explanation system, the training process of the scene content identification module in module 1 may alternatively comprise:
for each target object, capturing multiple images of the target object from multiple angles and distances to form a sample set, enlarging the sample set by rotating the sample images and/or randomly resizing them and/or dithering their colors and/or changing their brightness and contrast, dividing the sample set into a training sample set and a test sample set at a preset ratio, and generating annotation data text;
training a convolutional neural network model on the training sample set, with the number of nodes of its last fully connected layer changed to the number of target categories, verifying the model with the test sample set during training, and saving the convolutional neural network model as the scene content identification module once it passes verification;
in this case, the process by which the scene content identification module judges whether a target object exists in the VR picture in module 2 specifically comprises:
cropping the image in the visual center area of the VR picture as the target image and sending it to the scene content identification module for recognition, so as to judge whether a target object exists in the VR picture.
The invention also discloses an implementation method for the above VR scene automatic explanation system.
The invention also discloses a storage medium storing a program for executing the above VR scene automatic explanation method.

Claims (10)

1. A VR scene automatic explanation method, characterized by comprising the following steps:
step 1, acquiring three-dimensional data of the target objects in the VR scene to be explained and their corresponding categories, and training a scene content identification module on the target features of the three-dimensional data;
step 2, acquiring in real time the VR picture viewed by the user, inputting the VR picture to the scene content identification module, and judging whether a target object exists in the VR picture; if so, executing step 3, otherwise continuing to execute step 2;
step 3, playing the commentary content corresponding to the category of the target object to the user, and executing step 2 after playback of the commentary content finishes.
2. The VR scene automatic explanation method of claim 1, wherein the training process of the scene content identification module in step 1 specifically comprises:
for the three-dimensional data of each target object, capturing pictures of the target object from multiple angles, cropping in each picture the target image by the maximum bounding box of the target object to serve as a template of that object, obtaining the SURF feature points and their feature description vectors for each template and storing them in a dictionary, recording in the dictionary all target objects and the category corresponding to each target object, and saving the dictionary as the judgment basis of the scene content identification module.
3. The VR scene automatic explanation method of claim 2, wherein the process by which the scene content identification module judges whether a target object exists in the VR picture in step 2 specifically comprises:
cropping the image in the visual center area of the VR picture as the target image, inputting all SURF feature points and feature description vectors of the target image to the scene content identification module, calculating the number of feature matches between the feature description vectors of the target image and the features of each category in the dictionary, and judging whether a target object exists in the VR picture by comparing this number with a preset value.
4. The VR scene automatic explanation method of claim 1, wherein the training process of the scene content identification module in step 1 specifically comprises:
for each target object, capturing multiple images of the target object from multiple angles and distances to form a sample set, enlarging the sample set by rotating the sample images and/or randomly resizing them and/or dithering their colors and/or changing their brightness and contrast, dividing the sample set into a training sample set and a test sample set at a preset ratio, and generating annotation data text;
training a convolutional neural network model on the training sample set, with the number of nodes of its last fully connected layer changed to the number of target categories, verifying the model with the test sample set during training, and saving the convolutional neural network model as the scene content identification module once it passes verification;
and the process by which the scene content identification module judges whether a target object exists in the VR picture in step 2 specifically comprises:
cropping the image in the visual center area of the VR picture as the target image and sending it to the scene content identification module for recognition, so as to judge whether a target object exists in the VR picture.
5. A VR scene automatic explanation system, characterized by comprising:
module 1, which acquires three-dimensional data of the target objects in the VR scene to be explained and their corresponding categories, and trains a scene content identification module on the target features of the three-dimensional data;
module 2, which acquires in real time the VR picture viewed by the user, inputs the VR picture to the scene content identification module, and judges whether a target object exists in the VR picture; if so, module 3 is executed, otherwise module 2 continues to execute;
module 3, which plays the commentary content corresponding to the category of the target object to the user, and executes module 2 after playback of the commentary content finishes.
6. The VR scene automatic explanation system of claim 5, wherein the training process of the scene content identification module in module 1 specifically comprises:
for the three-dimensional data of each target object, capturing pictures of the target object from multiple angles, cropping in each picture the target image by the maximum bounding box of the target object to serve as a template of that object, obtaining the SURF feature points and their feature description vectors for each template and storing them in a dictionary, recording in the dictionary all target objects and the category corresponding to each target object, and saving the dictionary as the judgment basis of the scene content identification module.
7. The VR scene automatic explanation system of claim 6, wherein the process by which the scene content identification module judges whether a target object exists in the VR picture in module 2 specifically comprises:
cropping the image in the visual center area of the VR picture as the target image, inputting all SURF feature points and feature description vectors of the target image to the scene content identification module, calculating the number of feature matches between the feature description vectors of the target image and the features of each category in the dictionary, and judging whether a target object exists in the VR picture by comparing this number with a preset value.
8. The VR scene automatic explanation system of claim 5, wherein the training process of the scene content identification module in module 1 specifically comprises:
for each target object, capturing multiple images of the target object from multiple angles and distances to form a sample set, enlarging the sample set by rotating the sample images and/or randomly resizing them and/or dithering their colors and/or changing their brightness and contrast, dividing the sample set into a training sample set and a test sample set at a preset ratio, and generating annotation data text;
training a convolutional neural network model on the training sample set, with the number of nodes of its last fully connected layer changed to the number of target categories, verifying the model with the test sample set during training, and saving the convolutional neural network model as the scene content identification module once it passes verification;
and the process by which the scene content identification module judges whether a target object exists in the VR picture in module 2 specifically comprises:
cropping the image in the visual center area of the VR picture as the target image and sending it to the scene content identification module for recognition, so as to judge whether a target object exists in the VR picture.
9. An implementation method for the VR scene automatic explanation system of any one of claims 5 to 8.
10. A storage medium storing a program for executing the VR scene automatic explanation method of any one of claims 1 to 4.
CN201910263029.1A 2019-04-02 2019-04-02 VR scene automatic explanation method, system and storage medium Pending CN111768729A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910263029.1A CN111768729A (en) 2019-04-02 2019-04-02 VR scene automatic explanation method, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910263029.1A CN111768729A (en) 2019-04-02 2019-04-02 VR scene automatic explanation method, system and storage medium

Publications (1)

Publication Number Publication Date
CN111768729A (en) 2020-10-13

Family

ID=72718525

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910263029.1A Pending CN111768729A (en) 2019-04-02 2019-04-02 VR scene automatic explanation method, system and storage medium

Country Status (1)

Country Link
CN (1) CN111768729A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113209640A (en) * 2021-07-09 2021-08-06 腾讯科技(深圳)有限公司 Comment generation method, device, equipment and computer-readable storage medium
CN113449122A (en) * 2021-07-09 2021-09-28 广州浩传网络科技有限公司 Method and device for generating explanation content of three-dimensional scene graph
CN113449122B (en) * 2021-07-09 2023-01-17 广州浩传网络科技有限公司 Method and device for generating explanation content of three-dimensional scene graph


Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201013