CN114842384B - 6G-oriented haptic modal signal reconstruction method

6G-oriented haptic modal signal reconstruction method

Info

Publication number
CN114842384B
Authority
CN
China
Prior art keywords
signal
haptic
video
module
modal
Legal status
Active
Application number
CN202210476817.0A
Other languages
Chinese (zh)
Other versions
CN114842384A (en)
Inventor
周亮
李昂
李沛林
陈顺
曹宇
楼婧蕾
倪守祥
陈亚男
陈建新
魏昕
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202210476817.0A priority Critical patent/CN114842384B/en
Publication of CN114842384A publication Critical patent/CN114842384A/en
Application granted granted Critical
Publication of CN114842384B publication Critical patent/CN114842384B/en

Classifications

    • G06V20/40: Scenes; scene-specific elements in video content (G06V: image or video recognition or understanding)
    • G06N3/045: Combinations of networks (G06N3: computing arrangements based on biological models; neural networks)
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • Y02T10/40: Engine management systems (Y02T: climate change mitigation technologies related to transportation)


Abstract

The invention discloses a 6G-oriented haptic modal signal reconstruction method, which comprises: acquiring data samples and constructing a data set containing video and haptic modal signals; utilizing the semantic correlation between the two modal signals to construct, based on deep learning, a cross-modal signal reconstruction model driven by internal semantic correlation; and training the cross-modal signal reconstruction model with the data set until the quality of the reconstructed signal meets the requirement or the deviation can no longer be reduced. In the invention, a multimodal dataset, VisTouch, containing video and haptic signals is constructed for 6G cross-modal application scenarios; video modal signals with semantic correlation are reconstructed into haptic modal signals based on deep learning; and, to improve the signal reconstruction quality, two loss functions, an adversarial loss and a mean square error loss, are used as objective functions, training is carried out on VisTouch, and the accuracy of the reconstruction method is verified.

Description

6G-oriented haptic modal signal reconstruction method
Technical Field
The invention relates to the technical field of cross-modal communication, and in particular to a 6G-oriented haptic modal signal reconstruction method.
Background
In the 6G era, conventional multimedia applications centered on audio and video can no longer satisfy users' demand for immersive experience, so new sensory interactions, such as touch, must be introduced into novel multimedia applications to bring users a truly immersive experience. However, introducing new modal signals poses a great challenge to existing multimedia systems: under the requirement of cooperative transmission of multi-dimensional sensory information, the required maximum network throughput is expected to multiply. Therefore, to balance user experience and communication quality, a cross-modal signal reconstruction scheme is urgently needed to reduce the amount of transmitted data and thereby support 6G immersive multimedia applications.
Research has shown that when multimodal applications combine haptic signals with traditional audio and video signals, users obtain a more immersive experience through touch and interactive behavior. For multimodal applications in the 6G era, an audio-visual-haptic cross-modal communication architecture has been proposed that fully mines the correlations between different modal signals to address three key scientific problems: efficient haptic signal coding, heterogeneous code-stream transmission, and modal information reconstruction. A cross-modal communication architecture empowered by artificial intelligence has further been proposed, which uses techniques such as reinforcement learning and transfer learning to tackle the technical challenges of cross-modal communication. Since signal transmission and reception are accompanied by losses of varying degrees, discovering the inherent correlations among voice, video and haptic signals and using them to reconstruct one modal signal accurately and in real time is one of the focal points of 6G cross-modal communication research, and is also regarded as a key technology for greatly improving the user's immersive experience. In potential 6G immersive application scenarios (such as immersive cloud XR, holographic communication and sensory interconnection), cross-modal reconstruction can recover the haptic signal of an object from its existing video and audio signals, and the newly generated haptic signal can in turn reconstruct the original audio and video signals at super resolution, largely satisfying the communication needs of people, objects and environments, while the millisecond-level latency of 6G provides users with a better connection experience.
For deep learning models that realize cross-modal reconstruction, performance depends on the quality and scale of the data set: in principle, the larger the data volume and the higher the annotation quality, the more closely a deep model can approach or even surpass human performance; for example, image models such as AlexNet, VGG and ResNet trained on the large-scale ImageNet data set approach human recognition accuracy. At present, audio-visual data sets are plentiful, so existing work has mainly focused on exploring the semantic relations between audio and video with deep models. To meet the 6G immersive-experience requirement, a large-scale, high-quality audio-visual-haptic data set is urgently needed to help deep learning complete tasks such as cross-modal coding, transmission and signal processing. In addition, most research has concentrated on restoration and reconstruction between audio and video, while research on reconstructing haptic signals from audio and video is still in its infancy. Meanwhile, because haptic signals collected by different sensors differ in structure and content, how to characterize haptic signals of different forms semantically and how to design a universal, robust cross-modal signal reconstruction framework have become difficulties in realizing 6G cross-modal applications.
Disclosure of Invention
This section is intended to outline some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. Some simplifications or omissions may be made in this section, in the abstract and in the title, and these may not be used to limit the scope of the invention.
The present invention has been made in view of the problems described above.
It is therefore an object of the invention to provide a 6G-oriented haptic modal signal reconstruction method that solves the problem that existing video modal signals cannot be converted into haptic modal signals.
To solve the above technical problems, the invention provides the following technical scheme: a 6G-oriented haptic modal signal reconstruction method, comprising:
S1: acquiring data samples, and constructing a data set containing video and haptic modal signals;
S2: utilizing the semantic correlation between the two modal signals, constructing, based on deep learning, a cross-modal signal reconstruction model driven by internal semantic correlation;
S3: training the cross-modal signal reconstruction model with the data set until the quality of the reconstructed signal meets the requirement or the deviation can no longer be reduced.
As a preferred scheme of the 6G-oriented haptic modal signal reconstruction method of the invention: the data sample collection comprises selecting samples to be collected and classifying them; selecting acquisition devices and synchronizing them; and setting an acquisition mode so that the video signals and haptic signals of different samples in different states are acquired by the acquisition devices.
As a preferred scheme of the 6G-oriented haptic modal signal reconstruction method of the invention: the cross-modal signal reconstruction model comprises a feature extraction module, a signal reconstruction module, a signal discrimination module and a loss optimization module, wherein the feature extraction module processes the video frames of the video signal and extracts video semantic features; the video semantic features are input into the signal reconstruction module, and the reconstructed haptic signal is obtained after reconstruction processing; the real haptic signal and the reconstructed haptic signal are input into the signal discrimination module to discriminate real from fake; and the mean square error loss and the generative adversarial loss of the reconstructed haptic signal against the real haptic signal are calculated, the loss values being used to update the module parameters by gradient descent through a back-propagation algorithm, so as to optimize the model to generate reconstructed signals of higher accuracy.
As a preferred scheme of the 6G-oriented haptic modal signal reconstruction method of the invention: in the feature extraction module, in the 3D-CNN-based semantic feature extraction for the video signal, each video frame is first scaled and cropped; the video frames are then input into a 3D ResNet50, and the video semantic features are output after multi-layer 3D convolution.
As a preferred scheme of the 6G-oriented haptic modal signal reconstruction method of the invention: after acquisition, the real haptic signal must be preprocessed, which comprises, for a haptic signal in time-series form, obtaining its spectrum with the STFT and separating the real and imaginary parts of the complex numbers in the complex matrix to obtain the real haptic spectrum S.
As a preferred scheme of the 6G-oriented haptic modal signal reconstruction method of the invention: the signal reconstruction module reconstructs the spectrum of the haptic signal from the output video semantic features through the processing of deconvolution layers, batch normalization layers and activation functions, and obtains the reconstructed haptic signal in the time domain through the inverse Fourier transform.
As a preferred scheme of the 6G-oriented haptic modal signal reconstruction method of the invention: in the reconstruction processing, the input video semantic features are processed sequentially by three deconvolution groups, each comprising a deconvolution layer, a batch normalization layer and a ReLU activation function, and then by a convolution group comprising a deconvolution layer, a batch normalization layer and a Tanh activation function. The deconvolution layer is expressed as k = (k_h, k_w), p = (p_h, p_w), s, where k = (k_h, k_w) denotes the convolution kernel size, p = (p_h, p_w) denotes the zero padding, and s denotes the sliding stride of the convolution kernel; the ReLU activation function is y = max(0, x), and the Tanh activation function is y = (e^x - e^(-x)) / (e^x + e^(-x)), where x denotes the output of the batch normalization layer in the deconvolution group.
As a preferred scheme of the 6G-oriented haptic modal signal reconstruction method of the invention: the signal discrimination module comprises two convolution groups, a fully connected layer and a Sigmoid activation function, where each convolution group comprises a 3×3 convolution layer, a batch normalization layer, a ReLU activation function and a max-pooling layer.
The Sigmoid activation function is y = 1 / (1 + e^(-x)); taking the output of the fully connected layer as the input x, it outputs the probability that a signal is a real signal.
As a preferred scheme of the 6G-oriented haptic modal signal reconstruction method of the invention: the loss optimization module optimizes the parameters of the feature extraction module, the signal reconstruction module and the signal discrimination module using a combination of a generative adversarial loss function and a mean square error loss function, wherein
the generative adversarial loss function is:
L_adv = E_{S~P_data(S)}[log D(S)] + E_{F_R~P_data(F_R)}[log(1 - D(G(F_R)))]
where E(·) is the expectation function, G(·) and D(·) denote the haptic signal generation network and the haptic signal discrimination network respectively, and P_data(·) denotes the data distribution.
The mean square error loss function is expressed as:
L_MSE = (1/n) Σ_{i=1}^{n} (s_i - ŝ_i)^2
where s_i and ŝ_i denote the elements at the i-th position of the real haptic spectrum S and the reconstructed haptic spectrum Ŝ respectively, and n denotes the number of elements in the spectrum.
As a preferred scheme of the 6G-oriented haptic modal signal reconstruction method of the invention: training uses stochastic gradient descent for 70 epochs, with an initial learning rate of 0.001 continuously adjusted by a cosine annealing scheduler, and a batch size of 6.
The invention has the following beneficial effects:
A multimodal dataset, VisTouch, containing video and haptic signals is constructed for 6G cross-modal application scenarios; video modal signals with semantic correlation are reconstructed into haptic modal signals based on deep learning; and, to improve the signal reconstruction quality, two loss functions, an adversarial loss and a mean square error loss, are used as objective functions, training is carried out on VisTouch, and the accuracy of the reconstruction method is verified.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person skilled in the art could obtain other drawings from them without inventive effort. In the drawings:
Fig. 1 is the VisTouch data acquisition diagram of the 6G-oriented haptic modal signal reconstruction method of the invention.
Fig. 2 is a diagram of the video-assisted haptic signal reconstruction model of the 6G-oriented haptic modal signal reconstruction method of the invention.
Detailed Description
So that the above objects, features and advantages of the present invention can be more readily understood, specific embodiments of the invention are described in detail below with reference to the appended drawings.
In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention; however, the invention may also be practiced in ways other than those described here, and persons skilled in the art will appreciate that the invention is not limited to the specific embodiments disclosed below.
Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification do not necessarily all refer to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
Further, where embodiments of the invention are described in detail, any cross-sectional views of device structures are schematic only, may be partially enlarged and are not drawn to scale, and should not limit the scope of protection of the invention; in actual fabrication, the three-dimensional dimensions of length, width and depth should be included.
Example 1
Referring to figs. 1-2, a first embodiment of the present invention provides a 6G-oriented haptic modal signal reconstruction method, the reconstruction method comprising:
S1: acquiring data samples, and constructing the VisTouch dataset containing video and haptic modal signals.
Specifically, the data sample collection includes the following steps:
S11: selecting samples to be collected, and classifying them.
Materials that are common in daily life and of high practical value are selected and classified, 47 material categories in total, to serve as the sample categories of the constructed VisTouch dataset, as shown in Table 1. During sample collection it can be observed that samples of the same material may differ in color due to dyeing, processing and so on; for example, glass divides into ordinary glass and quartz glass, and into colored glass and transparent glass, which poses a certain challenge to cross-modal information processing. For this reason, multiple colors are collected for each sample type where possible: for synthetic textiles, samples in four colors such as red, yellow, blue and white; for glass, colored, transparent and frosted samples; this reduces the influence of color on the experimental results.
Table 1: Sample categories contained in the VisTouch dataset
S12: selecting acquisition devices, and synchronizing them.
To collect the video and haptic signals simultaneously, a suitable camera and haptic sensor must be selected. Table 2 gives the specific parameters (sampling rate, resolution, etc.) of the acquisition devices used for the VisTouch dataset.
Table 2: Acquisition device information
S13: setting the acquisition mode, and acquiring the video signals and haptic signals of different samples in different states with the acquisition devices.
For haptic data acquisition, a manipulator is controlled to slide over and touch the various materials; the sliding friction force generated between the fingertip and the material during the sliding touch is recorded as the haptic signal, the video signal is captured with a high-definition camera, and the two signals are synchronized by timestamp.
In addition, to ensure accurate, low-noise acquisition of the haptic signal, two measures are taken: (1) the sample is placed on a desktop, and the sensor mounted at the end of the manipulator is given a constant driving force directed downward, perpendicular to the desktop; (2) the collected materials are sheet-shaped, which keeps the driving force normal to the contact surface, thereby reducing the influence of the material's shape on the acquired signal.
Three sliding-touch trajectories are used (linear, curved and zigzag) together with three constant normal driving forces (3 N, 6 N and 9 N); crossing the forces with the trajectories yields 9 sliding modes in total (e.g., zigzag sliding touch under a 3 N driving force).
S2: utilizing the semantic correlation between the two modal signals, a cross-modal signal reconstruction model driven by internal semantic correlation is constructed based on deep learning.
Furthermore, the cross-modal signal reconstruction model comprises a feature extraction module, a signal reconstruction module, a signal discrimination module and a loss optimization module.
S21: the feature extraction module extracts video semantic features after processing video frames of the video signal.
Specifically, in extracting semantic features from the video signal with a 3D CNN (three-dimensional convolutional neural network), each video frame is first scaled and cropped; the video frames are then input into a 3D ResNet50 (three-dimensional residual network), and the video semantic features F_R are output after multi-layer 3D convolution. Thanks to its residual design, the 3D ResNet50 converges quickly during learning, avoids the vanishing-gradient problem, and balances model size against accuracy.
Let the input video signal be a 5-dimensional tensor I ∈ R^{N×T×C×H×W}, where N is the batch size, T is the number of video frames, C is the number of image channels (C = 3 for RGB images), and H and W are the height and width of the image. Each video frame image is scaled and cropped so that the image size is unified to 224×224, i.e., H = W = 224. Next, I is input into the 3D ResNet50 and, after multi-layer 3D convolution, a feature map F ∈ R^{N'×T'×C'×H'×W'} is output; for the 3D ResNet50, T' = 2, C' = 2048, and H' = W' = 7. To ease the processing of the subsequent haptic signal reconstruction module, the reconstruction method reshapes F into a four-dimensional tensor F_R ∈ R^{N'×T'C'×H'×W'}, where T'C' = 2×2048 = 4096; F_R represents the video semantic features.
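To make these shape conventions concrete, the following PyTorch sketch traces a video batch through a stand-in backbone. The `backbone` here is a hypothetical placeholder (the patent's 3D ResNet50 layer list is not reproduced in this text); only the input and output shapes match the description above.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the 3D ResNet50 backbone: it only needs to map
# (N, C, T, H, W) video clips to a (N, C'=2048, T'=2, H'=7, W'=7) feature map.
backbone = nn.Sequential(
    nn.Conv3d(3, 2048, kernel_size=3, stride=(8, 32, 32), padding=1),  # toy reduction
    nn.AdaptiveAvgPool3d((2, 7, 7)),                                   # force T'=2, H'=W'=7
)

N, T, C, H, W = 6, 16, 3, 224, 224          # batch size 6, as in the training setup
I = torch.randn(N, T, C, H, W)              # input clip, I ∈ R^{N×T×C×H×W}
x = I.permute(0, 2, 1, 3, 4)                # PyTorch Conv3d expects (N, C, T, H, W)
F = backbone(x)                             # F: (6, 2048, 2, 7, 7)
F_R = F.permute(0, 2, 1, 3, 4).reshape(N, 2 * 2048, 7, 7)  # fold T' into channels
print(F_R.shape)                            # torch.Size([6, 4096, 7, 7])
```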
For the haptic signals, the real haptic signal must be preprocessed after acquisition: for a haptic signal in time-series form, its spectrum is obtained with the STFT (short-time Fourier transform). In the STFT, the sampling frequency is set to 1000 Hz and the window width to 50, yielding a complex matrix of size 26×41; separating the real and imaginary parts of the complex numbers gives a real haptic spectrum S of size 2×26×41.
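A minimal preprocessing sketch that reproduces these sizes, assuming a one-second (1000-sample) signal and the SciPy default of 50% window overlap (the patent does not state the overlap):

```python
import numpy as np
from scipy.signal import stft

fs = 1000                                # sampling frequency (Hz), as in the patent
x = np.random.randn(fs)                  # placeholder 1-second haptic time series
f, t, Zxx = stft(x, fs=fs, nperseg=50)   # window width 50 -> 26 frequency bins
print(Zxx.shape)                         # (26, 41): 41 frames for a 1000-sample signal
S = np.stack([Zxx.real, Zxx.imag])       # separate real and imaginary parts
print(S.shape)                           # (2, 26, 41): the real haptic spectrum S
```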
S22: inputting the video semantic features into a signal reconstruction module, and obtaining a reconstructed touch signal after reconstruction processing.
Specifically, this embodiment uses combinations of deconvolution, batch normalization and activation functions to realize the cross-modal signal mapping from small feature maps to large ones and from high-level semantic features to the target domain. From the output video semantic features, the spectrum of the haptic signal is reconstructed through the processing of deconvolution layers, batch normalization layers and activation functions, and the haptic modal signal in the time domain is obtained through the inverse Fourier transform.
The reconstruction module has five layers of sub-modules: the first layer is the input layer; the second to fourth layers are combinations of a deconvolution layer, a batch normalization layer and an activation function, used to reconstruct the height and width of the spectrogram; and the fifth layer is a convolution group used to reconstruct the channel dimension of the spectrogram.
In the reconstruction process, the input video semantic features are processed sequentially by the three deconvolution groups (the second to fourth layers), each comprising a deconvolution layer, a batch normalization layer and a ReLU activation function, as shown in Table 3, and then by the convolution group (the fifth layer).
The deconvolution layer is expressed as k = (k_h, k_w), p = (p_h, p_w), s. The activation functions enhance the nonlinear representation capability of the module: the ReLU function is placed at the end of each of the three deconvolution groups, with x denoting the output of the batch normalization layer in the group, and the Tanh function is placed at the end of the whole module to generate a reconstructed haptic spectrum consistent with the distribution range of the real spectrum.
Table 3: Structure of the signal reconstruction module
where k = (k_h, k_w) denotes the convolution kernel size, p = (p_h, p_w) denotes the zero padding, and s denotes the sliding stride of the convolution kernel; the ReLU activation function is y = max(0, x) and the Tanh activation function is y = (e^x - e^(-x)) / (e^x + e^(-x)), where x denotes the output of the batch normalization layer in the deconvolution group.
The signal reconstruction module in this embodiment specifically includes the following parameters:
Table 4: Network parameters of the haptic signal generation network (batch dimension N omitted)
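Because Tables 3 and 4 are not reproduced in this text, the exact kernel sizes, paddings and strides are unknown. The PyTorch sketch below therefore uses illustrative, hypothetical parameters, chosen only so that the 4096×7×7 feature map F_R is upsampled to the 2×26×41 spectrum described above; asymmetric (k_h, k_w) kernels are what let the square 7×7 map grow into the rectangular 26×41 spectrum.

```python
import torch
import torch.nn as nn

def deconv_group(c_in, c_out, k, s, p):
    """Deconvolution group: deconvolution + batch normalization + ReLU."""
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, kernel_size=k, stride=s, padding=p),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

# Hypothetical layer parameters: the spatial path runs 7x7 -> 7x11 -> 13x21 -> 26x41.
reconstructor = nn.Sequential(
    deconv_group(4096, 512, k=(3, 5), s=(1, 1), p=(1, 0)),      # -> (512, 7, 11)
    deconv_group(512, 128, k=(3, 3), s=(2, 2), p=(1, 1)),       # -> (128, 13, 21)
    deconv_group(128, 32, k=(4, 3), s=(2, 2), p=(1, 1)),        # -> (32, 26, 41)
    # Fifth layer: convolution group reconstructing the channel dimension,
    # ending in Tanh to match the real spectrum's value range.
    nn.ConvTranspose2d(32, 2, kernel_size=3, stride=1, padding=1),  # -> (2, 26, 41)
    nn.BatchNorm2d(2),
    nn.Tanh(),
)

F_R = torch.randn(6, 4096, 7, 7)       # video semantic features from step S21
S_hat = reconstructor(F_R)             # reconstructed haptic spectrum
print(S_hat.shape)                     # torch.Size([6, 2, 26, 41])
```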
S23: inputting the real haptic signal and the reconstructed haptic signal into the signal discrimination module to discriminate real from fake.
Specifically, the signal discrimination module has two convolution groups, a fully connected layer and a Sigmoid activation function, where each convolution group comprises a 3×3 convolution layer, a batch normalization layer, a ReLU activation function and a max-pooling layer.
The Sigmoid activation function is y = 1 / (1 + e^(-x)); taking the output of the fully connected layer as the function input x, it outputs the probability that a signal is a real signal.
Further, the real haptic spectrum S and the reconstructed haptic spectrum Ŝ generated by the signal reconstruction module are taken as the inputs of the signal discrimination module; after processing by the two convolution groups, the discrimination vectors v and v̂ corresponding to S and Ŝ are obtained. Then v and v̂ are each input into the fully connected layer and the Sigmoid function, which output the probabilities that S and Ŝ are real signals. During network training, S should be judged real as far as possible, i.e., with probability as close to 1 as possible, while Ŝ should be judged fake as far as possible, i.e., with probability as close to 0 as possible, realizing binary real/fake discrimination.
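A sketch of such a discriminator follows; the channel widths 16 and 32 and the flattening before the fully connected layer are assumptions, since the patent fixes only the structure (two convolution groups, a fully connected layer and a Sigmoid output):

```python
import torch
import torch.nn as nn

def conv_group(c_in, c_out):
    """Convolution group: 3x3 convolution + batch norm + ReLU + 2x2 max pooling."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(conv_group(2, 16), conv_group(16, 32))
        self.fc = nn.Linear(32 * 6 * 10, 1)   # (2,26,41) -> (16,13,20) -> (32,6,10)
        self.sigmoid = nn.Sigmoid()

    def forward(self, spec):
        v = self.features(spec).flatten(1)    # discrimination vector
        return self.sigmoid(self.fc(v))       # probability the spectrum is real

D = Discriminator()
print(D(torch.randn(6, 2, 26, 41)).shape)     # torch.Size([6, 1])
```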
S24: calculating the mean square error loss and the generative adversarial loss of the reconstructed haptic signal against the real haptic signal, and using the loss values to update the module parameters by gradient descent through a back-propagation algorithm, so as to optimize the model to generate reconstructed signals of higher accuracy.
Specifically, the loss optimization module optimizes the parameters of the feature extraction module, the signal reconstruction module and the signal discrimination module using a combination of a generative adversarial loss function and a mean square error loss function, wherein
the generative adversarial loss function is:
L_adv = E_{S~P_data(S)}[log D(S)] + E_{F_R~P_data(F_R)}[log(1 - D(G(F_R)))]
where E(·) is the expectation function, G(·) and D(·) denote the haptic signal generation network and the haptic signal discrimination network respectively, and P_data(·) denotes the data distribution.
The mean square error loss function is expressed as:
L_MSE = (1/n) Σ_{i=1}^{n} (s_i - ŝ_i)^2
where s_i and ŝ_i denote the elements at the i-th position of the real haptic spectrum S and the reconstructed haptic spectrum Ŝ respectively, and n denotes the number of elements in the spectrum.
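A compact sketch of how the two objectives could be combined in PyTorch. The non-saturating generator form and the weight `lam` between the adversarial and MSE terms are assumptions; the patent states only that the two losses are used together:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()   # implements the log terms of the adversarial loss
mse = nn.MSELoss()   # mean square error over all spectrum elements

def discriminator_loss(D, S_real, S_fake):
    # D maximizes log D(S) + log(1 - D(G(F_R))): label real as 1, fake as 0.
    real = D(S_real)
    fake = D(S_fake.detach())                 # do not backprop into the generator
    return bce(real, torch.ones_like(real)) + bce(fake, torch.zeros_like(fake))

def generator_loss(D, S_real, S_fake, lam=1.0):
    # G tries to fool D while also matching the real spectrum element-wise;
    # lam is a hypothetical weighting between the two terms.
    fake = D(S_fake)
    return bce(fake, torch.ones_like(fake)) + lam * mse(S_fake, S_real)
```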
The evaluation module judges whether the reconstructed signal is consistent with the real signal. Meanwhile, during training, the deviation between the reconstructed signal and the real signal is back-propagated as gradients to adjust the training parameters of the feature extraction module and the reconstruction module, until the quality of the reconstructed signal meets the requirement or the deviation can no longer be reduced. The whole reconstruction model thus mines the inherent semantic correlations among the multimodal signals and finally generates accurate, low-noise reconstructed signals.
S3: training the cross-modal signal reconstruction model with the data set until the quality of the reconstructed signal meets the requirement or the deviation can no longer be reduced.
Specifically, training uses stochastic gradient descent for 70 epochs, with an initial learning rate of 0.001 continuously adjusted by a cosine annealing scheduler, and a batch size of 6. Further, the 3D CNN input size is 224×224, and the whole model is implemented with the PyTorch deep learning framework. For hardware, a single RTX 2080 Ti graphics card is used for model training until the two loss functions converge simultaneously.
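The optimizer setup might look as follows; stepping the scheduler once per epoch and setting T_max to the number of epochs are assumptions, since the patent specifies only SGD, 70 epochs, an initial learning rate of 0.001 and cosine annealing:

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR

# model = the composed feature extractor + reconstructor (generator side);
# a separate optimizer for the discriminator would be set up the same way.
model = torch.nn.Linear(8, 8)          # placeholder module for illustration
EPOCHS = 70
optimizer = SGD(model.parameters(), lr=0.001)
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS)

for epoch in range(EPOCHS):
    # ... iterate over VisTouch mini-batches of size 6, compute the two losses,
    # call loss.backward() and optimizer.step() here ...
    scheduler.step()                   # cosine-anneal the learning rate
```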
Example 2
To verify and explain the technical effect of the reconstruction method: since this method is the first to perform haptic reconstruction on the VisTouch dataset, there is no published reference model, so this embodiment ablates the proposed video-assisted haptic reconstruction model to obtain the following two models as comparison baselines:
Model 1: the model structure is unchanged, and the model is trained with the generative adversarial loss function only;
Model 2: the haptic signal discrimination network is removed, and the model is trained with the mean square error loss function only.
After determining the comparison baselines, evaluation indices are needed to test the output results; this embodiment uses two evaluation indices, mean absolute error (MAE) and accuracy (ACC).
MAE: since haptic signals are represented as time series, this index starts from the signal itself. Let the real haptic time signal be T and the reconstructed haptic time signal be T̂; with sample capacity M, the MAE is calculated as:
MAE = (1/M) Σ_{i=1}^{M} |T_i - T̂_i|
MAE is used to evaluate the absolute deviation of the reconstructed signal from the true signal.
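As a sketch (assuming both signals are given as equal-length tensors):

```python
import torch

def mae(T_real: torch.Tensor, T_hat: torch.Tensor) -> torch.Tensor:
    """Mean absolute error between real and reconstructed haptic time signals."""
    return (T_real - T_hat).abs().mean()
```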
ACC: a sample-category classifier is first pre-trained on the real signals; after training, the reconstructed signals are input and it is checked whether the classifier's category decision on a reconstructed signal agrees with the real sample category, from which the accuracy ACC is computed. In this embodiment the classifier is implemented as a multi-layer perceptron.
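A sketch of the ACC computation; the hidden width of 256 and the flattened-spectrum input are assumptions (the patent states only that the classifier is a multi-layer perceptron, and the dataset has 47 categories):

```python
import torch
import torch.nn as nn

# Hypothetical MLP classifier over flattened haptic spectra (2*26*41 inputs),
# assumed already pre-trained on real signals as described above.
classifier = nn.Sequential(
    nn.Linear(2 * 26 * 41, 256), nn.ReLU(),
    nn.Linear(256, 47),                  # 47 material categories in VisTouch
)

def acc(recon_specs: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of reconstructed spectra the classifier assigns to the true category."""
    preds = classifier(recon_specs.flatten(1)).argmax(dim=1)
    return (preds == labels).float().mean().item()
```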
The statistical results of the model comparison experiment are shown in Table 5. It can be seen that, owing to its structure and loss-function design, the reconstruction accuracy of the proposed reconstruction model is clearly higher than that of models 1 and 2.
Table 5: Model comparison experiment results
It should be noted that the above embodiments are only intended to illustrate, not to limit, the technical solution of the present invention. Although the invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that the technical solution of the invention may be modified or equivalently substituted without departing from its spirit and scope, and such modifications are intended to be covered by the scope of the claims of the invention.

Claims (6)

1. A 6G-oriented haptic modal signal reconstruction method, characterized by comprising:
Acquiring data samples, and constructing a data set containing video and haptic modal signals;
Utilizing the semantic correlation between the two modal signals, constructing, based on deep learning, a cross-modal signal reconstruction model driven by internal semantic correlation;
The cross-modal signal reconstruction model comprises a feature extraction module, a signal reconstruction module, a signal discrimination module and a loss optimization module, wherein,
The feature extraction module processes the video frames of the video signal and extracts video semantic features;
The video semantic features are input into the signal reconstruction module, and the reconstructed haptic signal is obtained after reconstruction processing;
The real haptic signal and the reconstructed haptic signal are input into the signal discrimination module to discriminate real from fake;
The mean square error loss and the generative adversarial loss of the reconstructed haptic signal against the real haptic signal are calculated, and the loss values are used to update the module parameters by gradient descent through a back-propagation algorithm, so as to optimize the model to generate reconstructed signals of higher accuracy;
The processing in the signal reconstruction module comprises:
Reconstructing the spectrum of the haptic signal from the output video semantic features through the processing of a deconvolution layer, a batch normalization layer and an activation function, and obtaining the reconstructed haptic signal in the time domain through the inverse Fourier transform;
In the reconstruction process,
The input video semantic features are processed sequentially by three deconvolution groups, each deconvolution group comprising a deconvolution layer, a batch normalization layer and a ReLU activation function;
The output is then produced after processing by a convolution group, the convolution group comprising a deconvolution layer, a batch normalization layer and a Tanh activation function;
The deconvolution layer is expressed as: k = (k_h, k_w), p = (p_h, p_w), s;
where k = (k_h, k_w) denotes the convolution kernel size, p = (p_h, p_w) denotes the zero padding, s denotes the sliding stride of the convolution kernel, the ReLU activation function is y = max(0, x), the Tanh activation function is y = (e^x - e^(-x)) / (e^x + e^(-x)), and x is the output of the batch normalization layer in the deconvolution group;
The signal discrimination module comprises two convolution groups, a fully connected layer and a Sigmoid activation function, wherein each convolution group comprises a 3×3 convolution layer, a batch normalization layer, a ReLU activation function and a max-pooling layer;
The Sigmoid activation function is y = 1 / (1 + e^(-x)); taking the output of the fully connected layer as the function input x, the probability that the signal is a real signal is output;
Training the cross-modal signal reconstruction model with the data set until the quality of the reconstructed signal meets the requirement or the deviation can no longer be reduced.
2. The 6G-oriented haptic modal signal reconstruction method of claim 1, wherein the acquiring of data samples comprises:
Selecting samples to be collected, and classifying them;
Selecting acquisition devices, and synchronizing them;
Setting an acquisition mode, and acquiring the video signals and haptic signals of different samples in different states with the acquisition devices.
3. The 6G-oriented haptic modal signal reconstruction method of claim 2, wherein, in the feature extraction module:
In the 3D-CNN-based semantic feature extraction for the video signal, each video frame is first scaled and cropped; the video frames are then input into a 3D ResNet50, and the video semantic features are output after multi-layer 3D convolution.
4. The 6G-oriented haptic modal signal reconstruction method of claim 3, wherein the real haptic signal is preprocessed after acquisition, comprising:
For the haptic signal in time-series form, obtaining its spectrum using the STFT, and separating the real and imaginary parts of the complex numbers in the complex matrix to obtain the real haptic spectrum S.
5. The 6G-oriented haptic modal signal reconstruction method of claim 4, wherein the loss optimization module optimizes the parameters of the feature extraction module, the signal reconstruction module and the signal discrimination module using a combination of a generative adversarial loss function and a mean square error loss function, wherein
The generative adversarial loss function is:
L_adv = E_{S~P_data(S)}[log D(S)] + E_{F_R~P_data(F_R)}[log(1 - D(G(F_R)))]
where E(·) is the expectation function, G(·) and D(·) denote the haptic signal generation network and the haptic signal discrimination network respectively, and P_data(·) denotes the data distribution;
The mean square error loss function is expressed as:
L_MSE = (1/n) Σ_{i=1}^{n} (s_i - ŝ_i)^2
where s_i and ŝ_i denote the elements at the i-th position of the real haptic spectrum S and the reconstructed haptic spectrum Ŝ respectively, and n denotes the number of elements in the spectrum.
6. The 6G-oriented haptic modal signal reconstruction method of claim 5, wherein the training uses stochastic gradient descent for 70 epochs, with an initial learning rate of 0.001 continuously adjusted by a cosine annealing scheduler, and a batch size of 6.
CN202210476817.0A 2022-04-30 2022-04-30 6G-oriented haptic modal signal reconstruction method Active CN114842384B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210476817.0A (CN114842384B, en) | 2022-04-30 | 2022-04-30 | 6G-oriented haptic modal signal reconstruction method

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202210476817.0A (CN114842384B, en) | 2022-04-30 | 2022-04-30 | 6G-oriented haptic modal signal reconstruction method

Publications (2)

Publication Number | Publication Date
CN114842384A (en) | 2022-08-02
CN114842384B (en) | 2024-05-31

Family

ID=82568112

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202210476817.0A (Active, CN114842384B, en) | 6G-oriented haptic modal signal reconstruction method | 2022-04-30 | 2022-04-30

Country Status (1)

Country Link
CN (1) CN114842384B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115905838A (en) * 2022-11-18 2023-04-04 南京邮电大学 Audio-visual auxiliary fine-grained tactile signal reconstruction method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627482A (en) * 2021-07-09 2021-11-09 南京邮电大学 Cross-mode image generation method and device based on audio-tactile signal fusion
CN113628294A (en) * 2021-07-09 2021-11-09 南京邮电大学 Image reconstruction method and device for cross-modal communication system
CN113642604A (en) * 2021-07-09 2021-11-12 南京邮电大学 Audio and video auxiliary tactile signal reconstruction method based on cloud edge cooperation
WO2022047625A1 (en) * 2020-09-01 2022-03-10 深圳先进技术研究院 Image processing method and system, and computer storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461235B (en) * 2020-03-31 2021-07-16 合肥工业大学 Audio and video data processing method and system, electronic equipment and storage medium


Also Published As

Publication number Publication date
CN114842384A (en) 2022-08-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant