WO2022228174A1 - A rendering method and related device - Google Patents
A rendering method and related device
- Publication number
- WO2022228174A1 (PCT/CN2022/087353)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio track
- sound
- sounding
- sound source
- rendering
- Prior art date
Links
- 238000009877 rendering Methods 0.000 title claims abstract description 422
- 238000000034 method Methods 0.000 title claims abstract description 162
- 238000000926 separation method Methods 0.000 claims description 82
- 238000012549 training Methods 0.000 claims description 75
- 230000006870 function Effects 0.000 claims description 56
- 230000015654 memory Effects 0.000 claims description 46
- 230000001755 vocal effect Effects 0.000 claims description 36
- 238000003860 storage Methods 0.000 claims description 16
- 238000012546 transfer Methods 0.000 claims description 16
- 230000001133 acceleration Effects 0.000 claims description 14
- 238000013519 translation Methods 0.000 claims description 11
- 230000002194 synthesizing effect Effects 0.000 claims description 7
- 230000015572 biosynthetic process Effects 0.000 claims 1
- 238000003786 synthesis reaction Methods 0.000 claims 1
- 230000000694 effects Effects 0.000 abstract description 33
- 238000004519 manufacturing process Methods 0.000 abstract description 19
- 230000002452 interceptive effect Effects 0.000 description 53
- 238000013528 artificial neural network Methods 0.000 description 51
- 238000013527 convolutional neural network Methods 0.000 description 32
- 230000033001 locomotion Effects 0.000 description 32
- 238000011176 pooling Methods 0.000 description 32
- 230000008569 process Effects 0.000 description 31
- 239000011159 matrix material Substances 0.000 description 29
- 230000004044 response Effects 0.000 description 27
- 238000010586 diagram Methods 0.000 description 25
- 230000003993 interaction Effects 0.000 description 22
- 238000012545 processing Methods 0.000 description 22
- 239000013598 vector Substances 0.000 description 18
- 238000012360 testing method Methods 0.000 description 16
- 238000004364 calculation method Methods 0.000 description 15
- 238000001228 spectrum Methods 0.000 description 14
- 230000001537 neural effect Effects 0.000 description 12
- 238000004891 communication Methods 0.000 description 10
- 238000007654 immersion Methods 0.000 description 9
- 238000001514 detection method Methods 0.000 description 8
- 230000004913 activation Effects 0.000 description 6
- 238000005516 engineering process Methods 0.000 description 6
- 210000002569 neuron Anatomy 0.000 description 6
- 241001342895 Chorus Species 0.000 description 5
- 230000008878 coupling Effects 0.000 description 5
- 238000010168 coupling process Methods 0.000 description 5
- 238000005859 coupling reaction Methods 0.000 description 5
- 230000008451 emotion Effects 0.000 description 5
- 230000000306 recurrent effect Effects 0.000 description 5
- 230000008054 signal transmission Effects 0.000 description 5
- 230000005236 sound signal Effects 0.000 description 5
- 230000005540 biological transmission Effects 0.000 description 4
- 238000013500 data storage Methods 0.000 description 4
- 239000000284 extract Substances 0.000 description 4
- 238000013507 mapping Methods 0.000 description 4
- 238000010606 normalization Methods 0.000 description 4
- 238000007781 pre-processing Methods 0.000 description 4
- 230000009466 transformation Effects 0.000 description 4
- 239000000872 buffer Substances 0.000 description 3
- 230000008859 change Effects 0.000 description 3
- 238000004590 computer program Methods 0.000 description 3
- 238000013480 data collection Methods 0.000 description 3
- 238000007726 management method Methods 0.000 description 3
- 230000009471 action Effects 0.000 description 2
- 230000003190 augmentative effect Effects 0.000 description 2
- 230000008901 benefit Effects 0.000 description 2
- 230000001413 cellular effect Effects 0.000 description 2
- 238000006243 chemical reaction Methods 0.000 description 2
- 238000013135 deep learning Methods 0.000 description 2
- 210000005069 ears Anatomy 0.000 description 2
- 230000005484 gravity Effects 0.000 description 2
- 230000004807 localization Effects 0.000 description 2
- 230000007774 longterm Effects 0.000 description 2
- 238000010295 mobile communication Methods 0.000 description 2
- 238000003062 neural network model Methods 0.000 description 2
- 230000000630 rising effect Effects 0.000 description 2
- 230000001360 synchronised effect Effects 0.000 description 2
- 230000007704 transition Effects 0.000 description 2
- 238000013475 authorization Methods 0.000 description 1
- 238000005452 bending Methods 0.000 description 1
- 210000004027 cell Anatomy 0.000 description 1
- 239000003086 colorant Substances 0.000 description 1
- 238000012937 correction Methods 0.000 description 1
- 238000012217 deletion Methods 0.000 description 1
- 230000037430 deletion Effects 0.000 description 1
- 238000007599 discharging Methods 0.000 description 1
- 230000002708 enhancing effect Effects 0.000 description 1
- 238000012905 input function Methods 0.000 description 1
- 230000001788 irregular Effects 0.000 description 1
- 239000004973 liquid crystal related substance Substances 0.000 description 1
- 238000005259 measurement Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 238000004091 panning Methods 0.000 description 1
- 238000003672 processing method Methods 0.000 description 1
- 230000002787 reinforcement Effects 0.000 description 1
- 238000010079 rubber tapping Methods 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 239000007787 solid Substances 0.000 description 1
- 238000010897 surface acoustic wave method Methods 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S7/00—Indicating arrangements; Control arrangements, e.g. balance control
- H04S7/30—Control circuits for electronic adaptation of the sound field
- H04S7/302—Electronic adaptation of stereophonic sound system to listener position or orientation
- H04S7/303—Tracking of listener position or orientation
- H04S7/304—For headphones
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0499—Feedforward networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/092—Reinforcement learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4394—Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44012—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving rendering scenes according to scene graphs, e.g. MPEG-4 scene graphs
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/81—Monomedia components thereof
- H04N21/8106—Monomedia components thereof involving special audio data, e.g. different tracks for different languages
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S3/00—Systems employing more than two channels, e.g. quadraphonic
- H04S3/008—Systems employing more than two channels, e.g. quadraphonic in which the audio signals are in digital form, i.e. employing more than two discrete digital channels
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S7/00—Indicating arrangements; Control arrangements, e.g. balance control
- H04S7/30—Control circuits for electronic adaptation of the sound field
- H04S7/301—Automatic calibration of stereophonic sound system, e.g. with test microphone
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S7/00—Indicating arrangements; Control arrangements, e.g. balance control
- H04S7/30—Control circuits for electronic adaptation of the sound field
- H04S7/302—Electronic adaptation of stereophonic sound system to listener position or orientation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S7/00—Indicating arrangements; Control arrangements, e.g. balance control
- H04S7/40—Visual indication of stereophonic sound image
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2400/00—Details of stereophonic systems covered by H04S but not provided for in its groups
- H04S2400/01—Multi-channel, i.e. more than two input channels, sound reproduction with two speakers wherein the multi-channel information is substantially preserved
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2400/00—Details of stereophonic systems covered by H04S but not provided for in its groups
- H04S2400/11—Positioning of individual sound objects, e.g. moving airplane, within a sound field
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2420/00—Techniques used stereophonic systems covered by H04S but not provided for in its groups
- H04S2420/01—Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]
Definitions
- the present application relates to the field of audio applications, and in particular, to a rendering method and related equipment.
- audio and video playback devices can use processing technologies such as the head related transfer function (HRTF) to process the audio and video data to be played.
- the embodiments of the present application provide a rendering method, which can improve the stereoscopic sense of space of the first single-object audio track corresponding to the first sound-emitting object in a multimedia file and provide users with an immersive stereoscopic sound effect.
- a first aspect of the embodiments of the present application provides a rendering method, which can be applied to scenarios such as music production and film and television production.
- the method can be executed by a rendering device, or by a component of the rendering device (for example, a processor, a chip, or a chip system).
- the method includes: acquiring a first single-object audio track based on a multimedia file, where the first single-object audio track corresponds to a first sound-emitting object; determining a first sound source position of the first sound-emitting object based on reference information, where the reference information includes reference position information and/or media information of the multimedia file, and the reference position information is used to indicate the first sound source position; and spatially rendering the first single-object audio track based on the first sound source position to obtain the rendered first single-object audio track.
- the first single-object audio track is obtained based on the multimedia file, and the first single-object audio track corresponds to the first sound-emitting object; the first sound source position of the first sound-emitting object is determined based on the reference information, and the first single-object audio track is spatially rendered based on the first sound source position to obtain the rendered first single-object audio track.
- in this way, the stereoscopic sense of space of the first single-object audio track corresponding to the first sound-emitting object in the multimedia file can be improved, and the user can be provided with immersive stereoscopic sound effects.
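- To make the claimed flow concrete, the minimal, runnable sketch below strings the three steps together (acquire a single-object track, determine its sound source position, spatially render it). The "separation" and rendering steps are toy placeholders standing in for the patent's separation network and HRTF processing; the panning law and distance attenuation are assumptions.

```python
import numpy as np

def acquire_single_object_track(original_stereo):
    """Placeholder 'separation': treat the centre (mono) content as the object track."""
    return original_stereo.mean(axis=0)

def determine_source_position(reference_info):
    """Placeholder: read an (azimuth_deg, elevation_deg, distance_m) triple."""
    return reference_info.get("position", (30.0, 0.0, 2.0))

def spatial_render(mono, position, max_itd_samples=32):
    """Toy binaural rendering using only interaural level and time differences."""
    azimuth_deg, _, distance = position
    pan = np.sin(np.radians(azimuth_deg))            # -1 (left) .. +1 (right)
    gain = 1.0 / max(distance, 0.1)                  # crude distance attenuation
    left = gain * np.sqrt((1.0 - pan) / 2.0) * mono  # constant-power panning
    right = gain * np.sqrt((1.0 + pan) / 2.0) * mono
    delay = int(round(abs(pan) * max_itd_samples))   # delay the far ear slightly
    if pan > 0 and delay:                            # source on the right: left ear later
        left = np.concatenate([np.zeros(delay), left[:-delay]])
    elif pan < 0 and delay:
        right = np.concatenate([np.zeros(delay), right[:-delay]])
    return np.stack([left, right])

# Example: a 1-second 440 Hz tone placed 30 degrees to the right, 2 m away.
sr = 48000
t = np.arange(sr) / sr
stereo_in = np.stack([np.sin(2 * np.pi * 440 * t)] * 2)
obj = acquire_single_object_track(stereo_in)
pos = determine_source_position({"position": (30.0, 0.0, 2.0)})
out = spatial_render(obj, pos)
print(out.shape)
```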
- the media information in the above steps includes at least one of: the text to be displayed in the multimedia file, the image to be displayed in the multimedia file, the music feature of the music to be played in the multimedia file, and the sound source type corresponding to the first sound-emitting object.
- the rendering device can set the orientation and dynamics of the extracted specific sound-emitting object according to the music features of the music, so that the 3D rendering of the audio track corresponding to the sound-emitting object is more natural and the artistry is better reflected. If the media information includes text, images, and the like, the 3D immersion is rendered in the earphone or external playback environment so that the sound moves with the picture, allowing the user to obtain the best sound effect experience. In addition, if the media information includes video, tracking the sound-emitting object in the video and rendering the audio track corresponding to the sound-emitting object throughout the entire video can also be used in professional mixing post-production to improve the mixer's work efficiency.
- the reference position information in the above steps includes first position information of a sensor or second position information selected by the user.
- when the reference position information includes the first position information of the sensor, the user can perform dynamic rendering on the selected sound-emitting object in real time or in post-production through the orientation or position provided by the sensor.
- such control can give the sound-emitting object a specific spatial orientation and motion, realize interactive creation between the user and the audio, and provide the user with a new experience.
- when the reference position information includes the second position information selected by the user, the user can control the selected sound-emitting object by dragging on the interface and perform dynamic rendering in real time or in post-production, giving it a specific spatial orientation and motion. This realizes interactive creation between the user and the audio and provides the user with a new experience.
- the above steps further include: determining the type of the playback device, where the playback device is used to play the target audio track and the target audio track is acquired based on the rendered first single-object audio track. Performing spatial rendering on the first single-object audio track based on the first sound source position includes: performing spatial rendering on the first single-object audio track based on the first sound source position and the type of the playback device.
- in this way, the type of the playback device is considered when spatially rendering the audio track.
- different playback device types can correspond to different spatial rendering formulas, so that the spatial effect of the rendered first single-object audio track played by the playback device is more realistic and accurate.
- the reference information in the above steps includes media information. When the media information includes an image and the image includes the first sound-emitting object, determining the first sound source position of the first sound-emitting object based on the reference information includes: determining third position information of the first sound-emitting object in the image, where the third position information includes the two-dimensional coordinates and the depth of the first sound-emitting object in the image; and acquiring the first sound source position based on the third position information.
- in this way, 3D immersion is rendered in the earphone or external playback environment, and the sound follows the picture, allowing users to obtain the best sound effect experience.
- the technique of tracking the selected sound-emitting object and rendering its audio throughout the entire video can also be applied in professional mixing post-production, improving the mixer's work efficiency.
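- A hedged sketch of turning the "third position information" (the object's 2-D pixel coordinates plus its depth in the image) into a 3-D sound source position: a simple pinhole-camera back-projection is assumed here, since the patent text does not spell out the exact conversion; the field-of-view value is illustrative.

```python
import numpy as np

def pixel_to_sound_source(u, v, depth_m, image_w, image_h, h_fov_deg=60.0):
    """Back-project pixel (u, v) with depth into viewer-centred coordinates."""
    # Focal length in pixels derived from the assumed horizontal field of view.
    f = (image_w / 2.0) / np.tan(np.radians(h_fov_deg) / 2.0)
    x = (u - image_w / 2.0) / f * depth_m       # right of the optical axis
    y = (image_h / 2.0 - v) / f * depth_m       # above the optical axis
    z = depth_m                                  # in front of the viewer
    azimuth = np.degrees(np.arctan2(x, z))
    elevation = np.degrees(np.arctan2(y, np.hypot(x, z)))
    distance = float(np.sqrt(x * x + y * y + z * z))
    return azimuth, elevation, distance

# Example: a singer detected at pixel (1500, 400) in a 1920x1080 frame, 3 m away.
print(pixel_to_sound_source(1500, 400, 3.0, 1920, 1080))
```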
- the reference information in the above steps includes media information. When the media information includes the music feature of the music to be played in the multimedia file, determining the first sound source position of the first sound-emitting object based on the reference information includes: determining the first sound source position based on an association relationship and the music feature, where the association relationship is used to represent the association between the music feature and the first sound source position.
- the orientation and dynamics of the extracted specific sound-emitting objects are set according to the musical characteristics of the music, so that the 3D rendering is more natural and the artistry is better reflected.
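- One way such an "association relationship" could be represented is a lookup table from music features to target positions, as in the small sketch below. The feature names follow the music structure / emotion / singing mode categories mentioned later in this summary, but the concrete positions are illustrative assumptions, not values from the patent.

```python
ASSOCIATION = {
    # (structure, emotion, singing_mode): (azimuth_deg, elevation_deg, distance_m)
    ("chorus", "uplifting", "solo"):   (0.0, 20.0, 1.5),   # lift the lead voice
    ("verse",  "calm",      "solo"):   (0.0, 0.0, 2.0),
    ("chorus", "uplifting", "chorus"): (45.0, 10.0, 2.5),  # spread backing vocals
}

def position_from_music_features(structure, emotion, singing_mode):
    # Fall back to a front-centre position when no association is defined.
    return ASSOCIATION.get((structure, emotion, singing_mode), (0.0, 0.0, 2.0))

print(position_from_music_features("chorus", "uplifting", "solo"))
```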
- the reference information in the above steps includes media information. When the media information includes text to be displayed in the multimedia file and the text contains position-related text, determining the first sound source position of the first sound-emitting object based on the reference information includes: identifying the position text; and determining the first sound source position based on the position text.
- the reference information in the above steps includes reference position information. When the reference position information includes the first position information, the method further includes: acquiring the first position information, where the first position information includes the first attitude angle of the sensor and the distance between the sensor and the playback device. Determining the first sound source position of the first sound-emitting object based on the reference information includes: converting the first position information into the first sound source position.
- the user can perform dynamic rendering on the selected sound-emitting object in real time or in post-production through the orientation (i.e., the first attitude angle) provided by the sensor.
- the sensor acts like a laser pointer, and the direction of the laser indicates the position of the sound source.
- such control can give the sound-emitting object a specific spatial orientation and motion, realize interactive creation between the user and the audio, and provide the user with a new experience.
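- A hedged sketch of the "laser pointer" conversion described above: the sensor's attitude angle defines a ray from the listener, and the sound source is placed on that ray at the sensor-to-playback-device distance. The yaw/pitch convention and axis orientation are assumptions.

```python
import numpy as np

def attitude_to_source_position(yaw_deg, pitch_deg, distance_m):
    """Place the source along the sensor's pointing direction at the given distance."""
    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    x = distance_m * np.cos(pitch) * np.sin(yaw)   # right
    y = distance_m * np.sin(pitch)                 # up
    z = distance_m * np.cos(pitch) * np.cos(yaw)   # front
    return np.array([x, y, z])

# Sensor pointing 40 degrees to the left and 10 degrees up, playback device 2.5 m away.
print(attitude_to_source_position(-40.0, 10.0, 2.5))
```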
- the reference information in the above steps includes reference position information. When the reference position information includes the first position information, the method further includes: acquiring the first position information, where the first position information includes the second attitude angle of the sensor and the acceleration of the sensor. Determining the first sound source position of the first sound-emitting object based on the reference information includes: converting the first position information into the first sound source position.
- the user can use the actual position information of the sensor as the sound source position to control the sound-emitting object and perform dynamic rendering in real time or in post-production, so that the movement trajectory of the sound-emitting object can be completely controlled by the user, which increases editing flexibility.
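- For the variant that uses the second attitude angle and the sensor's acceleration, one plausible conversion is dead reckoning: rotate each acceleration sample into the room frame using the attitude, then integrate twice to obtain a movement trajectory that drives the sound source position. This is a hedged sketch under that assumption; gravity compensation and drift correction are omitted.

```python
import numpy as np

def integrate_trajectory(accel_samples, yaw_deg_samples, dt=0.01):
    """accel_samples: N x 3 accelerations in the sensor frame (m/s^2);
    yaw_deg_samples: N yaw angles used to rotate into the room frame."""
    position = np.zeros(3)
    velocity = np.zeros(3)
    trajectory = []
    for accel, yaw_deg in zip(accel_samples, yaw_deg_samples):
        yaw = np.radians(yaw_deg)
        rot = np.array([[np.cos(yaw), 0.0, np.sin(yaw)],
                        [0.0,         1.0, 0.0        ],
                        [-np.sin(yaw), 0.0, np.cos(yaw)]])
        velocity += rot @ np.asarray(accel) * dt   # integrate acceleration once
        position += velocity * dt                  # and again to get position
        trajectory.append(position.copy())         # successive sound source positions
    return np.array(trajectory)

# A short push forward and back, sampled at 100 Hz.
accels = [(0.0, 0.0, 2.0)] * 50 + [(0.0, 0.0, -2.0)] * 50
print(integrate_trajectory(accels, [0.0] * 100)[-1])
```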
- the reference information in the above steps includes reference position information. When the reference position information includes the second position information, the method further includes: providing a spherical view for the user to select from, where the center of the spherical view is the position of the user and the radius of the spherical view is the distance between the user's position and the playback device; and acquiring the second position information selected by the user in the spherical view. Determining the first sound source position of the first sound-emitting object based on the reference information includes: converting the second position information into the first sound source position.
- the user can select the second position information in the spherical view (for example, by clicking, dragging, or sliding) to control the selected sound-emitting object and perform dynamic rendering in real time or in post-production, giving it a specific spatial orientation and motion. This realizes interactive creation between the user and the audio and provides the user with a new experience.
- in the above steps, acquiring the first single-object audio track based on the multimedia file includes: separating the first single-object audio track from the original audio track in the multimedia file, where the original audio track is obtained by synthesizing at least the first single-object audio track and a second single-object audio track, and the second single-object audio track corresponds to a second sound-emitting object.
- because the original audio track is composed of at least the first single-object audio track and the second single-object audio track, spatial rendering can be constructed for a specific sound-emitting object, which enhances the user's ability to edit audio, can be applied to object-based production of music and film and television works, and increases the user's control over and playability of the music.
- separating the first single-object audio track from the original audio track in the multimedia file includes: separating the first single-object audio track from the original audio track through a trained separation network.
- because the first single-object audio track is separated through the separation network, the specific sound-emitting object can be spatially rendered, which enhances the user's ability to edit audio, can be applied to object-based production of music and film and television works, and increases the user's control over and playability of the music.
- the trained separation network in the above steps is obtained by training the separation network with the training data as its input and with the goal of making the value of the loss function less than a first threshold. The training data includes a training audio track, which is obtained by synthesizing at least an initial third single-object audio track and an initial fourth single-object audio track. The initial third single-object audio track corresponds to a third sound-emitting object, the initial fourth single-object audio track corresponds to a fourth sound-emitting object, the third sound-emitting object is of the same type as the first sound-emitting object, and the second sound-emitting object is of the same type as the fourth sound-emitting object. The output of the separation network includes the separated third single-object audio track, and the loss function is used to indicate the difference between the separated third single-object audio track and the initial third single-object audio track.
- the separation network is trained with the goal of reducing the value of the loss function, that is, the difference between the third single-object audio track output by the separation network and the initial third single-object audio track is continuously reduced, which makes the single-object audio track separated by the separation network more accurate.
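- A hedged sketch of this training setup: a training track is synthesized from a "third" and a "fourth" single-object track, the network receives the mixture, and training stops once the loss (the difference between the separated and the initial third track) drops below a first threshold. The tiny Conv1d model, MSE loss, and synthetic tones are illustrative choices, not the patent's architecture, loss, or data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
t = torch.linspace(0, 1, 4096)
third_track = torch.sin(2 * torch.pi * 220 * t)           # e.g. a vocal-like tone
fourth_track = 0.5 * torch.sin(2 * torch.pi * 55 * t)     # e.g. a bass-like tone
training_track = (third_track + fourth_track).view(1, 1, -1)   # synthesized mixture
target = third_track.view(1, 1, -1)                             # initial third track

separation_net = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=65, padding=32), nn.ReLU(),
    nn.Conv1d(16, 1, kernel_size=65, padding=32),
)
optimizer = torch.optim.Adam(separation_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
first_threshold = 1e-3

for step in range(2000):
    separated_third = separation_net(training_track)
    loss = loss_fn(separated_third, target)   # difference to the initial third track
    if loss.item() < first_threshold:         # training goal reached
        break
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(f"stopped at step {step}, loss {loss.item():.5f}")
```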
- the above steps of performing spatial rendering on the first single-object audio track based on the first sound source position and the type of the playback device include: if the playback device is an earphone, obtaining the rendered first single-object audio track through the following formula;
- or, if the playback devices are N external playback devices, obtaining the rendered first single-object audio track through the following formula;
- ⁇ i is the calibrator to calibrate the ith external speaker
- ⁇ i is the inclination angle obtained by the calibrator to calibrate the ith external device
- ri is the distance between the ith external device and the calibrator
- N is a positive integer
- i is a positive integer
- i ⁇ N the position of the first sound source is within a tetrahedron formed by N external devices.
- the above steps further include: obtaining the target audio track based on the rendered first single-object audio track, the original audio track in the multimedia file, and the type of the playback device; and sending the target audio track to the playback device, where the playback device is used to play the target audio track.
- in this way, the target audio track can be obtained, which facilitates saving the rendered audio track, facilitates subsequent playback, and reduces repeated rendering operations.
- the above steps of obtaining the target audio track based on the rendered first single-object audio track, the original audio track in the multimedia file, and the type of the playback device include: if the type of the playback device is headphones, obtaining the target audio track through the following formula:
- where i indicates the left or right channel; X_i(t) is the original audio track at time t; a_s(t) is the adjustment coefficient of the first sound-emitting object at time t; h_{i,s}(t) is the head related transfer function (HRTF) filter coefficient of the left or right channel corresponding to the first sound-emitting object at time t, and is related to the first sound source position; o_s(t) is the first single-object audio track at time t; τ is the integral term; S_1 is the set of sound-emitting objects that need to be replaced in the original audio track, and if the first sound-emitting object replaces a sound-emitting object in the original audio track, S_1 is an empty set; S_2 is the set of sound-emitting objects added in the target audio track compared with the original audio track, and if the first sound-emitting object is a sound-emitting object copied from the original audio track, S_2 is an empty set.
- in this way, when the playback device is an earphone, the technical problem of how to obtain the target audio track is solved. It is convenient to save the rendered audio track, which facilitates subsequent playback and reduces repeated rendering operations.
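- A hedged sketch of assembling the headphone target track along these lines: each single-object track is convolved with a per-ear filter (a toy stand-in for the HRTF filter h_{i,s}), scaled by the adjustment coefficient a_s, then the objects in S_1 are removed from and the objects in S_2 are added to the original stereo track. The filters and coefficients used here are illustrative, not the patent's values.

```python
import numpy as np

def binauralize(object_track, h_left, h_right, adjustment=1.0):
    left = adjustment * np.convolve(object_track, h_left, mode="same")
    right = adjustment * np.convolve(object_track, h_right, mode="same")
    return np.stack([left, right])

def build_target_track(original_stereo, removed_objects, added_objects):
    """removed_objects / added_objects: lists of (track, h_left, h_right, a_s)."""
    target = original_stereo.astype(float).copy()
    for track, hl, hr, a in removed_objects:   # S_1: replaced sound-emitting objects
        target -= binauralize(track, hl, hr, a)
    for track, hl, hr, a in added_objects:     # S_2: newly added / copied objects
        target += binauralize(track, hl, hr, a)
    return target

# Toy example: move a 440 Hz "vocal" from the centre of the mix to a new position.
sr = 48000
t = np.arange(sr) / sr
vocal = np.sin(2 * np.pi * 440 * t)
original = np.stack([vocal, vocal])                     # vocal mixed to the centre
h_old = (np.array([1.0]), np.array([1.0]))              # how it sat in the mix
h_new = (np.array([0.9, 0.1]), np.array([0.2, 0.05]))   # assumed new per-ear filters
target = build_target_track(original,
                            removed_objects=[(vocal, *h_old, 1.0)],
                            added_objects=[(vocal, *h_new, 1.0)])
print(target.shape)
```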
- the above steps of obtaining the target audio track based on the rendered first single-object audio track, the original audio track in the multimedia file, and the type of the playback device include: if the type of the playback device is N external playback devices, obtaining the target audio track through the following formula:
- where i indicates the i-th channel among the multiple channels; X_i(t) is the original audio track at time t; a_s(t) is the adjustment coefficient of the first sound-emitting object at time t; g_s(t) represents the translation (panning) coefficient of the first sound-emitting object at time t, and g_{i,s}(t) represents the i-th row of g_s(t); o_s(t) is the first single-object audio track at time t; S_1 is the set of sound-emitting objects that need to be replaced in the original audio track, and if the first sound-emitting object replaces a sound-emitting object in the original audio track, S_1 is an empty set; S_2 is the set of sound-emitting objects added in the target audio track compared with the original audio track, and if the first sound-emitting object is a sound-emitting object copied from the original audio track, S_2 is an empty set; S_1 and/or S_2 are sound-emitting objects of the multimedia file and include the first sound-emitting object; θ_i is the azimuth angle obtained by the calibrator calibrating the i-th external playback device; φ_i is the inclination angle obtained by the calibrator calibrating the i-th external playback device; r_i is the distance between the i-th external playback device and the calibrator; N is a positive integer, i is a positive integer, i ≤ N, and the first sound source is located within the tetrahedron formed by the N external playback devices.
- in this way, when the playback device is an external playback device, the technical problem of how to obtain the target audio track is solved. It is convenient to save the rendered audio track, which facilitates subsequent playback and reduces repeated rendering operations.
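- A hedged sketch of the multi-loudspeaker case: a per-channel panning gain vector (standing in for g_s) is computed for the first sound source position with a crude direction-dot-product law (not the patent's formula), and the single-object track is removed from and added to each of the N channels of the original track.

```python
import numpy as np

def panning_gains(source_xyz, speaker_xyz):
    """Crude panning law: gains follow the positive dot products between the
    source direction and each loudspeaker direction, normalised to unit power."""
    src = source_xyz / np.linalg.norm(source_xyz)
    spk = speaker_xyz / np.linalg.norm(speaker_xyz, axis=1, keepdims=True)
    g = np.clip(spk @ src, 0.0, None)
    return g / (np.linalg.norm(g) + 1e-9)

def build_multichannel_target(original_tracks, object_track, gains_old, gains_new, a=1.0):
    """original_tracks: N x T array; gains_old / gains_new describe how the object
    sat in the mix (S_1) and where it should go (S_2)."""
    return original_tracks - a * np.outer(gains_old, object_track) \
                           + a * np.outer(gains_new, object_track)

speakers = np.array([[-1.4, 0.0, 1.4], [1.4, 0.0, 1.4],
                     [1.4, 0.0, -1.4], [-1.4, 0.0, -1.4]])    # 4 calibrated speakers
obj = np.sin(2 * np.pi * 440 * np.arange(48000) / 48000)
original = np.tile(obj * 0.25, (4, 1))                        # object equally in all channels
g_new = panning_gains(np.array([0.5, 0.0, 1.0]), speakers)    # move it front-right
target = build_multichannel_target(original, obj,
                                   gains_old=np.full(4, 0.25), gains_new=g_new)
print(target.shape, np.round(g_new, 3))
```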
- the music features in the above steps include: at least one of music structure, music emotion, and singing mode.
- the above steps further include: separating a second single-object audio track from the multimedia file; determining the second sound source position of the second sound-emitting object based on the reference information; and performing spatial rendering on the second single-object audio track based on the second sound source position to obtain a rendered second single-object audio track.
- at least two single-object audio tracks can be separated from the multimedia file and correspondingly spatially rendered, which enhances the user's ability to edit specific sound-emitting objects in the audio, can be applied to object-based production of music and film and television works, and increases the user's control over and playability of the music.
- a second aspect of the embodiments of the present application provides a rendering method, which can be applied to scenarios such as music production and film and television production. The method can be executed by a rendering device, or by a component of the rendering device (such as a processor, a chip, or a chip system).
- the method includes: acquiring a multimedia file; acquiring a first single-object audio track based on the multimedia file, where the first single-object audio track corresponds to a first sound-emitting object; displaying a user interface, the user interface including rendering mode options; in response to a first operation of the user, determining an automatic rendering mode or an interactive rendering mode from the rendering mode options; when the automatic rendering mode is determined, obtaining the rendered first single-object audio track based on a preset manner; or, when the interactive rendering mode is determined, obtaining reference position information in response to a second operation of the user, determining the first sound source position of the first sound-emitting object based on the reference position information, and rendering the first single-object audio track based on the first sound source position to obtain the rendered first single-object audio track.
- the rendering device determines the automatic rendering mode or the interactive rendering mode from the rendering mode options according to the first operation of the user.
- the rendering device may automatically acquire the rendered first single-object audio track based on the user's first operation.
- the spatial rendering of the audio track corresponding to the first sound-emitting object in the multimedia file can be realized through the interaction between the rendering device and the user, so as to provide the user with an immersive stereo sound effect.
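- A hedged, self-contained sketch of this mode dispatch: the first operation selects either the automatic rendering mode (position derived from media information, i.e. the preset manner) or the interactive rendering mode (position derived from user-supplied reference position information). The helper functions are trivial placeholders, not the patent's implementation.

```python
def position_from_media_info(media_info):
    # Preset/automatic mode: a position derived from media information.
    return media_info.get("object_position", (0.0, 0.0, 2.0))

def position_from_reference(reference_position):
    # Interactive mode: a position derived from the user's second operation.
    return reference_position

def choose_source_position(rendering_mode, media_info=None, reference_position=None):
    if rendering_mode == "automatic":
        return position_from_media_info(media_info or {})
    if rendering_mode == "interactive":
        return position_from_reference(reference_position)
    raise ValueError(f"unknown rendering mode: {rendering_mode}")

print(choose_source_position("automatic", media_info={"object_position": (30.0, 0.0, 2.0)}))
print(choose_source_position("interactive", reference_position=(-45.0, 10.0, 1.5)))
```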
- the preset manner in the above steps includes: acquiring media information of the multimedia file; determining the first sound source position of the first sound-emitting object based on the media information; and rendering the first single-object audio track based on the first sound source position to obtain the rendered first single-object audio track.
- the media information in the above steps includes at least one of: the text to be displayed in the multimedia file, the image to be displayed in the multimedia file, the music feature of the music to be played in the multimedia file, and the sound source type corresponding to the first sound-emitting object.
- the rendering device determines the multimedia file to be processed through interaction with the user, thereby increasing the user's controllability and playability of the music in the multimedia file.
- the reference position information in the above steps includes first position information of a sensor or second position information selected by the user.
- the type of the playback device is determined through the user's operation. Different playback device types can correspond to different spatial rendering formulas, so that the spatial effect of the rendered audio track played by the playback device in the later stage is more realistic and accurate.
- in the above steps, when the media information includes an image and the image includes the first sound-emitting object, determining the first sound source position of the first sound-emitting object based on the media information includes: presenting the image; determining third position information of the first sound-emitting object in the image, where the third position information includes the two-dimensional coordinates and the depth of the first sound-emitting object in the image; and obtaining the first sound source position based on the third position information.
- the rendering device may automatically present the image and determine the sound-emitting object in the image, obtain third position information of the sound-emitting object, and then obtain the position of the first sound source.
- the rendering device can automatically identify the multimedia file, and when the multimedia file includes an image and the image includes the first sound-emitting object, the rendering device can automatically acquire the rendered first single-object audio track.
- the 3D immersion is rendered in the headset or external environment, and the real sound moves with the picture, allowing users to obtain the best sound effect experience.
- determining the third position information of the first sound-emitting object in the image includes: in response to a third operation of the user on the image, determining the third position information of the first sound-emitting object.
- the user may select the first sound-emitting object from the plurality of sound-emitting objects in the presented image, that is, the user may select the first single-object audio track corresponding to the rendered first sound-emitting object.
- the coordinates of the sounding object and the single-object soundtrack are extracted, and the 3D immersion is rendered in the earphone or external environment, so that the real sound moves with the picture, so that the user can obtain the best sound effect experience.
- when the media information includes the music feature of the music to be played in the multimedia file, determining the first sound source position of the first sound-emitting object based on the media information includes: identifying the music feature; and determining the first sound source position based on the association relationship and the music feature, where the association relationship is used to represent the association between the music feature and the first sound source position.
- the orientation and dynamics of the extracted specific sound-emitting objects are set according to the musical characteristics of the music, so that the 3D rendering is more natural and the artistry is better reflected.
- when the media information includes text to be displayed in the multimedia file and the text contains position-related text, determining the first sound source position of the first sound-emitting object based on the media information includes: identifying the position text; and determining the first sound source position based on the position text.
- obtaining the reference position information in response to a second operation of the user includes: in response to the user's second operation on the sensor, acquiring first position information, where the first position information includes the first attitude angle of the sensor and the distance between the sensor and the playback device. Determining the first sound source position of the first sound-emitting object based on the reference position information includes: converting the first position information into the first sound source position.
- the user can perform dynamic rendering on the selected sound-emitting object in real time or in post-production through the orientation (i.e., the first attitude angle) provided by the sensor.
- the sensor acts like a laser pointer, and the direction of the laser indicates the position of the sound source.
- such control can give the sound-emitting object a specific spatial orientation and motion, realize interactive creation between the user and the audio, and provide the user with a new experience.
- obtaining the reference position information in response to a second operation of the user includes: in response to the user's second operation on the sensor, acquiring first position information, where the first position information includes the second attitude angle of the sensor and the acceleration of the sensor. Determining the first sound source position of the first sound-emitting object based on the reference position information includes: converting the first position information into the first sound source position.
- the user can use the actual position information of the sensor as the sound source position to control the sound-emitting object and perform dynamic rendering in real time or in post-production, so that the movement trajectory of the sound-emitting object can be completely controlled by the user, which greatly increases editing flexibility.
- obtaining the reference position information in response to the second operation of the user includes: presenting a spherical view, where the center of the spherical view is the position of the user and the radius of the spherical view is the distance between the user's position and the playback device; and, in response to the user's second operation, determining the second position information in the spherical view. Determining the first sound source position of the first sound-emitting object based on the reference position information includes: converting the second position information into the first sound source position.
- the user can select the second position information in the spherical view (for example, by clicking, dragging, or sliding) to control the selected sound-emitting object and perform dynamic rendering in real time or in post-production, giving it a specific spatial orientation and motion. This realizes interactive creation between the user and the audio and provides the user with a new experience.
- the above step: acquiring a multimedia file includes: determining a multimedia file from at least one stored multimedia file in response to a fourth operation of the user.
- the multimedia file may be determined from at least one stored multimedia file based on the user's selection, so that the rendering and production of the first single-object audio track corresponding to the first sound-emitting object in the multimedia file selected by the user is realized, improving the user experience.
- the above-mentioned user interface further includes a playback device type option; the method further includes: in response to a fifth operation of the user, determining the type of the playback device from the playback device type option. Rendering the first single-object audio track based on the first sound source position to obtain the rendered first single-object audio track includes: rendering the first single-object audio track based on the first sound source position and the type of the playback device to obtain the rendered first single-object audio track.
- a rendering mode suitable for the playback device being used by the user is selected, thereby improving the rendering effect of the playback device and making the 3D rendering more natural.
- in the above steps, acquiring the first single-object audio track based on the multimedia file includes: separating the first single-object audio track from the original audio track in the multimedia file, where the original audio track is obtained by synthesizing at least the first single-object audio track and the second single-object audio track, and the second single-object audio track corresponds to the second sound-emitting object.
- because the original audio track is composed of at least the first single-object audio track and the second single-object audio track, spatial rendering can be constructed for a specific sound-emitting object, which enhances the user's ability to edit audio, can be applied to object-based production of music and film and television works, and increases the user's control over and playability of the music.
- the first single-object audio track can be separated from the multimedia file, so that the single-object audio track corresponding to a specific sound-emitting object in the multimedia file can be rendered, improving the user's audio creation capability and experience.
- the user can perform real-time or later dynamic rendering of the selected sound-emitting object through the orientation or position provided by the sensor.
- the control can give specific spatial orientation and motion to the sounding object, realize the interactive creation between the user and the audio, and provide the user with a new experience.
- the user can perform real-time or later dynamic rendering of the selected sound-emitting object through the orientation (ie, the first attitude angle) provided by the sensor.
- the sensor is like a laser pointer, and the direction of the laser is the position of the sound source.
- the control can give specific spatial orientation and motion to the sounding object, realize the interactive creation between the user and the audio, and provide the user with a new experience.
- the actual position information of the sensor is used as the sound source position to control the sound-emitting object and perform dynamic rendering in real time or in post-production, so that the movement trajectory of the sound-emitting object can be simply and completely controlled by the user, which greatly increases editing flexibility.
- the user can control the selected sound-emitting object by dragging the interface and perform real-time or later dynamic rendering, giving it specific spatial orientation and motion, which can realize interactive creation between users and audio , to provide users with a new experience.
- the rendering device can set the orientation and dynamics of the extracted specific sound-emitting object according to the music features of the music, so that the 3D rendering of the audio track corresponding to the sound-emitting object is more natural and the artistry is better represented.
- the 3D immersion is rendered in the earphone or external environment, so that the real sound moves with the picture, so that the user can obtain the best sound effect experience.
- the rendering device can automatically track the sound-emitting object in the video after determining the sound-emitting object, and render the audio track corresponding to the sound-emitting object throughout the entire video, which can also be used in professional mixing post-production to improve the mixer's work efficiency.
- the rendering device may determine the sound-emitting object in the image according to the fourth operation of the user, track the sound-emitting object in the image, and render the audio track corresponding to the sound-emitting object.
- the control can give specific spatial orientation and motion to the sounding object, realize the interactive creation between the user and the audio, and provide the user with a new experience.
- the music features in the above steps include: at least one of music structure, music emotion, and singing mode.
- the above steps further include: separating a second single-object audio track from the original audio track; determining the second sound source position of the second sound-emitting object based on the reference information; and performing spatial rendering on the second single-object audio track based on the second sound source position to obtain a rendered second single-object audio track.
- at least two single-object audio tracks can be separated from the original audio track and correspondingly spatially rendered, which enhances the user's ability to edit specific sound-emitting objects in the audio, can be applied to object-based production of music and film and television works, and increases the user's control over and playability of the music.
- a third aspect of the present application provides a rendering device, which can be applied to scenes such as music, film and television production, and the like, and the rendering device includes:
- an acquisition unit configured to acquire a first single-object audio track based on the multimedia file, and the first single-object audio track corresponds to the first sounding object;
- a determining unit configured to determine the first sound source position of the first sound-emitting object based on reference information, the reference information includes reference position information and/or media information of a multimedia file, and the reference position information is used to indicate the first sound source position;
- the rendering unit is configured to perform spatial rendering on the first single-object audio track based on the position of the first sound source, so as to obtain the rendered first single-object audio track.
- the above-mentioned media information includes at least one of: text to be displayed in the multimedia file, an image to be displayed in the multimedia file, a music feature of the music to be played in the multimedia file, and a sound source type corresponding to the first sound-emitting object.
- the above-mentioned reference position information includes first position information of the sensor or second position information selected by the user.
- the above-mentioned determining unit is also used to determine the type of the playback device, where the playback device is used to play the target audio track and the target audio track is acquired based on the rendered first single-object audio track; the rendering unit is specifically configured to perform spatial rendering on the first single-object audio track based on the first sound source position and the type of the playback device.
- the above-mentioned reference information includes media information, and when the media information includes an image and the image includes the first sound-emitting object, the determining unit is specifically configured to determine third position information of the first sound-emitting object in the image, where the third position information includes the two-dimensional coordinates and the depth of the first sound-emitting object in the image; the determining unit is specifically configured to acquire the first sound source position based on the third position information.
- the above-mentioned reference information includes media information, and when the media information includes the music feature of the music to be played in the multimedia file, the determining unit is specifically configured to determine the first sound source position based on the association relationship and the music feature, where the association relationship is used to represent the association between the music feature and the first sound source position.
- the above-mentioned reference information includes media information, and when the media information includes text to be displayed in the multimedia file and the text contains position-related text, the determining unit is specifically configured to identify the position text; the determining unit is specifically configured to determine the first sound source position based on the position text.
- the above-mentioned reference information includes reference position information, and when the reference position information includes the first position information, the obtaining unit is further configured to obtain the first position information, where the first position information includes the first attitude angle of the sensor and the distance between the sensor and the playback device; the determining unit is specifically configured to convert the first position information into the first sound source position.
- the above-mentioned reference information includes reference position information, and when the reference position information includes the first position information, the obtaining unit is further configured to obtain the first position information, where the first position information includes the second attitude angle of the sensor and the acceleration of the sensor; the determining unit is specifically configured to convert the first position information into the first sound source position.
- the above-mentioned reference information includes reference position information, and when the reference position information includes the second position information, the rendering device further includes: a providing unit, configured to provide a spherical view for the user to select from, where the center of the spherical view is the position of the user and the radius of the spherical view is the distance between the user's position and the playback device; the acquiring unit is also used to acquire the second position information selected by the user in the spherical view; the determining unit is specifically configured to convert the second position information into the first sound source position.
- the above-mentioned acquisition unit is specifically used to separate the first single-object audio track from the original audio track in the multimedia file, where the original audio track is obtained by synthesizing at least the first single-object audio track and the second single-object audio track, and the second single-object audio track corresponds to the second sound-emitting object.
- the above obtaining unit is specifically configured to separate the first single-object audio track from the original audio track by using a trained separation network.
- the above-mentioned trained separation network is obtained by training the separation network with the training data as its input and with the goal of making the value of the loss function less than the first threshold. The training data includes a training audio track, which is obtained by synthesizing at least the initial third single-object audio track and the initial fourth single-object audio track. The initial third single-object audio track corresponds to the third sound-emitting object, the initial fourth single-object audio track corresponds to the fourth sound-emitting object, the third sound-emitting object is of the same type as the first sound-emitting object, and the second sound-emitting object is of the same type as the fourth sound-emitting object. The output of the separation network includes the separated third single-object audio track, and the loss function is used to indicate the difference between the separated third single-object audio track and the initial third single-object audio track.
- if the playback device is an earphone, the obtaining unit is specifically configured to obtain the rendered first single-object audio track through the following formula;
- or, if the playback devices are N external playback devices, the obtaining unit is specifically configured to obtain the rendered first single-object audio track through the following formula;
- ⁇ i is the calibrator to calibrate the ith external speaker
- ⁇ i is the inclination angle obtained by the calibrator to calibrate the ith external device
- ri is the distance between the ith external device and the calibrator
- N is a positive integer
- i is a positive integer
- i ⁇ N the position of the first sound source is within a tetrahedron formed by N external devices.
- the above-mentioned obtaining unit is further configured to obtain the target audio track based on the rendered first single-object audio track and the original audio track in the multimedia file;
- the device further includes: a sending unit, configured to send the target audio track to the playback device, where the playback device is used to play the target audio track.
- the obtaining unit is specifically configured to obtain the target audio track by the following formula, where: i indicates the left or right channel; X i (t) is the original audio track of channel i at time t; a s (t) is the adjustment coefficient of the first sounding object at time t; h i is the HRTF filter coefficient of the left channel or the right channel corresponding to the first sounding object at time t, and is related to the position of the first sound source; o s (t) is the first single-object audio track at time t; τ is the integral term; S 1 is the sounding object that needs to be replaced in the original audio track, and when the first sounding object is not used to replace a sounding object in the original audio track, S 1 is an empty set; S 2 is the sounding object added in the target audio track compared with the original audio track, for example, when the first sounding object is a copy of a sounding object in the original audio track, S 2 includes the first sounding object.
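- Offered only as a hedged sketch based on the variable definitions listed above, and not as the formula referred to in the text, one plausible convolution form for the binaural (left/right channel) case is:

$$X_i^{\text{target}}(t) \;=\; X_i(t)\;-\;\sum_{s\in S_1} a_s(t)\!\int h_{i,s}(\tau)\,o_s(t-\tau)\,d\tau\;+\;\sum_{s\in S_2} a_s(t)\!\int h_{i,s}(\tau)\,o_s(t-\tau)\,d\tau$$

Here the integral over τ is the convolution of the single-object track with the channel's HRTF filter; the exact form used in this application may differ.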
- the obtaining unit is specifically configured to obtain the target audio track by the following formula, where: i indicates the i-th channel among the multiple channels; X i (t) is the original audio track of channel i at time t; a s (t) is the adjustment coefficient of the first sounding object at time t; g s (t) represents the translation coefficient of the first sounding object at time t, and g i,s (t) represents the i-th row in g s (t); o s (t) is the first single-object audio track at time t; S 1 is the sounding object that needs to be replaced in the original audio track, and S 1 may be an empty set; S 2 is the sounding object added in the target audio track compared with the original audio track, and S 2 may be an empty set; S 1 and/or S 2 are sounding objects of the multimedia file and include the first sounding object.
- the azimuth angle obtained by the calibrator calibrating the i-th external device, the inclination angle obtained by the calibrator calibrating the i-th external device, and the distance r i between the i-th external device and the calibrator are also used; N is a positive integer, i is a positive integer, and i ≤ N; the position of the first sound source is within a tetrahedron formed by the N external devices.
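- Again only as a hedged sketch derived from the variable definitions above, and not the formula referred to in the text, a plausible multi-channel form replaces the HRTF convolution with per-channel translation gains:

$$X_i^{\text{target}}(t) \;=\; X_i(t)\;-\;\sum_{s\in S_1} a_s(t)\,g_{i,s}(t)\,o_s(t)\;+\;\sum_{s\in S_2} a_s(t)\,g_{i,s}(t)\,o_s(t)$$

where g i,s (t), the i-th row of the translation coefficient g s (t), would act as the gain of the first sounding object on channel i.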
- a fourth aspect of the present application provides a rendering device, which can be applied to scenes such as music, film and television production, and the like, and the rendering device includes:
- an obtaining unit, configured to obtain the first single-object audio track based on the multimedia file, where the first single-object audio track corresponds to the first sounding object;
- a display unit for displaying a user interface, the user interface including rendering mode options
- a determination unit used for determining the automatic rendering mode or the interactive rendering mode from the rendering mode options in response to the first operation of the user on the user interface;
- the obtaining unit is further configured to obtain the rendered first single-object audio track based on the preset mode when the automatic rendering mode is determined by the determining unit; or
- the obtaining unit is further configured to: obtain the reference position information in response to the second operation of the user when the determining unit determines the interactive rendering mode; determine the first sound source position of the first sounding object based on the reference position information; and render the first single-object audio track based on the first sound source position to obtain the rendered first single-object audio track.
- the above-mentioned preset manner includes: the acquiring unit is further configured to acquire media information of the multimedia file; the determining unit is further configured to determine the first sound source position of the first sounding object based on the media information; and the obtaining unit is further configured to render the first single-object audio track based on the first sound source position, so as to obtain the rendered first single-object audio track.
- the above-mentioned media information includes: text to be displayed in the multimedia file, images to be displayed in the multimedia file, music characteristics of the music to be played in the multimedia file, and At least one of the sound source types corresponding to the first sound-emitting object.
- the above-mentioned reference location information includes first location information of the sensor or second location information selected by the user.
- when the above-mentioned media information includes an image and the image includes the first sounding object, the determining unit is specifically configured to present the image; the determining unit is specifically configured to determine third position information of the first sounding object in the image, where the third position information includes the two-dimensional coordinates and depth of the first sounding object in the image; and the determining unit is specifically configured to obtain the position of the first sound source based on the third position information.
- the above-mentioned determining unit is specifically configured to determine the third position information of the first sound-emitting object in response to a third operation of the user on the image.
- the determining unit is specifically used to identify the music feature
- the determining unit is specifically configured to determine the position of the first sound source based on the association relationship and the music feature, and the association relationship is used to represent the association between the music feature and the position of the first sound source.
- the determining unit is specifically configured to identify the position text; the determining unit is specifically configured to determine the position of the first sound source based on the position text.
- when the above-mentioned reference position information includes the first position information, the determining unit is specifically configured to obtain the first position information in response to the user's second operation on the sensor.
- the first position information includes the first attitude angle of the sensor and the distance between the sensor and the playback device; the determining unit is specifically configured to convert the first position information into the position of the first sound source.
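- As a minimal sketch only (not taken from this application), the conversion from an attitude angle plus a distance into a first sound source position could look as follows, assuming the attitude angle supplies an azimuth and a pitch and the result is expressed in listener-centered Cartesian coordinates; the axis convention is an assumption.

```python
import math

def attitude_to_position(azimuth_deg, pitch_deg, distance_m):
    """Convert a pointing direction (azimuth, pitch) plus a distance into x/y/z coordinates."""
    az = math.radians(azimuth_deg)
    el = math.radians(pitch_deg)
    x = distance_m * math.cos(el) * math.cos(az)  # forward
    y = distance_m * math.cos(el) * math.sin(az)  # left/right
    z = distance_m * math.sin(el)                 # up/down
    return x, y, z

print(attitude_to_position(30.0, 10.0, 2.0))
```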
- when the above-mentioned reference position information includes the first position information, the determining unit is specifically configured to obtain the first position information in response to the user's second operation on the sensor.
- the first position information includes the second attitude angle of the sensor and the acceleration of the sensor; the determining unit is specifically configured to convert the first position information into the position of the first sound source.
- when the above-mentioned reference position information includes the second position information, the determining unit is specifically configured to present a spherical view, where the center of the spherical view is the position of the user and the radius of the spherical view is the distance between the user's position and the playback device; the determining unit is specifically configured to determine the second position information in the spherical view in response to the user's second operation; and the determining unit is specifically configured to convert the second position information into the position of the first sound source.
- the above obtaining unit is specifically configured to determine a multimedia file from at least one stored multimedia file in response to a fourth operation of the user.
- the above-mentioned user interface further includes a playback device type option; the determining unit is further configured to determine the type of the playback device from the playback device type option in response to the fifth operation of the user; and the obtaining unit is specifically configured to render the first single-object audio track based on the first sound source position and the type of the playback device, so as to obtain the rendered first single-object audio track.
- the above-mentioned acquisition unit is specifically used to separate the first single-object audio track from the original audio track in the multimedia file, where the original audio track is obtained by synthesizing at least the first single-object audio track and a second single-object audio track, and the second single-object audio track corresponds to a second sounding object.
- the above-mentioned music features include: at least one of music structure, music emotion, and singing mode.
- the above-mentioned obtaining unit is further configured to separate a second single-object audio track from the multimedia file; determine a second sound source position of the second sounding object; and spatially render the second single-object audio track based on the second sound source position to obtain the rendered second single-object audio track.
- a fifth aspect of the present application provides a rendering device, where the rendering device performs the method in the foregoing first aspect or any possible implementation manner of the first aspect, or performs the method in the foregoing second aspect or any possible implementation manner of the second aspect.
- a sixth aspect of the present application provides a rendering device, including: a processor, where the processor is coupled to a memory, and the memory is used to store programs or instructions; when the programs or instructions are executed by the processor, the rendering device implements the method in the foregoing first aspect or any possible implementation manner of the first aspect, or implements the method in the foregoing second aspect or any possible implementation manner of the second aspect.
- a seventh aspect of the present application provides a computer-readable medium on which a computer program or instruction is stored; when the computer program or instruction is run on a computer, the computer is caused to perform the method in the foregoing first aspect or any possible implementation manner of the first aspect, or to perform the method in the foregoing second aspect or any possible implementation manner of the second aspect.
- an eighth aspect of the present application provides a computer program product which, when run on a computer, causes the computer to perform the method in the foregoing first aspect or any possible implementation manner of the first aspect, or the method in the foregoing second aspect or any possible implementation manner of the second aspect.
- the embodiments of the present application have the following advantages: a first single-object audio track is obtained based on a multimedia file, where the first single-object audio track corresponds to the first sounding object; the first sound source position of the first sounding object is determined based on reference information; and the first single-object audio track is spatially rendered based on the first sound source position to obtain the rendered first single-object audio track.
- the stereoscopic sense of space of the first single-object soundtrack corresponding to the first sound-emitting object in the multimedia file can be improved, and the user can be provided with immersive stereoscopic sound effects.
- FIG. 1 is a schematic structural diagram of a system architecture provided by the application.
- FIG. 2 is a schematic structural diagram of a convolutional neural network provided by the application.
- FIG. 3 is a schematic diagram of another convolutional neural network structure provided by the application.
- FIG. 4 is a schematic diagram of a chip hardware structure provided by the application.
- FIG. 5 is a schematic flowchart of a method for training a separation network provided by the application.
- FIG. 6 is a schematic structural diagram of a separation network provided by the application.
- FIG. 7 is a schematic structural diagram of another separation network provided by the application.
- FIG. 9 is a schematic diagram of an application scenario provided by the present application.
- FIG. 10 is a schematic flowchart of a rendering method provided by the application.
- FIG. 11 is a schematic flowchart of a playback device calibration method provided by the application.
- FIG. 18 is a schematic diagram of the orientation of a mobile phone provided by the application.
- FIG. 19 is another schematic diagram of the display interface of the rendering device provided by the application.
- FIG. 20 is a schematic diagram of determining the position of a sound source using a mobile phone provided by the present application.
- 21-47 are other schematic diagrams of the display interface of the rendering device provided by the present application.
- FIG. 48 is a schematic structural diagram of the external device system provided by the application in a spherical coordinate system.
- 49-50 are several schematic diagrams of sharing rendering rules between users provided by the present application.
- 51-53 are other schematic diagrams of the display interface of the rendering device provided by the present application.
- FIG. 54 is a schematic diagram of user interaction under the sound hunter game scene provided by this application.
- 55-57 are several schematic diagrams of user interaction in a multi-person interaction scenario provided by the present application.
- 58-61 are schematic diagrams of several structures of the rendering device provided by this application.
- FIG. 62 is a schematic structural diagram of the sensor device provided by the application.
- the embodiment of the present application provides a rendering method, which can improve the stereoscopic sense of space of the first single-object soundtrack corresponding to the first sound-emitting object in the multimedia file, and provide users with immersive stereoscopic sound effects.
- a neural network can be composed of neural units, and a neural unit can refer to an operation unit that takes X s and an intercept 1 as inputs; the output of the operation unit can be h = f(Σ s W s X s + b), where:
- W s is the weight of X s
- b is the bias of the neural unit.
- f is an activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer.
- the activation function can be a sigmoid function.
- a neural network is a network formed by connecting many of the above single neural units together, that is, the output of one neural unit can be the input of another neural unit.
- the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neural units.
- a deep neural network, also known as a multi-layer neural network, can be understood as a neural network with many hidden layers; there is no special metric for "many" here. According to the positions of different layers, the neural network inside the DNN can be divided into three categories: input layer, hidden layer, and output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the middle layers are all hidden layers. The layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the (i+1)-th layer. Of course, the deep neural network may also not include hidden layers, which is not limited here.
- the work of each layer in a deep neural network can be described mathematically as y = α(W·x + b). From the physical level, the work of each layer in the deep neural network can be understood as completing the transformation from the input space to the output space (that is, from the row space to the column space of the matrix) through five operations on the input space (the set of input vectors); these five operations include: 1. dimension raising/lowering; 2. enlarging/reducing; 3. rotation; 4. translation; 5. "bending". The operations 1, 2, and 3 are completed by W·x, the operation 4 is completed by +b, and the operation 5 is implemented by α().
- W is the weight vector, and each value in the vector represents the weight value of a neuron in the neural network of this layer.
- This vector W determines the space transformation from the input space to the output space described above, that is, the weight W of each layer controls how the space is transformed.
- the purpose of training the deep neural network is to finally obtain the weight matrix of all layers of the trained neural network (the weight matrix formed by the vectors W of many layers). Therefore, the training process of the neural network is essentially learning the way to control the spatial transformation, and more specifically, learning the weight matrix.
- Convolutional neural network is a deep neural network with a convolutional structure.
- a convolutional neural network consists of a feature extractor consisting of convolutional and subsampling layers.
- the feature extractor can be viewed as a filter, and the convolution process can be viewed as convolving the same trainable filter with an input image or a convolutional feature map.
- the convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal.
- a neuron can only be connected to some of its neighbors.
- a convolutional layer usually contains several feature planes, and each feature plane can be composed of some neural units arranged in a rectangle.
- Neural units in the same feature plane share weights, and the shared weights here are convolution kernels.
- weight sharing can be understood as meaning that the way image information is extracted is independent of location. The underlying principle is that the statistics of one part of the image are the same as those of the other parts, which means that image information learned in one part can also be used in another part. So for all positions on the image, the same learned image information can be used.
- multiple convolution kernels can be used to extract different image information. Generally, the more convolution kernels, the richer the image information reflected by the convolution operation.
- the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can acquire reasonable weights by learning during the training process of the convolutional neural network.
- the immediate benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
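- As a small illustration of weight sharing (not part of this application's text), the following sketch slides one 3x3 convolution kernel over every position of an image, so the same weights are reused at all locations; the kernel values are arbitrary examples:

```python
import numpy as np

def conv2d_single_kernel(image, kernel, stride=1):
    """Valid 2D convolution with one shared kernel (no padding)."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)  # the same weights are applied at every position
    return out

image = np.random.rand(8, 8)
edge_kernel = np.array([[1., 0., -1.], [2., 0., -2.], [1., 0., -1.]])  # e.g. responds to vertical edges
print(conv2d_single_kernel(image, edge_kernel).shape)  # (6, 6)
```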
- the separation network, the identification network, the detection network, the depth estimation network and other networks in the embodiments of the present application may all be CNNs.
- a recurrent neural network is used to process sequence data in which the current output of a sequence is also related to the previous output. The specific manifestation is that the network memorizes the previous information, saves it in the internal state of the network, and applies it to the calculation of the current output.
- HRTF: head-related transfer function.
- An audio track is a track for recording audio data, and each audio track has one or more attribute parameters, and the attribute parameters include audio format, bit rate, dubbing language, sound effect, number of channels, volume and so on.
- Tracks can be single or multi-track (or called mixed tracks).
- a single audio track may correspond to one or more sounding objects, and a multi-audio track includes at least two single audio tracks.
- a single-object track corresponds to one sounding object.
- STFT: short-time Fourier transform.
- an embodiment of the present invention provides a system architecture 100 .
- the data collection device 160 is used to collect training data.
- the training data includes: a multimedia file, where the multimedia file includes an original audio track, and the original audio track corresponds to at least one sounding object.
- the training data is stored in the database 130 , and the training device 120 trains and obtains the target model/rule 101 based on the training data maintained in the database 130 .
- Embodiment 1 will be used to describe how the training device 120 obtains the target model/rule 101 based on the training data in more detail below.
- the target model/rule 101 can be used to implement the rendering method provided by this embodiment of the present application, and there are multiple situations for the target model/rule 101.
- in one case (when the target model/rule 101 is the first model), the multimedia file is input into the target model/rule 101, and the first single-object audio track corresponding to the first sounding object can be obtained.
- in another case (when the target model/rule 101 is the second model), the multimedia file is input into the target model/rule 101 after relevant preprocessing, and the first single-object audio track corresponding to the first sounding object can be obtained.
- the target model/rule 101 in this embodiment of the present application may specifically include a separation network, and may further include a recognition network, a detection network, a depth estimation network, and the like, which are not specifically limited here.
- the separation network is obtained by training training data.
- the training data maintained in the database 130 may not necessarily all come from the collection of the data collection device 160, and may also be received and acquired from other devices.
- the training device 120 may not necessarily train the target model/rule 101 completely based on the training data maintained by the database 130, and may also obtain training data from the cloud or other places for model training; the above description should not be construed as a limitation on the embodiments of this application.
- the target model/rule 101 obtained by training by the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 1, which may be a terminal such as a laptop, an augmented reality (AR)/virtual reality (VR) device, or an in-vehicle terminal, and may also be a server or the cloud.
- the execution device 110 is configured with an I/O interface 112, which is used for data interaction with external devices.
- the user can input data to the I/O interface 112 through the client device 140; in the embodiments of the present application, the input data may include multimedia files, which can be input by the user, uploaded by the user through an audio device, or come from a database, which is not limited here.
- the preprocessing module 113 is configured to perform preprocessing according to the multimedia file received by the I/O interface 112.
- the preprocessing module 113 may be configured to perform short-time Fourier transform processing on the audio track in the multimedia file to obtain the time spectrum.
- when the execution device 110 preprocesses the input data, or when the calculation module 111 of the execution device 110 performs calculation and other related processing, the execution device 110 can call the data, codes, etc. in the data storage system 150 for corresponding processing, and the data, instructions, etc. obtained by the corresponding processing may also be stored in the data storage system 150.
- the I/O interface 112 returns the processing result, such as the first single-object audio track corresponding to the first sounding object obtained above, to the client device 140, so as to be provided to the user.
- the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or tasks, and the corresponding target models/rules 101 can be used to achieve the above goals or complete the above tasks, thus providing the user with the desired result.
- the user can manually specify input data, which can be operated through the interface provided by the I/O interface 112 .
- the client device 140 can automatically send the input data to the I/O interface 112 . If the user's authorization is required to request the client device 140 to automatically send the input data, the user can set the corresponding permission in the client device 140 .
- the user can view the result output by the execution device 110 on the client device 140, and the specific presentation form can be a specific manner such as display, sound, and action.
- the client device 140 can also be used as a data collection terminal to collect the input data of the input I/O interface 112 and the output result of the output I/O interface 112 as new sample data as shown in the figure, and store them in the database 130 .
- the I/O interface 112 can also directly store the input data input into the I/O interface 112 and the output result of the I/O interface 112, as shown in the figure, in the database 130 as new sample data.
- FIG. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the positional relationship between the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
- the data storage system 150 is an external memory relative to the execution device 110 , and in other cases, the data storage system 150 may also be placed in the execution device 110 .
- the target model/rule 101 is obtained by training by the training device 120, and the target model/rule 101 may be the separation network in this embodiment of the present application.
- the separation network can be a convolutional neural network or a recurrent neural network.
- CNN is a very common neural network.
- a convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture; a deep learning architecture refers to learning at multiple levels of abstraction through machine learning algorithms.
- CNN is a feed-forward artificial neural network in which individual neurons can respond to images fed into it.
- a convolutional neural network (CNN) 100 may include an input layer 110 , a convolutional/pooling layer 120 , where the pooling layer is optional, and a neural network layer 130 .
- the convolutional layer/pooling layer 120 may include layers 121-126 as examples: in one implementation, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, layer 124 is a pooling layer, layer 125 is a convolutional layer, and layer 126 is a pooling layer; in another implementation, 121 and 122 are convolutional layers, 123 is a pooling layer, 124 and 125 are convolutional layers, and 126 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or it can be used as the input of another convolutional layer to continue the convolution operation.
- the convolution layer 121 may include many convolution operators, which are also called kernels, and their role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
- the convolution operator can essentially be a weight matrix, and this weight matrix is usually pre-defined. In the process of convolving an image, the weight matrix is usually processed pixel by pixel (or two pixels by two pixels, depending on the value of the stride) along the horizontal direction on the input image, so as to complete the work of extracting specific features from the image.
- the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image.
- the weight matrix extends to the entire depth of the input image. Therefore, convolution with a single weight matrix will produce a convolutional output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same dimension are applied.
- the output of each weight matrix is stacked to form the depth dimension of the convolutional image.
- different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and yet another weight matrix is used to blur unwanted noise in the image.
- the dimensions of the multiple weight matrices are the same, and the dimension of the feature maps extracted from the weight matrices with the same dimensions are also the same, and then the multiple extracted feature maps with the same dimensions are combined to form the output of the convolution operation .
- weight values in these weight matrices need to be obtained through a lot of training in practical applications, and each weight matrix formed by the weight values obtained through training can extract information from the input image, thereby helping the convolutional neural network 100 to make correct predictions.
- the initial convolutional layer (for example, 121) often extracts more general features; the features extracted by the later convolutional layers become more and more complex, such as features with high-level semantics.
- the layers 121-126 exemplified by 120 in Figure 2 can be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers.
- the pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain a smaller size image.
- the average pooling operator can calculate the average value of the pixel values in the image within a certain range.
- the max pooling operator can take the pixel with the largest value within a specific range as the result of max pooling. Also, just as the size of the weight matrix used in the convolutional layer should be related to the size of the image, the operators in the pooling layer should also be related to the size of the image.
- the size of the output image after processing by the pooling layer can be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
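- For illustration only (the operator size and values are arbitrary examples, not taken from this application), non-overlapping average and max pooling can be sketched as:

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping 2D pooling; each output pixel summarizes one size x size sub-region."""
    h, w = x.shape[0] // size, x.shape[1] // size
    blocks = x[:h * size, :w * size].reshape(h, size, w, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(x, 2, "max"))   # each output value is the maximum of a 2x2 region
print(pool2d(x, 2, "mean"))  # each output value is the average of a 2x2 region
```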
- after being processed by the convolutional layer/pooling layer 120, the convolutional neural network 100 is not yet sufficient to output the required output information, because, as mentioned before, the convolutional layer/pooling layer 120 only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (the required class information or other related information), the convolutional neural network 100 needs to use the neural network layer 130 to generate one output or a set of outputs of the required number of classes. Therefore, the neural network layer 130 may include multiple hidden layers (131, 132 to 13n as shown in FIG. 2) and the output layer 140, and the parameters contained in the multiple hidden layers may be obtained by pre-training based on relevant training data of specific task types; for example, the task types can include multi-track separation, image recognition, image classification, image super-resolution reconstruction and so on.
- after the multiple hidden layers in the neural network layer 130, the last layer of the entire convolutional neural network 100 is the output layer 140. The output layer 140 has a loss function similar to the categorical cross entropy, which is specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network 100 (as shown in FIG. 2, the propagation from 110 to 140 is forward propagation) is completed, the back propagation (as shown in FIG. 2, the propagation from 140 to 110 is back propagation) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 100 and the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.
- the convolutional neural network 100 shown in FIG. 2 is only used as an example of a convolutional neural network.
- the convolutional neural network may also exist in the form of other network models, for example, a network in which multiple convolutional layers/pooling layers are in parallel, as shown in FIG. 3, and the separately extracted features are all input to the neural network layer 130 for processing.
- FIG. 4 is a hardware structure of a chip according to an embodiment of the present invention, where the chip includes a neural network processor 40 .
- the chip can be set in the execution device 110 as shown in FIG. 1 to complete the calculation work of the calculation module 111 .
- the chip can also be set in the training device 120 as shown in FIG. 1 to complete the training work of the training device 120 and output the target model/rule 101 .
- the algorithms of each layer in the convolutional neural network shown in Figure 2 can be implemented in the chip shown in Figure 4.
- the neural network processor 40 may be a neural-network processing unit (NPU), a tensor processing unit (TPU), or a graphics processor (graphics processing unit, GPU), etc., all suitable for large-scale applications.
- NPU is mounted on the main central processing unit (CPU) (host CPU) as a co-processor, and tasks are allocated by the main CPU.
- the core part of the NPU is the operation circuit 403, and the controller 404 controls the operation circuit 403 to extract the data in the memory (weight memory or input memory) and perform operations.
- the arithmetic circuit 403 includes multiple processing units (process engines, PEs). In some implementations, arithmetic circuit 403 is a two-dimensional systolic array. The arithmetic circuit 403 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 403 is a general-purpose matrix processor.
- the operation circuit fetches the data corresponding to the matrix B from the weight memory 402 and buffers it on each PE in the operation circuit.
- the arithmetic circuit fetches the data of matrix A and matrix B from the input memory 401 to perform matrix operation, and stores the partial result or final result of the obtained matrix in the accumulator 408 .
- the vector calculation unit 407 can further process the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, and the like.
- the vector calculation unit 407 can be used for network calculation of non-convolutional/non-FC layers in the neural network, such as pooling (Pooling), batch normalization (Batch Normalization), local response normalization (Local Response Normalization), etc. .
- the vector computation unit 407 can store the processed output vectors to the unified buffer 406 .
- the vector calculation unit 407 may apply a nonlinear function to the output of the arithmetic circuit 403, such as a vector of accumulated values, to generate activation values.
- vector computation unit 407 generates normalized values, merged values, or both.
- the vector of processed outputs can be used as an activation input to the arithmetic circuit 403, such as for use in subsequent layers in a neural network.
- Unified memory 406 is used to store input data and output data.
- the direct memory access controller (DMAC) 405 transfers the input data in the external memory to the input memory 401 and/or the unified memory 406, stores the weight data in the external memory into the weight memory 402, and stores the data in the unified memory 406 into the external memory.
- the bus interface unit (bus interface unit, BIU) 410 is used to realize the interaction between the main CPU, the DMAC and the instruction fetch memory 409 through the bus.
- An instruction fetch buffer 409 connected to the controller 404 is used to store the instructions used by the controller 404.
- the controller 404 is used for invoking the instructions cached in the instruction fetch memory 409 to control the working process of the operation accelerator.
- the unified memory 406, the input memory 401, the weight memory 402 and the instruction fetch memory 409 are all on-chip (On-Chip) memories, and the external memory is the memory outside the NPU, and the external memory can be double data rate synchronous dynamic random access Memory (double data rate synchronous dynamic random access memory, DDR SDRAM), high bandwidth memory (high bandwidth memory, HBM) or other readable and writable memory.
- each layer in the convolutional neural network shown in FIG. 2 or FIG. 3 may be performed by the operation circuit 403 or the vector calculation unit 407 .
- the training method of the separation network is introduced in detail with reference to FIG. 5 .
- the method shown in FIG. 5 can be performed by a training apparatus of the separation network; the training apparatus can be a cloud service device or a terminal device, for example, a computer, a server, or another device whose computing power is sufficient to perform the training of the separation network, and it may also be a system composed of a cloud service device and a terminal device.
- the training method may be performed by the training device 120 in FIG. 1 and the neural network processor 40 in FIG. 4 .
- the training method may be processed by the CPU, or may be jointly processed by the CPU and the GPU, or other processors suitable for neural network computation may be used without using the GPU, which is not limited in this application.
- a separation network can be used to separate the original audio track to obtain at least one single-object audio track.
- the original audio track in the multimedia file corresponds to only one sounding object, the original audio track is a single-object audio track, and a separation network does not need to be used for separation.
- the training method may include steps 501 and 502 . Steps 501 and 502 are described in detail below.
- Step 501 Acquire training data.
- the training data in the embodiment of the present application is obtained by synthesizing at least the initial third single-object audio track and the initial fourth single-object audio track; it can also be understood that the training data includes a multi-audio track synthesized from the single-object audio tracks corresponding to at least two sounding objects.
- the initial third single-object audio track corresponds to the third vocal object
- the initial fourth single-object audio track corresponds to the fourth vocal object.
- the training data may also include images matching the original audio track; the training data may also be a multimedia file, where the multimedia file includes the above-mentioned multiple audio tracks, and the multimedia file may include a video track or a text track (or a bullet screen track) in addition to the audio track, which is not specifically limited here.
- the audio tracks (original audio tracks, first single-object audio tracks, etc.) in the embodiments of the present application may include vocal tracks, musical instrument tracks (e.g., drum tracks, piano tracks, trumpet tracks, etc.), airplane-sound tracks, or audio tracks generated by other objects; the specific sounding object corresponding to an audio track is not limited here.
- the training data may be obtained by directly recording the sound of the sounding object, or by the user inputting audio information and video information, or by receiving the transmission from the acquisition device.
- the specific way of obtaining training data is not limited here.
- Step 502 take the training data as the input of the separation network, train the separation network with the value of the loss function less than the first threshold as the target, and obtain the trained separation network.
- the separation network in the embodiments of the present application may be called a separation neural network, a separation model, a separation neural network model, or the like, which is not specifically limited here.
- the loss function is used to indicate the difference between the separately obtained third single-object audio track and the original third single-object audio track.
- the separation network is trained with the goal of reducing the value of the loss function, that is, the difference between the third single-object audio track output by the separation network and the initial third single-object audio track is continuously reduced.
- This training process can be understood as a separation task.
- the loss function can be understood as the loss function corresponding to the separation task.
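- A schematic training loop consistent with the description of step 502 might look as follows; the network architecture, the L1 loss, the threshold value, and the synthetic tracks are placeholders assumed for illustration, not this application's implementation:

```python
import torch
import torch.nn as nn

# Placeholder separation model: maps a mixed waveform to one single-object waveform.
model = nn.Sequential(nn.Conv1d(1, 16, 9, padding=4), nn.ReLU(), nn.Conv1d(16, 1, 9, padding=4))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.L1Loss()
first_threshold = 0.05  # the "first threshold" on the loss value (example value)

for step in range(1000):
    third_track = torch.randn(8, 1, 16000)        # initial third single-object track (ground truth)
    fourth_track = torch.randn(8, 1, 16000)       # initial fourth single-object track
    training_track = third_track + fourth_track   # synthesized training audio track

    separated_third = model(training_track)       # separation network output
    loss = loss_fn(separated_third, third_track)  # difference vs. the initial third track

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if loss.item() < first_threshold:             # train until the loss value is below the threshold
        break
```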
- the output (at least one single-object audio track) of the separation network is a single-object audio track corresponding to at least one sound-emitting object in the input (audio track).
- the third sound-emitting object belongs to the same type as the first sound-emitting object, and the second sound-emitting object belongs to the same type as the fourth sound-emitting object.
- the first vocal object and the third vocal object are both human voices, but the first vocal object may be user A, and the third vocal object may be user B.
- the third single-object audio track and the first single-object audio track are audio tracks corresponding to sounds uttered by different people.
- the third sound-emitting object and the first sound-emitting object in this embodiment of the present application may be two sound-emitting objects of the same type, or may be one sound-emitting object of the same type, which is not specifically limited here.
- the training data input into the separation network includes the original audio tracks corresponding to at least two sounding objects, and the separation network can output a single-object audio track corresponding to a certain sounding object among the at least two sounding objects, and can also output single-object audio tracks respectively corresponding to the at least two sounding objects.
- the multimedia file includes a sound track corresponding to a human voice, a sound track corresponding to a piano, and a sound track corresponding to a car sound.
- in this case, the separation network may output one single-object audio track (for example, the single-object audio track corresponding to the human voice), two single-object audio tracks (for example, the single-object audio track corresponding to the human voice and the single-object audio track corresponding to the car sound), or three single-object audio tracks.
- the separation network is shown in Figure 6, and the separation network includes one-dimensional convolutions and a residual structure, where adding the residual structure can improve the gradient transfer efficiency; of course, the separation network may also include activation layers, pooling layers, etc.
- the specific structure of the separation network is not limited here.
- the separation network shown in Figure 6 takes the signal source (that is, the signal corresponding to the audio track in the multimedia file) as the input, transforms it through multiple convolutions and deconvolutions, and outputs the object signal (a single-object audio track corresponding to one sounding object).
- the time series correlation can also be improved by adding a recurrent neural network module, and the connection between high and low dimensional features can be improved by connecting different output layers.
- the separation network is shown in Figure 7.
- the signal source can be preprocessed; for example, STFT mapping is performed on the signal source to obtain the time spectrum.
- the amplitude spectrum in the time spectrum is transformed through two-dimensional convolutions and deconvolutions to obtain a masked spectrum (the screened spectrum), and the masked spectrum and the amplitude spectrum are combined to obtain the target amplitude spectrum; an inverse short-time Fourier transform (iSTFT) is then performed to obtain the output object signal.
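- As a rough sketch of the time-spectrum pipeline described above (the mask here is a toy placeholder rather than the output of a trained separation network, and the signal and parameter values are assumptions):

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
mixture = np.random.randn(fs * 2)               # 2 s of mixed audio (placeholder signal)

f, t, Z = stft(mixture, fs=fs, nperseg=512)     # time spectrum (complex STFT)
magnitude, phase = np.abs(Z), np.angle(Z)

mask = (magnitude > np.median(magnitude)).astype(float)  # placeholder for the network's masked spectrum
target_magnitude = mask * magnitude                       # target amplitude spectrum

_, separated = istft(target_magnitude * np.exp(1j * phase), fs=fs, nperseg=512)  # back to the time domain
```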
- the connection between high and low-dimensional features can also be improved by connecting different output layers
- the gradient transfer efficiency can be improved by adding a residual structure
- the time series correlation can be improved by adding a recurrent neural network module.
- the input in FIG. 6 can also be understood as a one-dimensional time domain signal, and the input in FIG. 7 is a two-dimensional time spectrum signal.
- the above two separation models are just examples, and in practical applications, there are other possible structures.
- the input of the separation model can be a time-domain signal
- the output can be a time-domain signal
- the input of the separation model can be a time-frequency domain signal
- the output is a time-frequency domain signal, etc.
- the structure, input or output of the separation model are not detailed here. limited.
- the multi-track in the multimedia file can also be identified through the identification network to recognize the number of audio tracks and the object categories (for example, vocals, drum sounds, etc.) included in the multi-track, which can reduce the training time of the separation network.
- the separation network may also include an identification sub-network for identifying multiple audio tracks, which is not specifically limited here.
- the input of the recognition network can be a time domain signal
- the output is a class probability. It is equivalent to inputting the time domain signal into the recognition network, obtaining the probability that the object is a certain category, and selecting the category whose probability exceeds the threshold as the category of classification.
- the object here can also be understood as a sounding object.
- the input in the above-mentioned identification network is a multimedia file synthesized by audios corresponding to vehicle A and vehicle B, and the multimedia file is input into the identification network, and the identification network can output the category of vehicle.
- the recognition network can also identify the type of specific car, which is equivalent to further fine-grained recognition.
- the identification network is set according to actual needs, which is not limited here.
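- A toy sketch of the thresholded classification described for the recognition network; the class names, scores, and threshold are made-up examples rather than values from this application:

```python
import numpy as np

classes = ["vocals", "drums", "piano", "car", "aircraft"]
logits = np.array([3.1, 0.2, -1.0, 2.5, -0.3])  # would come from the recognition network

probs = np.exp(logits - logits.max())
probs /= probs.sum()                             # softmax over class scores

threshold = 0.2
detected = [c for c, p in zip(classes, probs) if p > threshold]  # keep classes above the threshold
print(detected)
```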
- the training process may not adopt the aforementioned training method but adopt other training methods, which is not limited here.
- the system architecture includes input module, function module, database module and output module. Each module is described in detail below:
- the input module includes a database option sub-module, a sensor information acquisition sub-module, a user interface input sub-module and a file input sub-module.
- the above four sub-modules can also be understood as four ways of input.
- the database options submodule is used for spatial rendering according to the rendering method stored in the database selected by the user.
- the sensor information acquisition sub-module is used to determine the spatial position of a specific sounding object according to the sensor (which may be a sensor in the rendering device, or another sensor device, which is not limited here); for example, the user can use the sensor to select the position of a specific sounding object.
- the user interface input sub-module is used to determine the spatial position of the specific sounding object in response to the user's operation on the user interface.
- the user can control the spatial position of the specific sounding object by clicking, dragging and so on.
- the file input sub-module is used to track the specific sounding object according to the image information or text information (for example: lyrics, subtitles, etc.), and then determine the spatial position of the specific sounding object according to the tracked position of the specific sounding object.
- the functional modules include a signal transmission submodule, an object identification submodule, a calibration submodule, an object tracking submodule, an orientation calculation submodule, an object separation submodule, and a rendering submodule.
- the signal transmission sub-module is used for receiving and sending information. Specifically, it may receive input information from the input module, and output feedback information to other modules.
- the feedback information includes information such as position transformation information of a specific sounding object, a separated single-object audio track, and the like.
- the signal transmission submodule may also be used to feed back the identified object information to the user through a user interface (user interface, UI), etc., which is not specifically limited here.
- the object recognition sub-module is used to identify all the object information of the multi-track information sent by the input module to the signal transmission sub-module.
- the object here refers to the sounding object (or called the sound-emitting object), such as a human voice, a drum sound, an aircraft sound, and so on.
- the object identification sub-module may be the identification network described in the embodiment shown in FIG. 5 or the identification sub-network in the separation network.
- the calibration sub-module is used to calibrate the initial state of the playback device.
- the calibration sub-module is used for earphone calibration
- the calibration sub-module is used for external device calibration.
- the initial state of the sensor (the relationship between the sensor device and the playback device will be described in FIG. 9) can be assumed by default to be facing the front, and subsequent corrections are made relative to the front; it is also possible to obtain the actual position at which the user places the sensor to ensure that the front of the sound image is directly in front of the earphone.
- the object tracking sub-module is used to track the motion trajectory of a specific sounding object.
- the specific sounding object may be a sounding object in a text or image displayed in a multimodal file (for example, audio information and the video information corresponding to the audio information, or audio information and the text information corresponding to the audio information, etc.).
- the object tracking sub-module may also be used to render motion trajectories on the audio side.
- the object tracking sub-module may further include a target recognition network and a depth estimation network; the target recognition network is used to identify the specific vocal object to be tracked, and the depth estimation network is used to obtain the relative coordinates of the specific vocal object in the image (this will be described in detail in subsequent embodiments), so that the object tracking sub-module renders the azimuth and motion trajectory of the audio corresponding to the specific sounding object according to the relative coordinates.
- the orientation calculation sub-module is used to convert the information obtained by the input module (for example: sensor information, input information of the UI interface, file information, etc.) into orientation information (also referred to as sound source position).
- the object separation sub-module is used to separate the multimedia file (or called multimedia information) or multi-track information into at least one single-object track. For example: extracting individual vocal tracks (i.e. vocal-only audio files) from a song.
- the object separation sub-module may be the separation network in the embodiment shown in FIG. 5 above. Further, the structure of the object separation sub-module may be as shown in FIG. 6 or FIG. 7 , which is not specifically limited here.
- the rendering submodule is used to obtain the sound source position obtained by the orientation calculation submodule, and perform spatial rendering based on the sound source position. Further, the corresponding rendering method may be determined according to the playback device selected by the input information of the UI in the input module; different playback devices have different rendering methods, and the rendering process will be described in detail in subsequent embodiments.
- the database module includes a database selection submodule, a rendering rule editing submodule, and a rendering rule sharing submodule.
- the database selection submodule is used to store rendering rules.
- the rendering rule may be a rendering rule for converting a default two-channel/multi-channel audio track into a three-dimensional (three dimensional, 3D) spatial sense when the system is initialized, or a rendering rule saved by the user.
- different objects may correspond to the same rendering rule, or different objects may correspond to different rendering rules.
- the rendering rule editing submodule is used to re-edit the saved rendering rules.
- the saved rendering rule may be a rendering rule stored in the database selection sub-module, or may be a newly input rendering rule, which is not specifically limited here.
- the rendering rules sharing submodule is used to upload rendering rules to the cloud, and/or to download specific rendering rules from the rendering rules database in the cloud.
- the rendering rule sharing module can upload user-defined rendering rules to the cloud and share them with other users. Users can select the rendering rules shared by other users that match the multi-track information to be played from the rendering rules database stored in the cloud, and download them to the database on the terminal side as the data files for the audio 3D rendering rules.
- the output module is used to play the rendered single-object audio track or the target audio track (obtained from the original audio track and the rendered single-object audio track) through the playback device.
- the application scenario includes a control device 901 , a sensor device 902 and a playback device 903 .
- the playback device 903 in this embodiment of the present application may be an external device, an earphone (such as an in-ear earphone or a headphone), or a large screen (such as a projection screen), etc., which is not specifically limited here.
- the connection between the control device 901 and the sensor device 902, and between the sensor device 902 and the playback device 903, can be a wired connection, a wireless fidelity (WIFI) connection, a mobile data network connection, or another connection method, which is not specifically limited here.
- the control device 901 in this embodiment of the present application is a terminal device used to serve users; the terminal device may include a head-mounted display (HMD), which may be a combination of a virtual reality (VR) box and a terminal, a VR all-in-one machine, a personal computer (PC) VR, an augmented reality (AR) device, a mixed reality (MR) device, etc.
- the terminal device may also include a cellular phone, a smart phone, a personal digital assistant (PDA), a tablet computer, a laptop computer, a personal computer (PC), a vehicle-mounted terminal, etc.
- the sensor device 902 in this embodiment of the present application is a device for sensing orientation and/or position, and may be a laser pointer, a mobile phone, a smart watch, a smart bracelet, a device with an inertial measurement unit (IMU), a device with a simultaneous localization and mapping (SLAM) sensor, etc., which is not limited here.
- The playback device 903 in this embodiment of the present application is a device used to play audio or video, and may be an external playback device (for example, a sound box or a terminal device with the function of playing audio or video) or an internal playback device (for example, in-ear headphones, headsets, AR devices, VR devices, etc.), which is not limited here.
- each device in the application scenario shown in FIG. 9 may be one or more, for example, there may be multiple external devices, and the number of each device is not specifically limited here.
- control device, sensor device, and playback device in the embodiments of the present application may be three devices, two devices, or one device, which is not specifically limited here.
- control device and the sensor device in the application scenario shown in FIG. 9 are the same device.
- the control device and the sensor device are the same mobile phone, and the playback device is the headset.
- the control device and the sensor device are the same mobile phone, and the playback device is an external device (also called an external device system, and the external device system includes one or more external devices).
- control device and the playback device in the application scenario shown in FIG. 9 are the same device.
- the control device and the playback device are the same computer.
- the control device and the playback device are the same large screen.
- control device, the sensor device, and the playback device in the application scenario shown in FIG. 9 are the same device.
- the control device, sensor device, and playback device are the same tablet.
- an embodiment of the rendering method provided by the embodiment of the present application may be executed by a rendering device, or may be executed by a component of the rendering device (for example, a processor, a chip, or a chip system, etc.).
- the embodiment includes: Steps 1001 to 1004.
- the rendering device may have the function of the control device, the function of the sensor device, and/or the function of the playback device as shown in FIG. 9 , which is not specifically limited here.
- the rendering method is described below by taking the rendering device as a control device (such as a notebook), the sensor device as a device with an IMU (such as a mobile phone), and the playback device as an external device (such as a speaker).
- the sensor described in the embodiments of the present application may refer to a sensor in a rendering device, or may refer to a sensor in a device other than the rendering device (eg, the aforementioned sensor device), which is not specifically limited here.
- Step 1001: Calibrate the playback device. This step is optional.
- the playback device may be calibrated before the playback device plays the rendered audio track, and the purpose of the calibration is to improve the authenticity of the spatial effect of the rendered audio track.
- a method for calibrating a playback device includes steps 1 to 5 .
- the mobile phone held by the user establishes a connection with the external playback device.
- the connection method is similar to the connection between the sensor device and the playback device in the embodiment shown in FIG. 9 , and details are not described here.
- Step 1 Determine the playback device type.
- The rendering device may determine the playback device type through a user's operation, may adaptively detect the playback device type, may determine the playback device type by default settings, or may determine the playback device type in other ways, which is not specifically limited here.
- the rendering device may display an interface as shown in FIG. 12 , where the interface includes an icon for selecting a playback device type.
- the interface may also include an icon for selecting an input file, an icon for selecting a rendering method (ie, reference information option), a calibration icon, a sound hunter icon, an object bar, volume, time progress, and a sphere view (or 3D view).
- the user can click on the “Select Playback Device Type Icon” 101 .
- The rendering device displays a drop-down menu, and the drop-down menu may include an "external playback device option" and a "headphone option". Further, the user can click the "external playback device option" 102 to determine that the type of the playback device is an external playback device. As shown in FIG. 15, in the interface displayed by the rendering device, "select the playback device type" may be replaced by "external playback device" to prompt the user that the current playback device type is the external playback device. It can also be understood that the rendering device displays the interface shown in FIG. 12, receives the fifth operation of the user (that is, the click operation shown in FIG. 13 and FIG. 14), and, in response to the fifth operation, selects the external playback device as the playback device type from the playback device type options.
- This method is for calibrating the playback device. As shown in FIG. 16, the user can also click the "calibration icon" 103; as shown in FIG. 17, the rendering device responds to the click operation and displays a drop-down menu, which may include a "default option" and a "manual calibration option". Further, the user can click the "manual calibration option" 104 to determine that the calibration method is manual calibration, where manual calibration can be understood as the user calibrating the playback device using a mobile phone (that is, a sensor device).
- The above takes, as an example, the case in which the drop-down menu of the "select playback device type icon" includes the "external playback device option" and the "headphone option". In practical applications, the drop-down menu may also include specific types of headphones, such as headset, in-ear earphone, wired earphone, and Bluetooth earphone options, which are not limited here.
- Similarly, the above takes, as an example, the case in which the drop-down menu of the "calibration icon" includes the "default option" and the "manual calibration option". In practical applications, the drop-down menu may also include other types of options, which are not limited here.
- Step 2. Determine the test audio.
- The test audio in this embodiment of the present application may be a test signal set by default (for example, pink noise), may be the single-object audio track corresponding to the human voice separated from a song (that is, the multimedia file is a song) through the separation network in the above-mentioned embodiment shown in FIG. 5, may be audio corresponding to another single-object audio track in the song, or may be audio that includes only a single-object audio track, etc., which is not specifically limited here.
- the user may click the "select input file icon" on the interface displayed by the rendering device to select the test audio.
- Step 3 Obtain the attitude angle of the mobile phone and the distance between the sensor and the external device.
- the external device plays the test audio in sequence, and the user holds the sensor device (eg, mobile phone) to point to the external device that is playing the test audio.
- After the mobile phone is held steady, the current orientation of the mobile phone and the received signal energy of the test audio are recorded, and the distance between the mobile phone and the external playback device is calculated according to the following Formula 1.
- The stability of the mobile phone orientation can be understood as follows: within a period of time (for example, 200 milliseconds), the variance of the mobile phone orientation is less than a threshold (for example, 5 degrees).
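- As an illustrative sketch only (the 200 millisecond window and the variance threshold follow the example values above, while the sampling rate of the orientation samples is an assumed parameter), the stability judgment could be implemented as follows:

```python
import numpy as np

def is_orientation_stable(orientation_deg, window_ms=200, sample_rate_hz=100, threshold=5.0):
    """Return True if the phone orientation is considered stable.

    orientation_deg: per-sample attitude angles in degrees, most recent last.
    The window length and threshold mirror the example values in the text;
    the sampling rate is an assumption for illustration.
    """
    window = int(window_ms * sample_rate_hz / 1000)
    if len(orientation_deg) < window:
        return False  # not enough samples observed yet
    recent = np.asarray(orientation_deg[-window:], dtype=float)
    return float(np.var(recent)) < threshold
```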
- the first external playback device plays the test audio first, and the user holds the mobile phone and points to the first external playback device. After the calibration of the first external device is completed, the user holds the mobile phone and points to the second external device for calibration.
- The orientation of the mobile phone in the embodiment of the present application may refer to the attitude angle of the mobile phone, and the attitude angle may include an azimuth angle and a tilt angle (or an inclination angle), or the attitude angle may include an azimuth angle, a tilt angle, and a pitch angle.
- the azimuth angle represents the angle around the z-axis
- the tilt angle represents the angle around the y-axis
- the pitch angle represents the angle around the x-axis.
- the relationship between the orientation of the mobile phone and the x-axis, y-axis, and z-axis can be shown in Figure 18.
- For example, if the playback devices are two external playback devices, the first external playback device plays the test audio first, the user holds the mobile phone and points it toward the first external playback device, and the current orientation of the mobile phone and the signal energy of the received test audio are recorded.
- the second external playback device plays the test audio, and the user holds the mobile phone and points to the second speaker to record the current orientation of the mobile phone and the signal energy of the test audio received.
- During the calibration process, the rendering device can display the interface shown in FIG. 19, where the right side of the interface is a spherical view, and the external playback device that has been calibrated and the external playback device that is being calibrated can be displayed in the spherical view.
- an uncalibrated external device (not shown in the figure) may also be displayed, which is not specifically limited here.
- The center of the spherical view is the position of the user (it can also be understood as the position where the user holds the mobile phone, since the position of the mobile phone is similar to the position of the user), and the radius may be the distance between the position of the user (or the position of the mobile phone) and the external playback device, or may be a default value (for example, 1 meter), etc., which is not specifically limited here.
- FIG. 20 shows an example effect diagram of a user holding a mobile phone and facing an external playback device.
- the number of external playback devices is N, and N is a positive integer.
- The i-th external playback device refers to a certain external playback device among the N external playback devices, where i is a positive integer and i ≤ N.
- the formulas in the embodiments of the present application are all calculated by taking the i-th external device as an example, and the calculation of other external devices is similar to the calculation of the i-th external device.
- Formula 1 used to calibrate the i-th external playback device can be described as follows:
- x(t) is the energy of the test signal received by the mobile phone at time t, X(t) is the energy of the test signal played by the external playback device at time t, and t is a positive number.
- ri is the distance between the mobile phone and the i-th external playback device; since the user holds the mobile phone, it can also be understood as the distance between the user and the i-th external playback device.
- rs is the normalized distance, which can be understood as a coefficient used to convert the ratio of x(t) to X(t) into a distance. The coefficient can be set according to the actual external playback device, and the specific value of rs is not limited here.
- In this way, the test signals are played in sequence, the user points the mobile phone toward each external playback device in turn, and the distance to each external playback device is obtained by Formula 1.
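- The exact form of Formula 1 is not reproduced above; as a minimal sketch, assuming a free-field inverse-square attenuation model so that the played/received energy ratio grows with the square of the distance, the calibration of one device could look as follows (the function name and the attenuation model are assumptions):

```python
import math

def estimate_distance(x_received, x_played, r_s=1.0):
    """Estimate the distance r_i between the mobile phone and the i-th external
    playback device from the energy X(t) played and the energy x(t) received.

    r_s is the normalization coefficient described in the text, converting the
    energy ratio into a distance; an inverse-square attenuation model is assumed
    here, so the actual Formula 1 may differ.
    """
    return r_s * math.sqrt(x_played / x_received)
```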
- Step 4 Determine the position information of the external device based on the attitude angle and the distance.
- In step 3, the mobile phone has recorded the attitude angle of the mobile phone toward each external playback device and calculated the distance between the mobile phone and each external playback device by Formula 1.
- Alternatively, the mobile phone can send the measured attitude angles and received signal energies to the rendering device, and the rendering device calculates the distance between the mobile phone and each external playback device through Formula 1, which is not specifically limited here.
- After the rendering device obtains the attitude angle of the mobile phone and the distance between the mobile phone and each external playback device, it can convert the attitude angle and the distance into the position information of the external playback device in the spherical coordinate system by Formula 2, where the position information includes the azimuth angle, the inclination angle, and the distance (that is, the distance between the sensor device and the playback device).
- In Formula 2, θ(t) is the azimuth angle of the i-th external playback device in the spherical coordinate system at time t, φ(t) is the inclination angle of the i-th external playback device in the spherical coordinate system at time t, and d(t) is the distance between the mobile phone and the i-th external playback device.
- The azimuth angle of the mobile phone at time t is the rotation angle of the mobile phone around the z-axis, the pitch angle of the mobile phone at time t is the rotation angle of the mobile phone around the x-axis, and ri is the distance calculated by Formula 1.
- sign represents a positive or negative value: if the pitch angle of the mobile phone is positive, sign is positive; if the pitch angle is negative, sign is negative. The %360 operation is used to adjust the angle range to 0°-360°.
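- Formula 2 itself is not reproduced above; the following sketch is one plausible reading of the variable descriptions (the wrapping of the azimuth to 0°-360° and the use of the calibrated distance are taken from the text, while the direct pass-through of the pitch angle as the inclination is an assumption):

```python
def device_position_from_attitude(azimuth_deg, pitch_deg, r_i):
    """Convert the mobile phone attitude angle and the calibrated distance into
    the position of the i-th external playback device in spherical coordinates.

    azimuth_deg: rotation of the phone around the z-axis at time t
    pitch_deg:   rotation of the phone around the x-axis at time t
    r_i:         distance obtained from Formula 1
    Returns (azimuth, inclination, distance) of the external playback device.
    """
    theta = azimuth_deg % 360  # adjust the angle range to 0-360 degrees
    phi = pitch_deg            # sign of the pitch angle carries over to the inclination
    return theta, phi, r_i
```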
- the rendering device may display an interface as shown in FIG. 21 , the interface displays a “calibrated icon”, and the position of the calibrated playback device may be displayed in the right spherical view.
- In this way, the problem of calibrating irregularly placed external playback devices can be solved, allowing the user to obtain the spatial position of each external playback device in subsequent operations, so as to accurately render the position required for the single-object audio track and improve the authenticity of the spatial effect of the rendered audio track.
- Step 1002 Acquire a first single-object audio track based on the multimedia file.
- The rendering device may obtain the multimedia file by directly recording the sound of the first sound-emitting object, or the multimedia file may be sent by another device, for example, received from a collection device (for example, a camera, a tape recorder, a mobile phone, etc.).
- the specific obtaining methods of multimedia files are not limited here.
- the multimedia file in this embodiment of the present application may specifically be audio information, such as stereo audio information, multi-channel audio information, and the like.
- the multimedia file may also be multi-modal information, for example, the multi-modal information is video information, image information corresponding to audio information, text information, and the like.
- the multimedia file may include, in addition to the audio track, a video track or a text track (or called a bullet screen track), etc., which is not specifically limited here.
- the multimedia file may include a first single-object audio track, or an original audio track, where the original audio track is synthesized by at least two single-object audio tracks, which is not specifically limited here.
- the original audio track may be a single audio track or multiple audio tracks, which is not specifically limited here.
- The original audio track can include vocal tracks, musical instrument tracks (for example, drum tracks, piano tracks, trumpet tracks, etc.), airplane sounds, and other audio tracks generated by sounding objects (also called sound-emitting objects).
- the specific type of the sounding object is not limited here.
- Depending on the audio track contained in the multimedia file, the processing method of this step may differ, as described below:
- the audio track in the multimedia file is a single-object audio track.
- the rendering device can directly acquire the first single-object audio track from the multimedia file.
- the audio track in the multimedia file is a multi-object audio track.
- the original audio track in the multimedia file corresponds to multiple sounding objects.
- The original audio track also corresponds to the second sound-emitting object. That is, the original audio track is obtained by synthesizing at least the first single-object audio track and the second single-object audio track, where the first single-object audio track corresponds to the first sounding object and the second single-object audio track corresponds to the second sounding object.
- The rendering device can separate only the first single-object audio track from the original audio track, or can separate both the first single-object audio track and the second single-object audio track from the original audio track, which is not specifically limited here.
- the rendering device may separate the first single-object audio track from the original audio track through the separation network in the aforementioned embodiment shown in FIG. 5 .
- the rendering device can also separate the first single-object audio track and the second single-object audio track from the original audio track through the separation network, which is not limited here. Reference may be made to the description in the embodiment shown in FIG. 5 , which will not be repeated here.
- the rendering device can identify the sound-emitting object of the original audio track in the multimedia file through the identification network or the separation network.
- the sound-emitting object contained in the original sound track includes the first sound-emitting object and the second sound-emitting object.
- The rendering device may randomly select one of the sound-emitting objects as the first sound-emitting object, or may determine the first sound-emitting object through the user's selection.
- After the rendering device determines the first sound-emitting object, the first single-object audio track can be acquired through the separation network.
- the rendering device may first obtain the sound-emitting object through the identification network, and then obtain the single-object audio track of the sound-emitting object through the separation network.
- the sounding object included in the multimedia file and the single-object audio track corresponding to the sounding object may also be obtained directly through the identification network and/or the separation network, which is not specifically limited here.
- the rendering device may display the interface shown in FIG. 21 or the interface shown in FIG. 22 .
- the user can select a multimedia file by clicking the "select input file icon" 105, and the multimedia file here is "Dream it possible.wav” as an example.
- The rendering device receives the fourth operation of the user, and in response to the fourth operation, the rendering device selects "Dream it possible.wav" (that is, the target file) as the multimedia file from at least one multimedia file stored in the storage area.
- the storage area may be a storage area in the rendering device, or may be a storage area in an external device (such as a U disk, etc.), which is not specifically limited here.
- the rendering device can display the interface as shown in Figure 23. In this interface, "select input file” can be replaced by "Dream it possible.wav” to prompt the user that the current multimedia file is: Dream it possible.wav.
- the rendering device may use the identification network and/or the separation network in the embodiment shown in FIG. 4 to identify the sounding objects in "Dream it possible.wav” and separate the single-object audio track corresponding to each sounding object.
- the rendering device recognizes that the sound-emitting objects included in "Dream it possible.wav” are people, pianos, violins, and guitars.
- The interface displayed by the rendering device may also include an object bar, and icons such as a "voice icon", a "piano icon", a "violin icon", and a "guitar icon" may be displayed in the object bar, for the user to select the sounding object to be rendered.
- a “coupling icon” may also be displayed in the object bar, and the user can stop the selection of the sounding object by clicking the "coupling icon”.
- the user can click on the “voice icon” 106 to determine that the audio track to be rendered is a single-object audio track corresponding to a human voice.
- It can also be understood that the rendering device recognizes "Dream it possible.wav" and displays the interface shown in FIG. 24; the rendering device receives the user's click operation and, in response to the click operation, selects the first icon (that is, the "voice icon" 106) from the interface, thereby determining that the first single-object audio track is the human voice.
- the playback device types shown in Figures 22 to 24 are only examples of external playback devices.
- In practical applications, the user may also select headphones as the playback device type; the case in which the playback device type selected by the user during calibration is the external playback device is merely taken as an example for schematic illustration.
- the rendering device can also copy one or several single-object audio tracks in the original audio tracks.
- For example, the user can also copy the "voice icon" in the object bar to obtain a "voice 2 icon", and the single-object audio track corresponding to vocal 2 is the same as the single-object audio track corresponding to the vocal.
- the method of copying may be that the user double-clicks the "voice icon”, or double-clicks the human voice on the ball view, which is not specifically limited here.
- After the user copies and obtains the "voice 2 icon", by default the user releases control of the vocal and starts to control vocal 2.
- the position of the first sound source of the vocal can also be displayed in the spherical view.
- the user can not only copy the vocal object, but also delete the vocal object.
- Step 1003 Determine the position of the first sound source of the first sound-emitting object based on the reference information.
- The sound source position of one sound-emitting object may be determined based on the reference information, or the sound source positions corresponding to multiple sound-emitting objects may be determined, which is not specifically limited here.
- After the rendering device determines that the first sounding object is a human voice, the rendering device can display the interface shown in FIG. 26, and the user can click the "select rendering mode icon" 107 to select reference information, where the reference information is used to determine the position of the first sound source of the first sound-emitting object.
- the rendering device may display a pull-down menu, and the pull-down menu may include "automatic rendering options" and "interactive rendering options".
- the "interactive rendering option” corresponds to the reference position information
- the “automatic rendering option” corresponds to the media information.
- the rendering mode includes an automatic rendering mode and an interactive rendering mode
- the automatic rendering mode means that the rendering device automatically obtains the rendered first single-object audio track according to the media information in the multimedia file.
- the interactive rendering method refers to obtaining the rendered first single-object audio track through the interaction between the user and the rendering device.
- The rendering device can obtain the rendered first single-object audio track based on the preset method; or, when the interactive rendering mode is determined, the rendering device obtains the reference position information in response to the user's second operation, determines the first sound source position of the first sound-emitting object based on the reference position information, and renders the first single-object audio track based on the first sound source position to obtain the rendered first single-object audio track.
- Specifically, the preset method includes: acquiring media information of the multimedia file; determining the first sound source position of the first sound-emitting object based on the media information; and rendering the first single-object audio track based on the first sound source position to obtain the rendered first single-object audio track.
- The sound source positions in the embodiments of the present application may be fixed positions at a certain moment, or may be multiple positions (for example, a motion trajectory) within a certain period of time, which is not specifically limited here.
- the reference information includes reference location information.
- The reference position information in this embodiment of the present application is used to indicate the position of the sound source of the first sounding object, and the reference position information may be the first position information of the sensor device or the second position information selected by the user, which is not specifically limited here.
- the reference location information in the embodiments of the present application has various situations, which are described below:
- the reference position information is the first position information of the sensor device (hereinafter referred to as the sensor).
- the user may click the “interactive rendering option” 108 to determine that the rendering mode is interactive rendering.
- the rendering device may display a drop-down menu in response to the click operation, the drop-down menu may include "Orientation Control Options", “Position Control Options”, and "Interface Control Options".
- the first location information in the embodiment of the present application has various situations, which are described below:
- the first position information includes the first attitude angle of the sensor.
- the user can adjust the orientation of the handheld sensor device (such as a mobile phone) through a second operation (such as panning up, down, left, and right) to determine the position of the first sound source of the first single-object track.
- Specifically, the rendering device can receive the first attitude angle of the mobile phone and use the following Formula 3 to obtain the first sound source position of the first single-object audio track, where the first sound source position includes the azimuth angle, the inclination angle, and the distance between the external playback device and the mobile phone.
- the user can further determine the position of the second sound source of the second single-object audio track by adjusting the orientation of the handheld mobile phone.
- Specifically, the rendering device can receive the first attitude angle (including the azimuth angle and the inclination angle) of the mobile phone and use the following Formula 3 to obtain the second sound source position of the second single-object audio track, where the second sound source position includes the azimuth angle, the inclination angle, and the distance between the external playback device and the mobile phone.
- the rendering device may send reminder information to the user, where the reminder information is used to remind the user to connect the mobile phone and the rendering device.
- the mobile phone and the rendering device may also be the same mobile phone, and in this case, no reminder information needs to be sent.
- In Formula 3, θ(t) is the azimuth angle of the i-th external playback device in the spherical coordinate system at time t, φ(t) is the inclination angle of the i-th external playback device in the spherical coordinate system at time t, and d(t) is the distance between the mobile phone and the i-th external playback device at time t, which can be calculated by Formula 1 during calibration.
- The azimuth angle of the mobile phone at time t is the rotation angle of the mobile phone around the z-axis, and the tilt angle of the mobile phone at time t is the rotation angle of the mobile phone around the y-axis; these two components of the first attitude angle are the inputs of Formula 3.
- the rendering device may display a pull-down menu, and the pull-down menu may include “orientation control options”, “position control options” and “interface control options”.
- the user can click the "Orientation Control Option” 109, and then determine that the rendering mode is the orientation control in the interactive rendering.
- the rendering device can display the interface as shown in Figure 29. In this interface, "select rendering mode" can be replaced by "orientation control rendering” to prompt the user that the current rendering mode is the orientation control. At this time, the user can adjust the orientation of the mobile phone.
- The spherical view in the display interface of the rendering device can display a dotted line, which is used to indicate the current orientation of the mobile phone, so that the user can intuitively see the orientation of the mobile phone in the spherical view, thereby facilitating the user in determining the first sound source position.
- After the orientation of the mobile phone is stable (refer to the above explanation about the stability of the mobile phone orientation, which will not be repeated here), the current first attitude angle of the mobile phone is determined. Further, the first sound source position is obtained through the above Formula 3.
- the distance between the mobile phone and the external device obtained during calibration can be used as d(t) in the above formula 3.
- the user determines the first sound source position of the first sound-emitting object based on the first attitude angle, or it is understood as the first sound source position of the first single-object audio track.
- the rendering device may display an interface as shown in FIG. 31 , the spherical view of the interface includes the first position information of the sensor corresponding to the position of the first sound source (ie, the first position information of the mobile phone).
- the above example describes an example of determining the position of the first sound source.
- the user can also determine the position of the second sound source of the second single-object audio track.
- the user may determine that the second sounding object is a violin by clicking on the “violin icon” 110 .
- the rendering device monitors the attitude angle of the mobile phone, and determines the position of the second sound source by formula 3.
- the spherical view in the display interface of the rendering device can display the currently determined first sound source position of the first sound-emitting object (person) and the second sound source position of the second sound-emitting object (violin).
- the user can perform real-time or later dynamic rendering of the selected sound-emitting object through the orientation provided by the sensor (ie, the first attitude angle).
- the sensor is like a laser pointer, and the direction of the laser is the position of the sound source.
- the control can give specific spatial orientation and motion to the sounding object, realize the interactive creation between the user and the audio, and provide the user with a new experience.
- the first position information includes the second attitude angle and acceleration of the sensor.
- the user can control the position of the sensor device (eg, mobile phone) through the second operation, so as to determine the position of the first sound source.
- Specifically, the rendering device can receive the second attitude angle (including the azimuth angle, the inclination angle, and the pitch angle) and the acceleration of the mobile phone, and use the following Formula 4 and Formula 5 to obtain the first sound source position, where the first sound source position includes the azimuth angle, the inclination angle, and the distance between the external playback device and the mobile phone. That is, the second attitude angle and the acceleration of the mobile phone are first converted into the coordinates of the mobile phone in the space rectangular coordinate system by Formula 4, and the coordinates of the mobile phone in the space rectangular coordinate system are then converted into the coordinates of the mobile phone in the spherical coordinate system, that is, the first sound source position, by Formula 5.
- In Formula 4 and Formula 5, x(t), y(t), and z(t) are the position information of the mobile phone in the space rectangular coordinate system at time t, g is the acceleration of gravity, and a(t) is the acceleration of the mobile phone at time t.
- The azimuth angle of the mobile phone at time t is the rotation angle of the mobile phone around the z-axis, the pitch angle of the mobile phone at time t is the rotation angle of the mobile phone around the x-axis, and the tilt angle of the mobile phone at time t is the rotation angle of the mobile phone around the y-axis; these three components form the second attitude angle.
- θ(t) is the azimuth angle of the i-th external playback device at time t, φ(t) is the inclination angle of the i-th external playback device at time t, and d(t) is the distance between the i-th external playback device and the mobile phone at time t.
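- Formulas 4 and 5 are not reproduced above; the following sketch illustrates the two stages with assumed details: the acceleration (already rotated into the world frame using the second attitude angle, and with gravity still on the z-axis) is double-integrated to obtain the rectangular coordinates, which are then converted into spherical coordinates. The Cartesian-to-spherical step is standard; the integration step is an illustrative assumption.

```python
import math
import numpy as np

def integrate_position(accel_world, dt, g=9.81):
    """Formula-4-style step: rough position of the phone in the space rectangular
    coordinate system, obtained by double-integrating the acceleration a(t).

    accel_world: world-frame acceleration vectors with gravity still on the z-axis.
    Double integration drifts quickly, so this is only an illustration.
    """
    velocity, position = np.zeros(3), np.zeros(3)
    for a in accel_world:
        a = np.asarray(a, dtype=float) - np.array([0.0, 0.0, g])  # remove gravity
        velocity += a * dt
        position += velocity * dt
    return position

def cartesian_to_spherical(x, y, z):
    """Formula-5-style step: rectangular coordinates of the phone converted into
    (azimuth, inclination, distance) in the spherical coordinate system."""
    d = math.sqrt(x * x + y * y + z * z)
    if d == 0.0:
        return 0.0, 0.0, 0.0
    azimuth = math.degrees(math.atan2(y, x)) % 360
    inclination = math.degrees(math.asin(z / d))
    return azimuth, inclination, d
```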
- the rendering device displays the interface shown in FIG. 27
- the rendering device may display a drop-down menu that may include "Orientation Control Options," “Position Control Options,” and “Interface Control Options.”
- Similar to the manner of clicking the "orientation control option", as shown in FIG. 33, the user can click the "position control option" 111 to determine that the rendering mode is the position control in the interactive rendering.
- "select rendering mode" can be replaced by "position control rendering” to prompt the user that the current rendering mode is position control.
- The user can adjust the position of the mobile phone, and the second attitude angle and the acceleration of the mobile phone are determined after the position of the mobile phone is stable (refer to the above explanation about the stability of the mobile phone position, which will not be repeated here). Further, the first sound source position is obtained through the above Formula 4 and Formula 5; in other words, the user determines the first sound source position of the first sound-emitting object, or equivalently of the first single-object audio track, based on the second attitude angle and the acceleration. Further, during the process of adjusting the mobile phone, or after the position of the mobile phone is stable, the rendering device can display an interface whose spherical view includes the first position information of the sensor corresponding to the first sound source position (that is, the first position information of the mobile phone).
- In this way, the user can intuitively see the position of the mobile phone in the spherical view, thereby facilitating the determination of the first sound source position. If the interface of the rendering device displays the first position information in the spherical view while the user adjusts the position of the mobile phone, the displayed position can change in real time according to the position change of the mobile phone.
- the above example describes an example of determining the position of the first sound source.
- On the basis of the above example, the user can also determine the second sound source position of the second single-object audio track, and the method of determining the second sound source position is similar to that of determining the first sound source position, and is not repeated here.
- The above situations of the first location information are only examples, and in practical applications, the first location information may also have other situations, which are not specifically limited here.
- the reference location information is the second location information selected by the user.
- the rendering device may provide a spherical view for the user to select the second position information, the center of the spherical view is the position of the user, and the radius of the spherical view is the distance between the user's position and the external device.
- the rendering device acquires the second position information selected by the user in the spherical view, and converts the second position information into the position of the first sound source. It can also be understood that the rendering device obtains the second position information of a certain point selected by the user in the spherical view, and converts the second position information of the point into the position of the first sound source.
- the second position information includes the two-dimensional coordinates of the point selected by the user on the tangent plane in the spherical view and the depth (ie, the distance between the tangent plane and the center of the sphere).
- After the user determines that the rendering mode is interactive rendering, the rendering device displays the interface shown in FIG. 27.
- the rendering device may display a drop-down menu that may include "Orientation Control Options,” “Position Control Options,” and “Interface Control Options.”
- Similar to the manner of clicking the "orientation control option", as shown in FIG. 35, the user can click the "interface control option" 112 to determine that the rendering mode is the interface control in the interactive rendering.
- the rendering device can display the interface shown in Figure 36. In this interface, "select rendering mode" can be replaced by "interface control rendering” to prompt the user that the current rendering mode is interface control.
- the second location information in the embodiments of the present application has various situations, which are described below:
- the second position information is obtained according to the user's selection on the vertical section.
- Specifically, the rendering device obtains the two-dimensional coordinates of the point selected by the user on the vertical section and the distance between the vertical section where the point is located and the center of the sphere (hereinafter referred to as the depth), and uses the following Formula 6 to convert the two-dimensional coordinates and the depth into the first sound source position, where the first sound source position includes the azimuth angle, the inclination angle, and the distance between the external playback device and the mobile phone.
- the right side of the interface of the rendering device can display the spherical view, the vertical slice, and the depth control bar.
- the depth control bar is used to adjust the distance between the vertical section and the center of the sphere.
- The horizontal slice is displayed by default; the user can click the meridian in the spherical view (as shown by 113 in FIG. 37), and the interface then displays the vertical slice shown in FIG. 37. The user can then click a point (x, y) on the vertical slice (shown as 114), and the spherical view in the upper right corner correspondingly displays the position of the point in the spherical coordinate system.
- the user can also adjust the distance between the vertical section and the center of the sphere through a sliding operation (as shown by 115 in FIG. 37 ).
- In other words, the second position information includes the two-dimensional coordinates (x, y) of the point and the depth r, and Formula 6 is used to obtain the first sound source position.
- In Formula 6, x is the abscissa of the point selected by the user on the vertical section, y is the ordinate of the point selected by the user on the vertical section, and r is the depth.
- θ is the azimuth angle of the i-th external playback device, φ is the inclination angle of the i-th external playback device, and d is the distance between the i-th external playback device and the mobile phone (it can also be understood as the distance between the i-th external playback device and the user).
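- Formula 6 is not reproduced above; the following geometric sketch assumes that the selected point lies at offset (x, y) on a plane located r away from the listener, which is consistent with the variable descriptions but not a verbatim reconstruction. The horizontal-section case of Formula 7 below is analogous, with the roles of the vertical offset and the depth exchanged.

```python
import math

def slice_point_to_source_position(x, y, r):
    """Convert a point selected on a vertical slice into a first sound source position.

    x, y: two-dimensional coordinates of the point on the slice (slice center as origin)
    r:    depth, i.e. the distance between the slice and the center of the sphere
    Returns (azimuth, inclination, distance) under the stated geometric assumption.
    """
    d = math.sqrt(x * x + y * y + r * r)            # distance from the listener
    if d == 0.0:
        return 0.0, 0.0, 0.0
    azimuth = math.degrees(math.atan2(x, r)) % 360  # left/right offset on the slice
    inclination = math.degrees(math.asin(y / d))    # up/down offset on the slice
    return azimuth, inclination, d
```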
- the second position information is obtained according to the user's selection on the horizontal section.
- Specifically, the rendering device obtains the two-dimensional coordinates of the point selected by the user on the horizontal section and the distance between the horizontal section where the point is located and the center of the sphere (hereinafter referred to as the depth), and uses the following Formula 7 to convert the two-dimensional coordinates and the depth into the first sound source position, where the first sound source position includes the azimuth angle, the inclination angle, and the distance between the external playback device and the mobile phone.
- a spherical view, a horizontal slice, and a depth control bar can be displayed on the right side of the interface of the rendering device.
- the depth control bar is used to adjust the distance between the horizontal slice and the center of the sphere.
- the user can click a point (x, y) on the horizontal plane (as shown in 117), and the corresponding spherical view in the upper right corner will display the position of the point in the spherical coordinate system.
- In other words, the second position information includes the two-dimensional coordinates (x, y) of the point and the depth r, and Formula 7 is used to obtain the first sound source position.
- In Formula 7, x is the abscissa of the point selected by the user on the horizontal section, y is the ordinate of the point selected by the user on the horizontal section, and r is the depth.
- θ is the azimuth angle of the i-th external playback device, φ is the inclination angle of the i-th external playback device, and d is the distance between the i-th external playback device and the mobile phone (it can also be understood as the distance between the i-th external playback device and the user).
- The above two manners of obtaining the reference location information are just examples, and in practical applications, the reference location information may also have other situations, which are not specifically limited here.
- In this way, the user can select the second position information in the spherical view (through second operations such as clicking, dragging, and sliding) to control the selected sound-emitting object and perform real-time or later dynamic rendering, giving it a specific spatial orientation, so that the user can interactively create with the audio, which provides the user with a new experience.
- the reference information includes media information of the multimedia file.
- The media information in the embodiment of the present application includes at least one of the text to be displayed in the multimedia file, the image to be displayed in the multimedia file, the musical feature of the music in the multimedia file, the sound source type corresponding to the first sounding object, and the like, which is not specifically limited here.
- determining the position of the first sound source of the first sound-emitting object based on the music characteristics of the music in the multimedia file or the sound source type corresponding to the first sound-emitting object may be understood as automatic 3D remixing. Determining the position of the first sound source of the first sound-emitting object based on the position text to be displayed in the multimedia file or the image to be displayed in the multimedia file can be understood as multimodal remixing. They are described below:
- After the rendering device determines that the first sounding object is a human voice, the rendering device can display the interface shown in FIG. 26, and the user can click the "select rendering mode icon" 107 to select the rendering mode, which is used to determine the position of the first sound source of the first sound-emitting object. As shown in FIG. 39, in response to the click operation, the rendering device may display a drop-down menu, which may include an "automatic rendering option" and an "interactive rendering option", where the "interactive rendering option" corresponds to the reference position information and the "automatic rendering option" corresponds to the media information. Further, as shown in FIG. 39, the user can click the "automatic rendering option" 119 to determine that the rendering mode is automatic rendering.
- the user may click on “Auto Rendering Options” 119, and the rendering device may, in response to the click operation, display a drop-down menu as shown in FIG. 40, the drop-down menu may include “Auto 3D Remix Options” and "Multimodal Remix Option".
- Further, the user can click the "auto 3D remix option" 120, and after the user selects automatic 3D remixing, the rendering device can display the interface shown in FIG. 41, in which "select rendering mode" may be replaced by "auto 3D remix" to prompt the user that the current rendering mode is automatic 3D remixing.
- the media information includes the music characteristics of the music in the multimedia file.
- the music feature in the embodiments of the present application may refer to at least one of music structure, music emotion, singing mode, and the like.
- the music structure may include prelude, prelude vocals, verse, transition section, or chorus, etc.; music emotion includes cheerfulness, sadness, or panic, etc.; singing mode includes solo, chorus, or accompaniment, etc.
- After the rendering device determines the multimedia file, it can analyze the music features in the audio track (which can also be understood as the audio or the song) of the multimedia file.
- The music feature can also be identified manually or through a neural network, which is not specifically limited here.
- the position of the first sound source corresponding to the music feature may be determined according to a preset association relationship, where the association relationship is the relationship between the music feature and the position of the first sound source.
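- The preset association relationship can be stored, for example, as a simple lookup table; the concrete entries below are illustrative placeholders rather than values defined by this embodiment:

```python
# Illustrative preset association relationship between music features and the
# first sound source position (or its motion trajectory); entries are examples only.
FEATURE_TO_SOURCE_POSITION = {
    "prelude": {"trajectory": "circle_overhead"},       # surround above the listener
    "panic": {"trajectory": "flicker_left_right"},
    "chorus": {"trajectory": "widen_left_right"},
    "instrument_solo": {"trajectory": "circle_by_energy"},
}

def position_for_feature(music_feature, table=FEATURE_TO_SOURCE_POSITION):
    """Look up the first sound source position assigned to a recognized music feature."""
    return table.get(music_feature)
```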
- For example, the rendering device determines that the position of the first sound source is surround, and the rendering device can display a corresponding interface.
- the musical structure may generally include at least one of an intro, intro vocals, verse, transition, or chorus.
- the following is a schematic illustration of analyzing the structure of a song as an example.
- The vocal and the musical instrument sounds are first separated from the song; the separation can be done manually or through a neural network, which is not specifically limited here.
- The song can then be segmented by judging the silent paragraphs of the human voice and the variance of the pitch.
- The specific steps include: if the silence of the human voice lasts longer than a certain threshold (for example, 2 seconds), the paragraph is considered to be over, and the song is thereby divided into large paragraphs. If there is no human voice in the first large paragraph, that large paragraph is determined to be an instrumental intro, and a large paragraph with silence in the middle is determined to be a transitional paragraph.
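- A minimal sketch of this segmentation step is given below, assuming a per-frame vocal-activity flag obtained from the separated vocal track and an assumed frame length; the 2-second silence threshold follows the example above:

```python
def split_into_large_paragraphs(vocal_active, frame_s=0.1, silence_threshold_s=2.0):
    """Split a song into large paragraphs based on vocal silence.

    vocal_active: per-frame booleans, True if the separated vocal is present.
    A paragraph is considered over once the vocal has been silent for longer
    than the threshold. Returns a list of (start_frame, end_frame) pairs.
    """
    max_silent = int(silence_threshold_s / frame_s)
    paragraphs, start, silent = [], 0, 0
    for i, active in enumerate(vocal_active):
        silent = 0 if active else silent + 1
        if silent == max_silent:              # paragraph considered over
            end = i - max_silent + 1
            if end > start:
                paragraphs.append((start, end))
            start = i + 1
    if start < len(vocal_active):
        paragraphs.append((start, len(vocal_active)))
    return paragraphs
```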
- For each large paragraph that includes vocals (hereinafter referred to as a large vocal paragraph), the center frequency is calculated by the following Formula 8, the variance of the center frequency over all moments in the large vocal paragraph is calculated, and the large vocal paragraphs are sorted according to the variance; the large vocal paragraphs whose variance is in the top 50% are marked as the chorus, and the remaining large vocal paragraphs are marked as the verse.
- In this way, the musical characteristics of the song are determined by the fluctuation of the frequency; in the subsequent rendering, for different large paragraphs, the sound source position or the motion trajectory of the sound source position can be determined through the preset association relationship, and the different large paragraphs of the song are then rendered accordingly.
- For example, if the music feature is a prelude, the first sound source position is a circle above the user (or understood as surround). Specifically, the multiple channels are first down-mixed to mono (for example, by averaging), and during the whole prelude stage the whole vocal is set to make a circle around the head; the speed at each moment is determined according to the vocal energy (represented by the RMS or the variance), and the higher the energy, the faster the rotation speed.
- If the music emotion is panic, it is determined that the first sound source position flickers to the right and flickers to the left.
- If the music feature is a chorus, the vocals of the left and right channels can be expanded and widened to increase the delay. The number of instruments in each time period is determined, and if there is a solo instrument, the instrument is made to circle according to its energy during the solo period.
- In Formula 8, fc is the center frequency of the large vocal paragraph calculated every 1 second, N is the number of large paragraphs and is a positive integer, f(n) is the frequency domain obtained by the Fourier transform of the time-domain waveform corresponding to the large paragraph, and x(n) is the energy corresponding to a certain frequency.
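- Formula 8 itself is not reproduced above; since its variables match a standard spectral centroid, the following sketch computes a per-second center frequency and applies the variance-based chorus/verse marking described above (the use of the spectral centroid and the frame layout are assumptions):

```python
import numpy as np

def centroid_per_second(vocal_samples, sr):
    """Center frequency of the separated vocal for each 1-second block
    (a standard spectral centroid is assumed as the reading of Formula 8)."""
    centroids = []
    for start in range(0, len(vocal_samples) - sr + 1, sr):
        block = vocal_samples[start:start + sr]
        energy = np.abs(np.fft.rfft(block)) ** 2         # x(n): energy per frequency
        freqs = np.fft.rfftfreq(len(block), d=1.0 / sr)  # f(n): frequency of each bin
        if energy.sum() > 0:
            centroids.append(float((freqs * energy).sum() / energy.sum()))
    return centroids

def mark_vocal_paragraphs(paragraph_centroids):
    """Mark the large vocal paragraphs whose centroid variance is in the top 50%
    as chorus and the remaining ones as verse."""
    variances = [float(np.var(c)) for c in paragraph_centroids]
    order = np.argsort(variances)[::-1]
    labels = ["verse"] * len(order)
    for idx in order[: len(order) // 2]:
        labels[idx] = "chorus"
    return labels
```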
- In this way, the orientation and dynamics of the extracted specific sound-emitting objects are set according to the musical characteristics of the music, so that the 3D rendering is more natural and the artistry is better reflected.
- the media information includes the sound source type corresponding to the first sounding object.
- the sound source types in the embodiments of the present application may be people, musical instruments, drum sounds, piano sounds, etc. In practical applications, they may be divided according to needs, which is not specifically limited here.
- The rendering device may identify the sound source type manually or through a neural network, which is not specifically limited here.
- The first sound source position corresponding to the sound source type can be determined according to a preset association relationship, where the association relationship is the relationship between the sound source type and the first sound source position (similar to the aforementioned music features, and not repeated here).
- the user may select a multimedia file by clicking the “Select Input File Icon” 121 , and the multimedia file here is “car.mkv” as an example.
- the rendering device receives the fourth operation of the user, and in response to the fourth operation, the rendering device selects "car.mkv” (ie, the target file) from the storage area as a multimedia file.
- the storage area may be a storage area in the rendering device, or may be a storage area in an external device (such as a U disk, etc.), which is not specifically limited here.
- Further, the rendering device can display an interface in which "select input file" may be replaced by "car.mkv" to prompt the user that the current multimedia file is car.mkv.
- the rendering device may use the identification network and/or the separation network in the embodiment shown in FIG. 4 to identify the sounding objects in "car.mkv” and separate the single-object audio track corresponding to each sounding object. For example, the rendering device recognizes that the sound-emitting objects included in "car.mkv” are people, cars, and wind sounds.
- The interface displayed by the rendering device may also include an object bar, and icons such as a "voice icon", a "car icon", and a "wind sound icon" may be displayed in the object bar, for the user to select the sounding object to be rendered.
- the media information includes images to be displayed in the multimedia file.
- After the rendering device obtains the multimedia file (an audio track accompanied by an image, or a video), the video can be split into frame images (the number may be one or more), the third position information of the first sound-emitting object is obtained based on the frame images, and the first sound source position is obtained based on the third position information, where the third position information includes the two-dimensional coordinates and the depth of the first sound-emitting object in the image.
- The specific step of obtaining the first sound source position based on the third position information may include: inputting the frame image into the detection network to obtain the tracking frame information (x0, y0, w0, h0) corresponding to the first sounding object in the frame image; of course, the frame image and the first sounding object may also both be used as inputs of the detection network, and the detection network then outputs the tracking frame information of the first sounding object.
- The tracking frame information includes the two-dimensional coordinates (x0, y0) of a corner point of the tracking frame, and the height h0 and the width w0 of the tracking frame.
- The rendering device uses Formula 9 to calculate the tracking frame information (x0, y0, w0, h0) to obtain the coordinates (xc, yc) of the center point of the tracking frame; the coordinates (xc, yc) of the center point are then input into the depth estimation network to obtain the relative depth of each point in the tracking frame, and Formula 10 is then used to calculate, from the relative depths of the points in the tracking frame, the average depth zc of all points within the tracking frame.
- In Formulas 9 to 12, (x0, y0) are the two-dimensional coordinates of a corner point of the tracking frame (for example, the corner point in the lower left corner), h0 is the height of the tracking frame, and w0 is the width of the tracking frame; h1 is the height of the image and w1 is the width of the image; zc is the average depth of all points in the tracking frame, computed from the relative depth of each point.
- θx_max is the maximum horizontal angle of the playback device (if the playback device consists of N external playback devices, the playback device information of the N external playback devices is the same), θy_max is the maximum vertical angle of the playback device, and dy_max is the maximum depth of the playback device; θi is the azimuth angle of the i-th external playback device, φi is the inclination angle of the i-th external playback device (obtained, for example, as φi = y_norm · θy_max), and ri is the distance between the i-th external playback device and the user.
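- Only the center-point and average-depth steps of Formulas 9 to 12 are stated explicitly above; the sketch below combines them with a plausible normalization of the center point onto the playback device's maximum horizontal angle, maximum vertical angle, and maximum depth (that normalization is an assumption, not a verbatim reconstruction):

```python
import numpy as np

def tracking_box_to_source_position(x0, y0, w0, h0, rel_depth, w1, h1,
                                    theta_x_max, theta_y_max, d_y_max):
    """Map a tracking box in a frame image to a first sound source position.

    x0, y0, w0, h0: corner point, width and height of the tracking box
    rel_depth:      relative depths of the points inside the tracking box
    w1, h1:         width and height of the frame image
    theta_x_max, theta_y_max, d_y_max: playback device information
    Returns (azimuth, inclination, distance) of the sound-emitting object.
    """
    x_c = x0 + w0 / 2.0              # Formula 9: center point of the tracking box
    y_c = y0 + h0 / 2.0
    z_c = float(np.mean(rel_depth))  # Formula 10: average depth inside the box

    x_norm = x_c / w1 - 0.5          # normalized offset from the image center
    y_norm = y_c / h1 - 0.5
    theta_i = x_norm * theta_x_max   # azimuth of the object
    phi_i = y_norm * theta_y_max     # inclination, as in the recovered fragment
    r_i = z_c * d_y_max              # distance, scaled by the maximum depth
    return theta_i, phi_i, r_i
```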
- Specifically, the user can click the "multi-modal remix option" 122, and the rendering device can respond to the click operation and display the interface shown in FIG. 43, where the right side of the interface includes a certain frame (for example, the first frame) image of "car.mkv" and the playback device information, and the playback device information includes the maximum horizontal angle, the maximum vertical angle, and the maximum depth.
- If the playback device is an earphone, the user can input the playback device information. If the playback device is an external playback device, the user can input the playback device information or directly use the calibration information obtained in the calibration phase as the playback device information, which is not specifically limited here.
- The rendering device may display the interface shown in FIG. 44, in which "select rendering mode" may be replaced by "multi-modal remix" to prompt the user that the current rendering mode is multi-modal remixing.
- When the media information includes an image to be displayed in the multimedia file, there are several ways of determining the first sound-emitting object in the image, which are described below:
- the first sounding object is determined by the user's click in the object column.
- the rendering device may determine the first sound-emitting object based on the user's click in the object bar.
- the user may determine that the sound-emitting object to be rendered is a car by clicking on the “car icon” 123 .
- the rendering device displays the tracking frame of the car in the image on the right side of "car.mkv", and then obtains the third position information, and converts the third position information into the first sound source position through Formula 9 to Formula 12.
- the interface also includes the corner point coordinates (x 0 , y 0 ) of the lower left corner of the tracking frame and the center point coordinates (x c , y c ).
- for example, the maximum horizontal angle in the external device information is 120 degrees, the maximum vertical angle is 60 degrees, and the maximum depth is 10 (the unit may be meters, decimeters, etc., which is not specifically limited here).
- the first sounding object is determined by the user's click on the image.
- the rendering device may use the sound-emitting object determined by the user's third operation (eg, clicking) in the image as the first sound-emitting object.
- the user may determine the first sounding object by clicking on the sounding object (as shown in 124 ) in the image.
- the first sounding object is determined according to the default setting.
- the rendering device may identify the sound-emitting object through the audio track corresponding to the image, may track the default sound-emitting object or all sound-emitting objects in the image, and determine the third position information.
- the third position information includes the two-dimensional coordinates of the sounding object in the image and the depth of the sounding object in the image.
- the rendering device may select "Close” by default in the object column, that is, all sound-emitting objects in the image are tracked, and the third position information of the first sound-emitting object is determined respectively.
- the 3D immersion is rendered through the earphone or the external playback environment, so that the sound realistically moves with the picture, allowing the user to obtain the best sound experience.
- the technology of tracking and rendering the audio of the object in the entire video after selecting the sounding object can also be applied in professional mixing post-production, improving the work efficiency of the mixer.
- the media information includes the position text to be displayed in the multimedia file.
- the rendering device may determine the position of the first sound source based on the position text to be displayed in the multimedia file, where the position text is used to indicate the position of the first sound source.
- the position text can be understood as a text with meanings such as position and orientation, for example: wind blows north, heaven, hell, front, back, left, right, etc., which are not specifically limited here.
- the position text may specifically be lyrics, subtitles, advertisement slogans, etc., which is not specifically limited here.
- the semantics of the displayed location characters can be recognized based on reinforcement learning or a neural network, and then the location of the first sound source is determined according to the semantics.
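- For instance, a very simple keyword lookup (standing in for the reinforcement-learning or neural-network recognition mentioned above, whose details are not given here) could map position words in lyrics or subtitles to a coarse sound source position; all keywords and coordinates below are illustrative assumptions.

```python
# Hypothetical mapping from position words to (azimuth_deg, elevation_deg, distance_m).
POSITION_KEYWORDS = {
    "front": (0.0, 0.0, 2.0),
    "back": (180.0, 0.0, 2.0),
    "left": (-90.0, 0.0, 2.0),
    "right": (90.0, 0.0, 2.0),
    "heaven": (0.0, 60.0, 5.0),   # above the listener
    "hell": (0.0, -60.0, 5.0),    # below the listener
    "north wind": (0.0, 10.0, 8.0),
}

def position_from_text(text):
    """Return the coordinates of the first matching position keyword, or None."""
    lowered = text.lower()
    for keyword, position in POSITION_KEYWORDS.items():
        if keyword in lowered:
            return position
    return None

print(position_from_text("The north wind blows across the plain"))  # (0.0, 10.0, 8.0)
```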
- in step 1003, various situations of determining the first sound source position based on the reference information are described.
- the position of the first sound source can also be determined in a combined manner. For example, after the position of the first sound source is determined by the sensor orientation, the motion trajectory of the first sound source position is determined by the music feature.
- based on the first attitude angle of the sensor, the rendering device has determined that the sound source position of the human voice is as shown on the right side of the interface in FIG. 46. Further, the user can determine the movement trajectory of the human voice by clicking the "circle option" 125 in the menu on the right side of the "voice icon".
- the position of the first sound source at a certain moment is determined by the orientation of the sensor first, and the motion trajectory of the position of the first sound source is determined by using music features or preset rules as a circle.
- the interface of the rendering device can display the movement track of the generated object.
- the user can control the distance in the position of the first sound source by controlling the volume key of the mobile phone or clicking, dragging, sliding, etc. on the spherical view.
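- Combining the two sources of information could look like the following sketch: the initial position comes from the sensor attitude, and the "circle" trajectory option then moves the source around the listener over time; the angular speed, frame rate, and parameterization are assumptions for illustration.

```python
import numpy as np

def circular_trajectory(start_azimuth_deg, elevation_deg, distance,
                        duration_s, fps=50, period_s=8.0):
    """Generate (t, azimuth, elevation, distance) samples for a circular motion.

    The source keeps the elevation and distance determined from the sensor
    attitude and sweeps a full circle around the listener every `period_s` seconds.
    """
    times = np.arange(0.0, duration_s, 1.0 / fps)
    azimuths = (start_azimuth_deg + 360.0 * times / period_s) % 360.0
    return [(float(t), float(az), elevation_deg, distance)
            for t, az in zip(times, azimuths)]

# Example: vocal starts at 30 degrees azimuth, 0 degrees elevation, 2 m away.
trajectory = circular_trajectory(30.0, 0.0, 2.0, duration_s=4.0)
print(trajectory[:3])
```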
- Step 1004 Perform spatial rendering on the first single-object audio track based on the position of the first sound source.
- the rendering device may perform spatial rendering on the first single-object soundtrack, and obtain the rendered first single-object soundtrack.
- the rendering device performs spatial rendering on the first single-object audio track based on the position of the first sound source, and obtains the rendered first single-object audio track.
- the rendering device can also spatially render the first single-object audio track based on the first sound source position and render the second single-object audio track based on the second sound source position, to obtain the rendered first single-object audio track and the rendered second single-object audio track.
- the method for determining the position of the sound source in this embodiment of the present application may apply the various methods in step 1003.
- the combination is not specifically limited here.
- the first sounding object is a person
- the second sounding object is a violin
- the position of the first sound source of the first single-object track corresponding to the first sounding object may adopt a certain method in interactive rendering.
- the second sound source position of the second single-object audio track corresponding to the second sound-emitting object may adopt a certain method in automatic rendering.
- the specific methods of determining the first sound source position and the second sound source position can be any two of the methods in the foregoing step 1003, or the same method can be adopted for both, which is not specifically limited here.
- the spherical view may also include a volume bar, and the user can control the volume of the first single-object audio track by performing operations such as finger sliding, mouse dragging, or mouse-wheel scrolling on the volume bar, which improves the real-time performance of rendering the audio track.
- the user can adjust the volume bar 126 to adjust the volume of the single-object track corresponding to the guitar.
- depending on the type of playback device, the rendering method in this step may differ. It can also be understood that the method by which the rendering device spatially renders the original audio track or the first single-object audio track based on the first sound source position varies with the type of playback device, as described below:
- the playback device type is headphones.
- the audio track can be rendered based on Formula 13 and the HRTF filter coefficient table.
- the audio track may be a first single-object audio track, a second single-object audio track, or a first single-object audio track and a second single-object audio track, which is not specifically limited here.
- the HRTF filter coefficient table is used to represent the relationship between the sound source position and the coefficient, and it can also be understood that one sound source position corresponds to one HRTF filter coefficient.
- in Formula 13, a s (t) is the adjustment coefficient of the first sounding object at time t;
- h i,s (t) is the head-related transfer function (HRTF) filter coefficient of the left channel or the right channel corresponding to the first sounding object at time t, where the HRTF filter coefficient of the left channel corresponding to the first sounding object at time t is generally different from the HRTF filter coefficient of the right channel corresponding to the first sounding object at time t;
- the HRTF filter coefficient is related to the first sound source position;
- o s (t) is the first single-object audio track at time t, and τ is the integration variable.
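- A minimal sketch of the headphone case (Formula 13 itself is not reproduced in the text): each single-object track is convolved with the left and right head-related impulse responses selected for the current source position and scaled by the adjustment coefficient. The HRIR table, its indexing, and the static source position are assumptions; a real implementation would interpolate HRTFs and process time-varying positions block by block.

```python
import numpy as np

def render_binaural(mono_track, hrir_left, hrir_right, gain=1.0):
    """Convolve a mono single-object track with left/right HRIRs (static position).

    mono_track: 1-D array of samples o_s(t)
    hrir_left / hrir_right: impulse responses h_{i,s} for the chosen source position
    gain: adjustment coefficient a_s
    """
    left = gain * np.convolve(mono_track, hrir_left)
    right = gain * np.convolve(mono_track, hrir_right)
    return np.stack([left, right], axis=0)  # shape (2, samples)

# Example with dummy data: 1 second of noise and 128-tap toy HRIRs.
track = np.random.randn(48000)
hl = np.random.randn(128) * 0.05
hr = np.random.randn(128) * 0.05
binaural = render_binaural(track, hl, hr, gain=0.8)
print(binaural.shape)
```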
- the playback device type is an external playback device.
- the audio track can be rendered based on Formula Fourteen.
- the audio track may be a first single-object audio track, a second single-object audio track, or a first single-object audio track and a second single-object audio track, which is not specifically limited here.
- in Formula 14, the number of external devices can be N, the rendered first single-object audio track is obtained for the i-th channel in the multi-channel output, S is the set of sounding objects of the multimedia file and includes the first sounding object, a s (t) is the adjustment coefficient of the first sounding object at time t, g s (t) represents the translation coefficient of the first sounding object at time t, and o s (t) is the first single-object audio track at time t; λ i is the azimuth angle obtained by the calibrator (such as the aforementioned sensor device) calibrating the i-th external device, Φ i is the inclination angle obtained by the calibrator calibrating the i-th external device, r i is the distance between the i-th external device and the calibrator, N is a positive integer, i is a positive integer and i ≤ N, and the first sound source position is within the tetrahedron formed by the N external devices.
- the single-object audio track corresponding to a certain sound-emitting object in the original audio track may be rendered and replaced, for example, S 1 in the above formula. It may also be a rendering of a single-object audio track corresponding to a certain sound-emitting object in the original audio track after duplication and addition, for example, S 2 in the above formula. Of course, it can also be a combination of the above S1 and S2 .
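- For the external-device case, one common way to realize translation (panning) coefficients such as g s (t) is vector base amplitude panning over a triplet of loudspeakers; the sketch below is such a generic method and is not claimed to be the exact Formula 14.

```python
import numpy as np

def direction(azimuth_deg, elevation_deg):
    """Unit direction vector for a given azimuth/elevation (degrees)."""
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    return np.array([np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)])

def vbap_gains(source_dir, speaker_dirs):
    """Amplitude-panning gains for one source over three speakers (VBAP).

    speaker_dirs: 3x3 matrix whose rows are unit vectors towards the speakers.
    Returns normalized, non-negative gains; negative raw values would indicate the
    source lies outside this speaker triangle and another triplet should be used.
    """
    gains = source_dir @ np.linalg.inv(speaker_dirs)
    gains = np.clip(gains, 0.0, None)
    norm = np.linalg.norm(gains)
    return gains / norm if norm > 0 else gains

speakers = np.stack([direction(-30, 0), direction(30, 0), direction(0, 60)])
print(vbap_gains(direction(10, 20), speakers))
```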
- Step 1005 Acquire a target audio track based on the rendered first single-object audio track.
- the method used to obtain the target audio track in this step may differ; it can also be understood that the method used by the rendering device to obtain the target audio track varies with the type of playback device, as described below:
- the playback device type is headphones.
- the rendering device may acquire the target audio track based on Formula 15 and the rendered audio track.
- the audio track may be a first single-object audio track, a second single-object audio track, or a first single-object audio track and a second single-object audio track, which is not specifically limited here.
- in Formula 15, i indicates the left channel or the right channel, and the target audio track at time t is obtained from X i (t), the original audio track at time t, the first single-object audio track that has not been rendered at time t, and the rendered first single-object audio track.
- a s (t) is the adjustment coefficient of the first sounding object at time t.
- h i,s (t) is the HRTF filter coefficient of the left channel or the right channel corresponding to the first sounding object at time t, and is related to the first sound source position.
- o s (t) is the first single-object audio track at time t, and τ is the integration variable.
- S 1 is the sounding object that needs to be replaced in the original audio track; if the first sounding object replaces a sounding object in the original audio track, then S 1 is an empty set.
- S 2 is the sounding object added by the target audio track compared with the original audio track; if the first sounding object is a copy of a sounding object in the original audio track, then S 2 is an empty set.
- it can be understood that when the spatial rendering of the audio track replaces a sound-emitting object, the single-object audio track corresponding to that sound-emitting object is spatially rendered and the rendered single-object audio track is used to replace the original single-object audio track in the multimedia file. In other words, compared with the multimedia file, the target audio track does not contain a single-object audio track for an additional sounding object; instead, the original single-object audio track in the multimedia file is replaced by the rendered single-object audio track.
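- A sketch of how the target stereo track could be assembled from the original track and the rendered object in the headphone case (the replace/add behaviour of S 1 and S 2 described above), assuming the unrendered object signal is available so it can be subtracted when it is being replaced; the function name and shapes are illustrative.

```python
import numpy as np

def mix_target_headphone(original_lr, unrendered_obj_lr, rendered_obj_lr, replace=True):
    """Build the target stereo track.

    original_lr / unrendered_obj_lr / rendered_obj_lr: arrays of shape (2, samples).
    replace=True  -> the object's original contribution is removed and replaced
                     by its rendered version.
    replace=False -> the rendered object is simply added (duplicated) on top of
                     the original track.
    """
    n = min(original_lr.shape[1], rendered_obj_lr.shape[1])
    target = original_lr[:, :n].copy()
    if replace:
        target -= unrendered_obj_lr[:, :n]
    target += rendered_obj_lr[:, :n]
    return target

orig = np.random.randn(2, 48000)
obj = np.random.randn(2, 48000) * 0.1
rendered = obj[::-1] * 0.2  # stand-in for the spatially rendered object
print(mix_target_headphone(orig, obj, rendered, replace=True).shape)
```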
- the playback device type is an external playback device.
- the rendering device may acquire the target audio track based on Formula 16 and the rendered audio track.
- the audio track may be a first single-object audio track, a second single-object audio track, or a first single-object audio track and a second single-object audio track, which is not specifically limited here.
- in Formula 16, the number of external devices can be N, i indicates the i-th channel in the multi-channel output, and the target audio track at time t is obtained from X i (t), the original audio track at time t, the first single-object audio track that has not been rendered at time t, and the rendered first single-object audio track; a s (t) is the adjustment coefficient of the first sound-emitting object at time t, g s (t) represents the translation coefficient of the first sound-emitting object at time t, g i,s (t) represents the i-th row of g s (t), and o s (t) is the first single-object audio track at time t.
- S 1 is the sounding object that needs to be replaced in the original audio track; if the first sounding object replaces a sounding object in the original audio track, then S 1 is an empty set.
- S 2 is the sounding object added by the target audio track compared with the original audio track; if the first sounding object is a copy of a sounding object in the original audio track, then S 2 is an empty set.
- S 1 and/or S 2 are the sound-emitting objects of the multimedia file and include the first sound-emitting object.
- λ i is the azimuth angle obtained by the calibrator calibrating the i-th external device, Φ i is the inclination angle obtained by the calibrator calibrating the i-th external device, r i is the distance between the i-th external device and the calibrator, N is a positive integer, i is a positive integer and i ≤ N, and the first sound source position is within the tetrahedron formed by the N external devices.
- a new multimedia file can also be generated according to the multimedia file and the target audio track, which is not specifically limited here.
- the user can upload the setting method of the sound source position during the rendering process to the database module corresponding to the aforementioned Figure 8, so as to facilitate other users to use this setting method to render other audio tracks.
- the user can also download the setting method from the database module and modify it to facilitate the spatial rendering of the audio track.
- in this way, the modification of rendering rules and the sharing of them between different users are supported.
- on the one hand, in the multimodal mode, repeated object recognition and tracking of the same file can be avoided, and the overhead on the device side can be reduced; on the other hand, the user's free creation in the interactive mode can be shared with other users to further enhance the interactivity of the application.
- the user may choose to synchronize the rendering rule file stored in the local database to other devices of the user.
- the user can choose to upload the rendering rule file stored in the local database to the cloud for sharing with other users, and other users can choose to download the corresponding rendering rule file from the cloud database to the terminal.
- the metadata file stored in the database is mainly used to render the sound-emitting object separated by the system or the object specified by the user in the automatic mode, or, in the mixed mode, to automatically render the sound-emitting object specified by the user according to the stored rendering rules.
- the metadata files stored in the database can be prefabricated by the system, such as serial numbers 1 and 2 in Table 1; they can also be created by the user when using the interactive mode of the present invention, such as serial numbers 3-6 in Table 1; they can also be stored after the system automatically identifies the sounding object and specifies its motion trajectory in the multi-modal mode, such as serial number 7 in Table 1.
- the metadata file can be strongly related to the audio content in the multimedia file or the multimodal file content: for example, serial number 3 in Table 1 is the metadata file corresponding to audio file A1, and serial number 4 is the metadata file corresponding to audio file A2. It can also be decoupled from the multimedia file: the user performs an interactive operation on object X of audio file A in the interactive mode and saves the motion trajectory of object X as the corresponding metadata file (for example, the freely-hovering-and-rising state of serial number 5 in Table 1); when using automatic rendering next time, the user can select the metadata file of the freely-hovering-and-rising state from the database module to render object Y of audio file B.
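- One plausible shape for such a metadata (rendering rule) file, matching the kinds of entries described for Table 1 (serial number, optionally associated audio file, object, and a motion trajectory); the field names and JSON format are assumptions for illustration only.

```python
import json

# Hypothetical rendering-rule entry, decoupled from any particular multimedia file.
rule = {
    "serial_number": 5,
    "name": "free hovering rising",
    "bound_audio_file": None,      # None means the rule can be reused on any object
    "object_type": "vocal",
    "trajectory": [                # (time_s, azimuth_deg, elevation_deg, distance_m)
        [0.0, 0.0, 0.0, 2.0],
        [2.0, 45.0, 20.0, 2.0],
        [4.0, 90.0, 40.0, 2.5],
    ],
}

with open("rendering_rule_5.json", "w", encoding="utf-8") as f:
    json.dump(rule, f, ensure_ascii=False, indent=2)

# A downloaded or shared rule can be loaded the same way and applied to another object.
with open("rendering_rule_5.json", encoding="utf-8") as f:
    print(json.load(f)["name"])
```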
- the rendering method provided by this embodiment of the present application includes steps 1001 to 1005 .
- the rendering method provided by this embodiment of the present application includes steps 1002 to 1005.
- the rendering method provided by this embodiment of the present application includes steps 1001 to 1004 .
- the rendering method provided by this embodiment of the present application includes steps 1002 to 1004 .
- the steps shown in FIG. 10 in this embodiment of the present application do not limit the timing relationship between the steps.
- for example, step 1001 in the above method may also be performed after step 1002, that is, the playback device is calibrated after the audio track is acquired.
- the user can control the sound image, quantity, and volume of a specific sound-emitting object through the mobile phone sensor, adjust the sound image, quantity, and volume of the specific sound-emitting object by dragging on the mobile phone interface, and control the music through automated rules.
- the spatial rendering of specific sound-emitting objects improves spatiality; the automatic rendering of sound source positions through multi-modal recognition and the rendering of individual sound-emitting objects provide a sound effect experience completely different from the traditional music and movie interaction mode, and offer a new interactive way for music appreciation.
- automated 3D re-production enhances the sense of space of binaural music and brings listening to a new level.
- the separately designed interactive method enhances the user's ability to edit audio; it can be applied to the production of sound-emitting objects in music, film, and television works to simply edit the motion information of specific sound-emitting objects. It also increases the user's control over music and its playability, allowing users to experience the fun of producing audio themselves and the ability to control specific sound-emitting objects.
- the present application also provides two specific application scenarios for applying the above-mentioned rendering method, which are described below:
- the first is the "Sound Hunter" game scene.
- this scenario can also be understood as follows: the user points to the sound source position, the system judges whether the user's pointing is consistent with the actual sound source position, and the user's operation is scored, improving the user's entertainment experience.
- the device can display the interface shown in Figure 51.
- the user can click the play button at the bottom of the interface to confirm the start of the game, and the playing device will play at least one single-object track in a certain order and at any position.
- for example, the playback device plays the single-object track of the piano; the user judges the sound source position by hearing and holds the mobile phone to point to the judged sound source position. If the pointing is consistent with the actual sound source position (or the error is within a certain range), the rendering device can display the prompt "hit the first instrument, it took 5.45 seconds, and beat 99.33% of the people in the universe", as shown in the right interface of Figure 51.
- the corresponding sounding object in the object bar can change from red to green.
- otherwise, a failure may be displayed.
- after the preset time period (the time interval T in FIG. 54) elapses, the next single-object track is played to continue the game, as shown in FIG. 52 and FIG. 53.
- the corresponding sounding object in the object bar can remain red.
- the rendering device can display the interface shown in FIG. 53 .
- the orientation of the audio of the object is rendered in the playback system in real time, and the game is designed so that the user can obtain the ultimate "listening to position" experience. It can be applied to home entertainment, AR and VR games, etc. Compared with the prior art about "listening to position", which only targets a whole song, the present application provides a game that is played after separating the vocals and instruments of a song.
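- A sketch of the hit test for the "Sound Hunter" scene: the user's pointing direction (from the phone's attitude) is compared with the direction of the currently playing single-object source, and a hit is counted when the angular error is within a tolerance and inside the time window; the tolerance and time limit are assumptions, not values taken from the application.

```python
import numpy as np

def unit(azimuth_deg, elevation_deg):
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    return np.array([np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)])

def sound_hunter_hit(pointing, source, elapsed_s, max_error_deg=15.0, time_limit_s=10.0):
    """Return (hit, angular_error_deg) for one guess.

    pointing / source: (azimuth_deg, elevation_deg) of the user's pointing and of
    the actual sound source; elapsed_s: time since the track started playing.
    """
    cos_err = float(np.clip(np.dot(unit(*pointing), unit(*source)), -1.0, 1.0))
    error = float(np.degrees(np.arccos(cos_err)))
    return (error <= max_error_deg and elapsed_s <= time_limit_s), error

hit, err = sound_hunter_hit(pointing=(32.0, 5.0), source=(30.0, 0.0), elapsed_s=5.45)
print(hit, round(err, 2))
```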
- the second is a multi-person interactive scene.
- This scene can be understood as multiple users controlling the sound source position of a specific sounding object respectively, so as to realize the rendering of the sound track by multiple people, and increase the entertainment and communication among the multiple users.
- the interactive scene may specifically be an online multi-person band or an online host controlling a symphony, etc.
- the multimedia file is music composed of multiple musical instruments.
- User A can select a multi-person interaction mode and invite user B to complete the creation together.
- the rendering tracks given by the corresponding users are rendered and remixed respectively, and then the remixed audio files are sent to each participating user.
- the interaction modes selected by different users may be different, which are not specifically limited here. For example, as shown in Figure 55, user A selects the interactive mode to interactively control the position of object A by changing the orientation of the mobile phone he uses, and user B selects the interactive mode to interactively control the position of object B by changing the orientation of the mobile phone he uses.
- the system can send the remixed audio file to each user participating in the multi-person interactive application.
- the position of object A and the position of object B in the audio file correspond to the manipulations of user A and user B, respectively.
- user A selects an input multimedia file
- the system identifies the object information in the input file, and feeds back to user A through the UI interface.
- User A selects the mode. If user A selects the multi-person interaction mode, user A sends a multi-person interaction request to the system, and sends the information of the designated invitee to the system.
- the system sends an interaction request to user B selected by user A. If user B accepts the request, it sends an acceptance instruction to the system to join the multi-person interactive application created by user A.
- User A and User B respectively select the sound-emitting object to be operated, and use the above-mentioned rendering mode to control the selected sound-emitting object, and file the corresponding rendering rule.
- the system separates the single-object audio tracks through the separation network, renders each separated single-object audio track according to the rendering track provided by the user corresponding to that sounding object, and then remixes the rendered single-object audio tracks to obtain the target audio track, and then sends the target audio track to each participating user.
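- Putting the multi-person flow together in code form: the separation network, rendering calls, and network transport are represented by placeholder callables, so only the orchestration described above is illustrated; all names are hypothetical.

```python
def multi_user_remix(multimedia_file, user_rules, separate, render, remix, send):
    """Orchestrate multi-person interactive rendering.

    user_rules: {user_id: {"object": name, "trajectory": [...]}} collected from
                each participant's interaction.
    separate / render / remix / send: callables provided by the system (stubs here).
    """
    # 1. Separate the original track into single-object tracks.
    object_tracks = separate(multimedia_file)          # {object_name: track}

    # 2. Render each selected object with the trajectory its user provided.
    rendered = {}
    for user_id, rule in user_rules.items():
        name = rule["object"]
        rendered[name] = render(object_tracks[name], rule["trajectory"])

    # 3. Remix the rendered objects with the remaining tracks into the target track.
    target = remix(object_tracks, rendered)

    # 4. Send the target track to every participating user.
    for user_id in user_rules:
        send(user_id, target)
    return target

# Toy usage with stub implementations.
stub_tracks = {"singer A": [0.1] * 4, "singer B": [0.2] * 4}
target = multi_user_remix(
    "duet.wav",
    {"user A": {"object": "singer A", "trajectory": []},
     "user B": {"object": "singer B", "trajectory": []}},
    separate=lambda f: stub_tracks,
    render=lambda track, traj: track,
    remix=lambda tracks, rendered: [sum(v) for v in zip(*rendered.values())],
    send=lambda uid, t: None,
)
print(target)
```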
- the multiplayer interaction mode can be the real-time online multiplayer interaction described in the above example, or the multiplayer interaction in an offline situation.
- the multimedia file selected by user A is duet music, including singer A and singer B.
- user A can select the interactive mode to control the rendering effect of singer A, and share the re-rendered target audio track to user B; user B can use the received target audio track shared by user A as Input the file to control the rendering effect of singer B.
- the interaction modes selected by different users may be the same or different, which are not specifically limited here.
- real-time and non-real-time interactive rendering control with multiple people is supported, and users can invite other users to re-render and create different sound-emitting objects in multimedia files, enhancing the interactive experience and the fun of the application.
- through the above-mentioned method, multiple people cooperatively control the sound image of different objects, thereby realizing multi-person rendering of the multimedia file.
- An embodiment of the rendering device in the embodiment of the present application includes:
- Obtaining unit 5801 is used to obtain the first single-object soundtrack based on the multimedia file, and the first single-object soundtrack corresponds to the first sounding object;
- a determining unit 5802 configured to determine the first sound source position of the first sounding object based on reference information, the reference information includes reference position information and/or media information of a multimedia file, and the reference position information is used to indicate the first sound source position;
- the rendering unit 5803 is configured to perform spatial rendering on the first single-object audio track based on the first sound source position, so as to obtain the rendered first single-object audio track.
- each unit in the rendering device is similar to those described in the foregoing embodiments shown in FIG. 5 to FIG. 11 , and details are not repeated here.
- the obtaining unit 5801 obtains the first single-object soundtrack based on the multimedia file, and the first single-object soundtrack corresponds to the first sounding object; the determining unit 5802 determines the first sound source position of the first sounding object based on the reference information; and the rendering unit 5803 spatially renders the first single-object audio track based on the first sound source position to obtain the rendered first single-object audio track.
- the stereoscopic sense of space of the first single-object soundtrack corresponding to the first sound-emitting object in the multimedia file can be improved, and the user can be provided with immersive stereoscopic sound effects.
- another embodiment of the rendering device in the embodiment of the present application includes:
- Obtaining unit 5901 is used to obtain the first single-object soundtrack based on the multimedia file, and the first single-object soundtrack corresponds to the first sounding object;
- a determining unit 5902 configured to determine the first sound source position of the first sounding object based on reference information, the reference information includes reference position information and/or media information of a multimedia file, and the reference position information is used to indicate the first sound source position;
- the rendering unit 5903 is configured to perform spatial rendering on the first single-object audio track based on the position of the first sound source, so as to obtain the rendered first single-object audio track.
- the providing unit 5904 is used to provide a spherical view for the user to select, the center of the spherical view is the position of the user, and the radius of the spherical view is the distance between the user's position and the playback device;
- the sending unit 5905 is used for sending the target audio track to the playback device, and the playback device is used for playing the target audio track.
- each unit in the rendering device is similar to those described in the foregoing embodiments shown in FIG. 5 to FIG. 11 , and details are not repeated here.
- the obtaining unit 5901 obtains the first single-object soundtrack based on the multimedia file, and the first single-object soundtrack corresponds to the first sounding object; the determining unit 5902 determines the first sound source position of the first sounding object based on the reference information; and the rendering unit 5903 spatially renders the first single-object audio track based on the first sound source position to obtain the rendered first single-object audio track.
- the stereoscopic sense of space of the first single-object soundtrack corresponding to the first sound-emitting object in the multimedia file can be improved, and the user can be provided with immersive stereoscopic sound effects. In addition, it provides a sound effect experience that is completely different from the traditional music movie interaction mode. It provides a new interactive way for music appreciation.
- automated 3D re-production enhances the sense of space of binaural music and brings listening to a new level.
- the separately designed interactive method enhances the user's ability to edit audio; it can be applied to the production of sound-emitting objects in music, film, and television works to simply edit the motion information of specific sound-emitting objects. It also increases the user's control over music and its playability, allowing users to experience the fun of producing audio themselves and the ability to control specific sound-emitting objects.
- an embodiment of the rendering device in the embodiment of the present application includes:
- the obtaining unit 6001 is further configured to obtain the first single-object soundtrack based on the multimedia file, and the first single-object soundtrack corresponds to the first sounding object;
- a display unit 6002 configured to display a user interface, where the user interface includes rendering mode options;
- a determining unit 6003 configured to respond to the user's first operation on the user interface, and determine an automatic rendering mode or an interactive rendering mode from the rendering mode options;
- the acquiring unit 6001 is further configured to acquire the rendered first single-object audio track based on the preset mode when the determination unit determines the automatic rendering mode; or
- the obtaining unit 6001 is further configured to, when the determination unit determines the interactive rendering mode, obtain reference position information in response to the second operation of the user, determine the first sound source position of the first sound-emitting object based on the reference position information, and spatially render the first single-object audio track based on the first sound source position to obtain the rendered first single-object audio track.
- each unit in the rendering device is similar to those described in the foregoing embodiments shown in FIG. 5 to FIG. 11 , and details are not repeated here.
- the determining unit 6003 determines the automatic rendering mode or the interactive rendering mode from the rendering mode options according to the user's first operation, and the obtaining unit 6001 then obtains the rendered first single-object audio track accordingly.
- the spatial rendering of the audio track corresponding to the first sound-emitting object in the multimedia file can be realized through the interaction between the rendering device and the user, so as to provide the user with an immersive stereo sound effect.
- the rendering device may include a processor 6101 , a memory 6102 and a communication interface 6103 .
- the processor 6101, the memory 6102 and the communication interface 6103 are interconnected by wires.
- the memory 6102 stores program instructions and data.
- the memory 6102 stores program instructions and data corresponding to the steps performed by the rendering device in the corresponding embodiments shown in FIG. 5 to FIG. 11 .
- the processor 6101 is configured to perform the steps performed by the rendering device shown in any of the foregoing embodiments shown in FIG. 5 to FIG. 11 .
- the communication interface 6103 may be used to receive and transmit data, and to perform the steps related to acquisition, transmission, and reception in any of the foregoing embodiments shown in FIG. 5 to FIG. 11 .
- the rendering device may include more or less components relative to FIG. 61 , which are merely illustrative and not limited in this application.
- This embodiment of the present application also provides a sensor device, as shown in FIG. 62 .
- the sensor device can be any terminal device including a mobile phone, tablet computer, etc. Taking the sensor as a mobile phone as an example:
- FIG. 62 is a block diagram showing a partial structure of a sensor device-mobile phone provided by an embodiment of the present application.
- the mobile phone includes: a radio frequency (RF) circuit 6210, a memory 6220, an input unit 6230, a display unit 6240, a sensor 6250, an audio circuit 6260, a wireless fidelity (WiFi) module 6270, and a processor 6280 , and the power supply 6290 and other components.
- the structure of the mobile phone shown in FIG. 62 does not constitute a limitation on the mobile phone, which may include more or fewer components than shown, combine some components, or use a different arrangement of components.
- the RF circuit 6210 can be used for receiving and sending signals during sending and receiving of information or during a call. In particular, after receiving the downlink information of the base station, it is processed by the processor 6280; in addition, it sends the designed uplink data to the base station.
- the RF circuit 6210 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like.
- the RF circuit 6210 can also communicate with networks and other devices via wireless communication.
- the above-mentioned wireless communication can use any communication standard or protocol, including but not limited to the global system of mobile communication (global system of mobile communication, GSM), general packet radio service (general packet radio service, GPRS), code division multiple access (code division multiple access) multiple access, CDMA), wideband code division multiple access (WCDMA), long term evolution (long term evolution, LTE), email, short message service (short messaging service, SMS) and so on.
- the memory 6220 can be used to store software programs and modules, and the processor 6280 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 6220 .
- the memory 6220 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the data storage area may store data created by the use of the mobile phone (such as audio data, a phone book, etc.).
- the memory 6220 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
- the input unit 6230 can be used for receiving inputted numerical or character information, and generating key signal input related to user setting and function control of the mobile phone.
- the input unit 6230 may include a touch panel 6231 and other input devices 6232 .
- the touch panel 6231, also known as a touch screen, can collect the user's touch operations on or near it (such as operations performed on or near the touch panel 6231 by the user's finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection apparatus according to a preset program.
- the touch panel 6231 may include two parts, a touch detection device and a touch controller.
- the touch detection device detects the user's touch orientation, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, and then sends the coordinates to the processor 6280.
- the touch panel 6231 can be implemented in various types such as resistive, capacitive, infrared, and surface acoustic waves.
- the input unit 6230 may further include other input devices 6232.
- other input devices 6232 may include, but are not limited to, one or more of physical keyboards, function keys (such as volume control keys, switch keys, etc.), trackballs, mice, joysticks, and the like.
- the display unit 6240 may be used to display information input by the user or information provided to the user and various menus of the mobile phone.
- the display unit 6240 may include a display panel 6241.
- the display panel 6241 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like.
- the touch panel 6231 can cover the display panel 6241. When the touch panel 6231 detects a touch operation on or near it, it transmits the operation to the processor 6280 to determine the type of the touch event, and then the processor 6280 provides a corresponding visual output on the display panel 6241 according to the type of the touch event.
- the touch panel 6231 and the display panel 6241 may be used as two independent components to implement the input and output functions of the mobile phone; in some embodiments, the touch panel 6231 and the display panel 6241 may also be integrated to implement the input and output functions of the mobile phone.
- the cell phone may also include at least one sensor 6250, such as light sensors, motion sensors, and other sensors.
- the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display panel 6241 according to the brightness of the ambient light, and the proximity sensor may turn off the display panel 6241 and/or when the mobile phone is moved to the ear. or backlight.
- the accelerometer sensor can detect the magnitude of acceleration in all directions (usually three axes), and can detect the magnitude and direction of gravity when it is stationary.
- the audio circuit 6260, the speaker 6262, and the microphone 6262 can provide the audio interface between the user and the mobile phone.
- the audio circuit 6260 can convert received audio data into an electrical signal and transmit it to the speaker 6262, which converts it into a sound signal for output; on the other hand, the microphone 6262 converts a collected sound signal into an electrical signal, which is received by the audio circuit 6260 and converted into audio data; the audio data is then output to the processor 6280 for processing and sent, for example, to another mobile phone through the RF circuit 6210, or output to the memory 6220 for further processing.
- WiFi is a short-distance wireless transmission technology.
- the mobile phone can help users to send and receive emails, browse web pages and access streaming media through the WiFi module 6270. It provides users with wireless broadband Internet access.
- although FIG. 62 shows the WiFi module 6270, it can be understood that it is not a necessary component of the mobile phone.
- the processor 6280 is the control center of the mobile phone; it uses various interfaces and lines to connect all parts of the entire mobile phone, and performs various functions of the mobile phone and processes data by running or executing the software programs and/or modules stored in the memory 6220 and calling the data stored in the memory 6220.
- the processor 6280 may include one or more processing units; preferably, the processor 6280 may integrate an application processor and a modem processor, wherein the application processor mainly processes the operating system, user interface, and application programs, etc. , the modem processor mainly deals with wireless communication. It can be understood that, the above-mentioned modulation and demodulation processor may not be integrated into the processor 6280.
- the mobile phone also includes a power supply 6290 (such as a battery) for supplying power to various components.
- the power supply can be logically connected to the processor 6280 through a power management system, so as to manage charging, discharging, and power consumption management functions through the power management system.
- the mobile phone may also include a camera, a Bluetooth module, and the like, which will not be repeated here.
- the processor 6280 included in the mobile phone may perform the functions in the foregoing embodiments shown in FIG. 5 to FIG. 11 , which will not be repeated here.
- the disclosed system, apparatus and method may be implemented in other manners.
- the apparatus embodiments described above are only illustrative.
- the division of units is only a logical function division; in actual implementation there may be other division methods, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
- the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
- each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
- the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
- the integrated unit if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
- the technical solutions of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
- the aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or other media that can store program code.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Signal Processing (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Acoustics & Sound (AREA)
- Stereophonic System (AREA)
Abstract
Description
Claims (38)
- A rendering method, characterized by comprising: obtaining a first single-object audio track based on a multimedia file, where the first single-object audio track corresponds to a first sounding object; determining a first sound source position of the first sounding object based on reference information, where the reference information includes reference position information and/or media information of the multimedia file, and the reference position information is used to indicate the first sound source position; and performing spatial rendering on the first single-object audio track based on the first sound source position, to obtain a rendered first single-object audio track.
- The method according to claim 1, wherein the media information includes at least one of: text to be displayed in the multimedia file, an image to be displayed in the multimedia file, a music feature of music to be played in the multimedia file, and a sound source type corresponding to the first sounding object.
- The method according to claim 1 or 2, wherein the reference position information includes first position information of a sensor or second position information selected by a user.
- The method according to any one of claims 1 to 3, wherein the method further includes: determining a type of a playback device, where the playback device is used to play a target audio track, and the target audio track is obtained based on the rendered first single-object audio track; and the performing spatial rendering on the first single-object audio track based on the first sound source position includes: performing spatial rendering on the first single-object audio track based on the first sound source position and the type of the playback device.
- The method according to claim 2, wherein the reference information includes the media information, and when the media information includes the image and the image includes the first sounding object, the determining a first sound source position of the first sounding object based on reference information includes: determining third position information of the first sounding object in the image, where the third position information includes two-dimensional coordinates and a depth of the first sounding object in the image; and obtaining the first sound source position based on the third position information.
- The method according to claim 2 or 5, wherein the reference information includes the media information, and when the media information includes a music feature of music to be played in the multimedia file, the determining a first sound source position of the first sounding object based on reference information includes: determining the first sound source position based on an association relationship and the music feature, where the association relationship is used to represent an association between the music feature and the first sound source position.
- The method according to claim 2, wherein the reference information includes the media information, and when the media information includes text to be displayed in the multimedia file and the text contains position text related to a position, the determining a first sound source position of the first sounding object based on reference information includes: recognizing the position text; and determining the first sound source position based on the position text.
- The method according to claim 3, wherein the reference information includes the reference position information, and when the reference position information includes the first position information, before the determining a first sound source position of the first sounding object based on reference information, the method further includes: obtaining the first position information, where the first position information includes a first attitude angle of the sensor and a distance between the sensor and a playback device; and the determining a first sound source position of the first sounding object based on reference information includes: converting the first position information into the first sound source position.
- The method according to claim 3, wherein the reference information includes the reference position information, and when the reference position information includes the first position information, before the determining a first sound source position of the first sounding object based on reference information, the method further includes: obtaining the first position information, where the first position information includes a second attitude angle of the sensor and an acceleration of the sensor; and the determining a first sound source position of the first sounding object based on reference information includes: converting the first position information into the first sound source position.
- The method according to claim 3, wherein the reference information includes the reference position information, and when the reference position information includes the second position information, before the determining a first sound source position of the first sounding object based on reference information, the method further includes: providing a spherical view for the user to select from, where the center of the spherical view is the position of the user, and the radius of the spherical view is the distance between the user's position and a playback device; and obtaining the second position information selected by the user in the spherical view; and the determining a first sound source position of the first sounding object based on reference information includes: converting the second position information into the first sound source position.
- The method according to any one of claims 1 to 10, wherein the obtaining a first single-object audio track based on a multimedia file includes: separating the first single-object audio track from an original audio track in the multimedia file, where the original audio track is obtained by synthesizing at least the first single-object audio track and a second single-object audio track, and the second single-object audio track corresponds to a second sounding object.
- The method according to claim 11, wherein the separating the first single-object audio track from an original audio track in the multimedia file includes: separating the first single-object audio track from the original audio track through a trained separation network.
- The method according to claim 12, wherein the trained separation network is obtained by training the separation network with training data as the input of the separation network and with the goal that the value of a loss function is less than a first threshold, where the training data includes a training audio track, the training audio track is obtained by synthesizing at least an initial third single-object audio track and an initial fourth single-object audio track, the initial third single-object audio track corresponds to a third sounding object, the initial fourth single-object audio track corresponds to a fourth sounding object, the third sounding object is of the same type as the first sounding object, the second sounding object is of the same type as the fourth sounding object, and the output of the separation network includes a third single-object audio track obtained through separation; the loss function is used to indicate the difference between the third single-object audio track obtained through separation and the initial third single-object audio track.
- The method according to claim 4, wherein the performing spatial rendering on the first single-object audio track based on the first sound source position and the type of the playback device includes: if the playback device is N external playback devices, obtaining the rendered first single-object audio track through the following formula;
- The method according to claim 4, wherein the method further includes: obtaining a target audio track based on the rendered first single-object audio track, the original audio track in the multimedia file, and the type of the playback device; and sending the target audio track to the playback device, where the playback device is used to play the target audio track.
- The method according to claim 16, wherein the obtaining a target audio track based on the rendered first single-object audio track, the original audio track in the multimedia file, and the type of the playback device includes: if the type of the playback device is headphones, obtaining the target audio track through the following formula, where i indicates the left channel or the right channel, the target audio track at time t is obtained from X i (t), the original audio track at time t, the first single-object audio track that has not been rendered at time t, and the rendered first single-object audio track; a s (t) is the adjustment coefficient of the first sounding object at time t, h i,s (t) is the head-related transfer function (HRTF) filter coefficient of the left channel or the right channel corresponding to the first sounding object at time t, the HRTF filter coefficient is related to the first sound source position, o s (t) is the first single-object audio track at time t, and τ is the integration variable; S 1 is the sounding object that needs to be replaced in the original audio track, and if the first sounding object replaces a sounding object in the original audio track, S 1 is an empty set; S 2 is the sounding object added by the target audio track compared with the original audio track, and if the first sounding object is a copy of a sounding object in the original audio track, S 2 is an empty set; S 1 and/or S 2 are sounding objects of the multimedia file and include the first sounding object.
- The method according to claim 16, wherein the obtaining a target audio track based on the rendered first single-object audio track, the original audio track in the multimedia file, and the type of the playback device includes: if the type of the playback device is N external playback devices, obtaining the target audio track through the following formula, where i indicates the i-th channel in the multi-channel output, the target audio track at time t is obtained from X i (t), the original audio track at time t, the first single-object audio track that has not been rendered at time t, and the rendered first single-object audio track; a s (t) is the adjustment coefficient of the first sounding object at time t, g s (t) represents the translation coefficient of the first sounding object at time t, g i,s (t) represents the i-th row of g s (t), and o s (t) is the first single-object audio track at time t; S 1 is the sounding object that needs to be replaced in the original audio track, and if the first sounding object replaces a sounding object in the original audio track, S 1 is an empty set; S 2 is the sounding object added by the target audio track compared with the original audio track, and if the first sounding object is a copy of a sounding object in the original audio track, S 2 is an empty set; S 1 and/or S 2 are sounding objects of the multimedia file and include the first sounding object; λ i is the azimuth angle obtained by a calibrator calibrating the i-th external device, Φ i is the inclination angle obtained by the calibrator calibrating the i-th external device, r i is the distance between the i-th external device and the calibrator, N is a positive integer, i is a positive integer and i ≤ N, and the first sound source position is within the tetrahedron formed by the N external devices.
- A rendering device, characterized by comprising: an obtaining unit configured to obtain a first single-object audio track based on a multimedia file, where the first single-object audio track corresponds to a first sounding object; a determining unit configured to determine a first sound source position of the first sounding object based on reference information, where the reference information includes reference position information and/or media information of the multimedia file, and the reference position information is used to indicate the first sound source position; and a rendering unit configured to perform spatial rendering on the first single-object audio track based on the first sound source position, to obtain a rendered first single-object audio track.
- The rendering device according to claim 19, wherein the media information includes at least one of: text to be displayed in the multimedia file, an image to be displayed in the multimedia file, a music feature of music to be played in the multimedia file, and a sound source type corresponding to the first sounding object.
- The rendering device according to claim 19 or 20, wherein the reference position information includes first position information of a sensor or second position information selected by a user.
- The rendering device according to any one of claims 19 to 21, wherein the determining unit is further configured to determine a type of a playback device, the playback device is used to play a target audio track, and the target audio track is obtained based on the rendered first single-object audio track; and the rendering unit is specifically configured to perform spatial rendering on the first single-object audio track based on the first sound source position and the type of the playback device.
- The rendering device according to claim 20, wherein the reference information includes the media information, and when the media information includes the image and the image includes the first sounding object, the determining unit is specifically configured to determine third position information of the first sounding object in the image, where the third position information includes two-dimensional coordinates and a depth of the first sounding object in the image; and the determining unit is specifically configured to obtain the first sound source position based on the third position information.
- The rendering device according to claim 20 or 23, wherein the reference information includes the media information, and when the media information includes a music feature of music to be played in the multimedia file, the determining unit is specifically configured to determine the first sound source position based on an association relationship and the music feature, where the association relationship is used to represent an association between the music feature and the first sound source position.
- The rendering device according to claim 20, wherein the reference information includes the media information, and when the media information includes text to be displayed in the multimedia file and the text contains position text related to a position, the determining unit is specifically configured to recognize the position text; and the determining unit is specifically configured to determine the first sound source position based on the position text.
- The rendering device according to claim 21, wherein the reference information includes the reference position information, and when the reference position information includes the first position information, the obtaining unit is further configured to obtain the first position information, where the first position information includes a first attitude angle of the sensor and a distance between the sensor and a playback device; and the determining unit is specifically configured to convert the first position information into the first sound source position.
- The rendering device according to claim 21, wherein the reference information includes the reference position information, and when the reference position information includes the first position information, the obtaining unit is further configured to obtain the first position information, where the first position information includes a second attitude angle of the sensor and an acceleration of the sensor; and the determining unit is specifically configured to convert the first position information into the first sound source position.
- The rendering device according to claim 21, wherein the reference information includes the reference position information, and when the reference position information includes the second position information, the rendering device further includes: a providing unit configured to provide a spherical view for the user to select from, where the center of the spherical view is the position of the user, and the radius of the spherical view is the distance between the user's position and a playback device; the obtaining unit is further configured to obtain the second position information selected by the user in the spherical view; and the determining unit is specifically configured to convert the second position information into the first sound source position.
- The rendering device according to any one of claims 19 to 28, wherein the obtaining unit is specifically configured to separate the first single-object audio track from an original audio track in the multimedia file, where the original audio track is obtained by synthesizing at least the first single-object audio track and a second single-object audio track, and the second single-object audio track corresponds to a second sounding object.
- The rendering device according to claim 29, wherein the obtaining unit is specifically configured to separate the first single-object audio track from the original audio track through a trained separation network.
- The rendering device according to claim 30, wherein the trained separation network is obtained by training the separation network with training data as the input of the separation network and with the goal that the value of a loss function is less than a first threshold, where the training data includes a training audio track, the training audio track is obtained by synthesizing at least an initial third single-object audio track and an initial fourth single-object audio track, the initial third single-object audio track corresponds to a third sounding object, the initial fourth single-object audio track corresponds to a fourth sounding object, the third sounding object is of the same type as the first sounding object, the second sounding object is of the same type as the fourth sounding object, and the output of the separation network includes a third single-object audio track obtained through separation; the loss function is used to indicate the difference between the third single-object audio track obtained through separation and the initial third single-object audio track.
- The rendering device according to claim 22, wherein if the playback device is N external playback devices, the obtaining unit is specifically configured to obtain the rendered first single-object audio track through the following formula;
- The rendering device according to claim 22, wherein the obtaining unit is further configured to obtain a target audio track based on the rendered first single-object audio track and the original audio track in the multimedia file; and the rendering device further includes: a sending unit configured to send the target audio track to the playback device, where the playback device is used to play the target audio track.
- The rendering device according to claim 34, wherein if the playback device is headphones, the obtaining unit is specifically configured to obtain the target audio track through the following formula, where i indicates the left channel or the right channel, the target audio track at time t is obtained from X i (t), the original audio track at time t, the first single-object audio track that has not been rendered at time t, and the rendered first single-object audio track; a s (t) is the adjustment coefficient of the first sounding object at time t, h i,s (t) is the head-related transfer function (HRTF) filter coefficient of the left channel or the right channel corresponding to the first sounding object at time t, the HRTF filter coefficient is related to the first sound source position, o s (t) is the first single-object audio track at time t, and τ is the integration variable; S 1 is the sounding object that needs to be replaced in the original audio track, and if the first sounding object replaces a sounding object in the original audio track, S 1 is an empty set; S 2 is the sounding object added by the target audio track compared with the original audio track, and if the first sounding object is a copy of a sounding object in the original audio track, S 2 is an empty set; S 1 and/or S 2 are sounding objects of the multimedia file and include the first sounding object.
- The rendering device according to claim 34, wherein if the playback device is N external playback devices, the obtaining unit is specifically configured to obtain the target audio track through the following formula, where i indicates the i-th channel in the multi-channel output, the target audio track at time t is obtained from X i (t), the original audio track at time t, the first single-object audio track that has not been rendered at time t, and the rendered first single-object audio track; a s (t) is the adjustment coefficient of the first sounding object at time t, g s (t) represents the translation coefficient of the first sounding object at time t, g i,s (t) represents the i-th row of g s (t), and o s (t) is the first single-object audio track at time t; S 1 is the sounding object that needs to be replaced in the original audio track, and if the first sounding object replaces a sounding object in the original audio track, S 1 is an empty set; S 2 is the sounding object added by the target audio track compared with the original audio track, and if the first sounding object is a copy of a sounding object in the original audio track, S 2 is an empty set; S 1 and/or S 2 are sounding objects of the multimedia file and include the first sounding object; λ i is the azimuth angle obtained by a calibrator calibrating the i-th external device, Φ i is the inclination angle obtained by the calibrator calibrating the i-th external device, r i is the distance between the i-th external device and the calibrator, N is a positive integer, i is a positive integer and i ≤ N, and the first sound source position is within the tetrahedron formed by the N external devices.
- A rendering device, characterized by comprising: a processor, where the processor is coupled to a memory, the memory is configured to store a program or instructions, and when the program or instructions are executed by the processor, the rendering device is caused to perform the method according to any one of claims 1 to 18.
- A computer-readable storage medium, characterized in that the computer-readable storage medium stores instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 18.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2023565286A JP2024515736A (ja) | 2021-04-29 | 2022-04-18 | レンダリング方法および関連するデバイス |
EP22794645.6A EP4294026A1 (en) | 2021-04-29 | 2022-04-18 | Rendering method and related device |
US18/498,002 US20240064486A1 (en) | 2021-04-29 | 2023-10-30 | Rendering method and related device |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110477321.0 | 2021-04-29 | ||
CN202110477321.0A CN115278350A (zh) | 2021-04-29 | 2021-04-29 | 一种渲染方法及相关设备 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US18/498,002 Continuation US20240064486A1 (en) | 2021-04-29 | 2023-10-30 | Rendering method and related device |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022228174A1 true WO2022228174A1 (zh) | 2022-11-03 |
Family
ID=83745121
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/087353 WO2022228174A1 (zh) | 2021-04-29 | 2022-04-18 | 一种渲染方法及相关设备 |
Country Status (5)
Country | Link |
---|---|
US (1) | US20240064486A1 (zh) |
EP (1) | EP4294026A1 (zh) |
JP (1) | JP2024515736A (zh) |
CN (1) | CN115278350A (zh) |
WO (1) | WO2022228174A1 (zh) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106448687A (zh) * | 2016-09-19 | 2017-02-22 | 中科超影(北京)传媒科技有限公司 | 音频制作及解码的方法和装置 |
CN109983786A (zh) * | 2016-11-25 | 2019-07-05 | 索尼公司 | 再现装置、再现方法、信息处理装置、信息处理方法以及程序 |
CN110972053A (zh) * | 2019-11-25 | 2020-04-07 | 腾讯音乐娱乐科技(深圳)有限公司 | 构造听音场景的方法和相关装置 |
CN111526242A (zh) * | 2020-04-30 | 2020-08-11 | 维沃移动通信有限公司 | 音频处理方法、装置和电子设备 |
CN112037738A (zh) * | 2020-08-31 | 2020-12-04 | 腾讯音乐娱乐科技(深圳)有限公司 | 一种音乐数据的处理方法、装置及计算机存储介质 |
CN112291615A (zh) * | 2020-10-30 | 2021-01-29 | 维沃移动通信有限公司 | 音频输出方法、音频输出装置 |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9980078B2 (en) * | 2016-10-14 | 2018-05-22 | Nokia Technologies Oy | Audio object modification in free-viewpoint rendering |
US10178490B1 (en) * | 2017-06-30 | 2019-01-08 | Apple Inc. | Intelligent audio rendering for video recording |
US10872115B2 (en) * | 2018-03-19 | 2020-12-22 | Motorola Mobility Llc | Automatically associating an image with an audio track |
-
2021
- 2021-04-29 CN CN202110477321.0A patent/CN115278350A/zh active Pending
-
2022
- 2022-04-18 JP JP2023565286A patent/JP2024515736A/ja active Pending
- 2022-04-18 WO PCT/CN2022/087353 patent/WO2022228174A1/zh active Application Filing
- 2022-04-18 EP EP22794645.6A patent/EP4294026A1/en active Pending
-
2023
- 2023-10-30 US US18/498,002 patent/US20240064486A1/en active Pending
Also Published As
Publication number | Publication date |
---|---|
JP2024515736A (ja) | 2024-04-10 |
US20240064486A1 (en) | 2024-02-22 |
CN115278350A (zh) | 2022-11-01 |
EP4294026A1 (en) | 2023-12-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020224322A1 (zh) | 音乐文件的处理方法、装置、终端及存储介质 | |
JP7283496B2 (ja) | 情報処理方法、情報処理装置およびプログラム | |
US20200374645A1 (en) | Augmented reality platform for navigable, immersive audio experience | |
CN109564504A (zh) | 用于基于移动处理空间化音频的多媒体装置 | |
CN111179961A (zh) | 音频信号处理方法、装置、电子设备及存储介质 | |
CN112037738B (zh) | 一种音乐数据的处理方法、装置及计算机存储介质 | |
US20060179160A1 (en) | Orchestral rendering of data content based on synchronization of multiple communications devices | |
CN108270794B (zh) | 内容发布方法、装置及可读介质 | |
EP4336490A1 (en) | Voice processing method and related device | |
CN110322760A (zh) | 语音数据生成方法、装置、终端及存储介质 | |
WO2021114808A1 (zh) | 音频处理方法、装置、电子设备和存储介质 | |
JP7277611B2 (ja) | テキスト類似性を使用した視覚的タグのサウンドタグへのマッピング | |
WO2023207541A1 (zh) | 一种语音处理方法及相关设备 | |
CN113823250B (zh) | 音频播放方法、装置、终端及存储介质 | |
CN114073854A (zh) | 基于多媒体文件的游戏方法和*** | |
JP2021101252A (ja) | 情報処理方法、情報処理装置およびプログラム | |
CN111428079B (zh) | 文本内容处理方法、装置、计算机设备及存储介质 | |
US20220246135A1 (en) | Information processing system, information processing method, and recording medium | |
Heise et al. | Soundtorch: Quick browsing in large audio collections | |
WO2022267468A1 (zh) | 一种声音处理方法及其装置 | |
US20200073885A1 (en) | Image display apparatus and operation method of the same | |
EP4252195A1 (en) | Real world beacons indicating virtual locations | |
CN114286275A (zh) | 一种音频处理方法及装置、存储介质 | |
CN110087122A (zh) | 用于处理信息的***、方法和装置 | |
WO2022228174A1 (zh) | 一种渲染方法及相关设备 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22794645 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2022794645 Country of ref document: EP |
|
ENP | Entry into the national phase |
Ref document number: 2022794645 Country of ref document: EP Effective date: 20230914 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2023565286 Country of ref document: JP |
|
NENP | Non-entry into the national phase |
Ref country code: DE |