WO2022228174A1 - Rendering method and related device - Google Patents

Rendering method and related device (一种渲染方法及相关设备)

Info

Publication number
WO2022228174A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio track
sound
sounding
sound source
rendering
Prior art date
Application number
PCT/CN2022/087353
Other languages
English (en)
French (fr)
Inventor
杜旭浩
李向宇
李硕
范泛
李海婷
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to JP2023565286A (JP2024515736A)
Priority to EP22794645.6A (EP4294026A1)
Publication of WO2022228174A1
Priority to US18/498,002 (US20240064486A1)

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00 Computing arrangements based on biological models
                    • G06N 3/02 Neural networks
                        • G06N 3/04 Architecture, e.g. interconnection topology
                            • G06N 3/044 Recurrent networks, e.g. Hopfield networks
                            • G06N 3/0464 Convolutional networks [CNN, ConvNet]
                            • G06N 3/048 Activation functions
                            • G06N 3/0499 Feedforward networks
                        • G06N 3/08 Learning methods
                            • G06N 3/084 Backpropagation, e.g. using gradient descent
                            • G06N 3/09 Supervised learning
                            • G06N 3/092 Reinforcement learning
        • G10 MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
                    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
                        • G10L 21/0272 Voice signal separating
    • H ELECTRICITY
        • H04 ELECTRIC COMMUNICATION TECHNIQUE
            • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
                • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
                    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
                        • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
                            • H04N 21/439 Processing of audio elementary streams
                                • H04N 21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
                            • H04N 21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
                                • H04N 21/44012 Processing of video elementary streams involving rendering scenes according to scene graphs, e.g. MPEG-4 scene graphs
                    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
                        • H04N 21/81 Monomedia components thereof
                            • H04N 21/8106 Monomedia components thereof involving special audio data, e.g. different tracks for different languages
            • H04S STEREOPHONIC SYSTEMS
                • H04S 3/00 Systems employing more than two channels, e.g. quadraphonic
                    • H04S 3/008 Systems employing more than two channels in which the audio signals are in digital form, i.e. employing more than two discrete digital channels
                • H04S 7/00 Indicating arrangements; Control arrangements, e.g. balance control
                    • H04S 7/30 Control circuits for electronic adaptation of the sound field
                        • H04S 7/301 Automatic calibration of stereophonic sound system, e.g. with test microphone
                        • H04S 7/302 Electronic adaptation of stereophonic sound system to listener position or orientation
                            • H04S 7/303 Tracking of listener position or orientation
                                • H04S 7/304 For headphones
                    • H04S 7/40 Visual indication of stereophonic sound image
                • H04S 2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
                    • H04S 2400/01 Multi-channel, i.e. more than two input channels, sound reproduction with two speakers wherein the multi-channel information is substantially preserved
                    • H04S 2400/11 Positioning of individual sound objects, e.g. moving airplane, within a sound field
                • H04S 2420/00 Techniques used in stereophonic systems covered by H04S but not provided for in its groups
                    • H04S 2420/01 Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]

Definitions

  • the present application relates to the field of audio applications, and in particular, to a rendering method and related equipment.
  • audio and video playback devices can use processing technologies such as head related transfer function (HRTF) to process the audio and video data to be played.
  • HRTF head related transfer function
  • the embodiments of the present application provide a rendering method, which can improve the spatial stereo effect of the first single-object audio track corresponding to the first sound-emitting object in a multimedia file and provide the user with an immersive stereo sound effect.
  • a first aspect of the embodiments of the present application provides a rendering method, which can be applied to scenarios such as music production and film and television production.
  • the method may be executed by a rendering device, or by a component of the rendering device (for example, a processor, a chip, or a chip system).
  • the method includes: obtaining a first single-object audio track based on a multimedia file, where the first single-object audio track corresponds to a first sound-emitting object; determining a first sound source position of the first sound-emitting object based on reference information, where the reference information includes reference position information and/or media information of the multimedia file, and the reference position information is used to indicate the first sound source position; and spatially rendering the first single-object audio track based on the first sound source position to obtain a rendered first single-object audio track.
  • in other words, the first single-object audio track is obtained based on the multimedia file and corresponds to the first sound-emitting object; the first sound source position of the first sound-emitting object is determined based on the reference information; and the first single-object audio track is spatially rendered based on the first sound source position to obtain the rendered first single-object audio track.
  • in this way, the spatial stereo effect of the first single-object audio track corresponding to the first sound-emitting object in the multimedia file can be improved, and the user can be provided with an immersive stereo sound effect (an illustrative sketch of this flow is given below).
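  • As a purely illustrative sketch of the claimed flow, the following Python fragment assumes the first single-object audio track has already been separated into a mono NumPy array, and stands in a simple constant-power pan for the spatial rendering step; the function names and the panning law are assumptions made for illustration, not the rendering described in the application.

        import numpy as np

        def spatial_render(mono_track, azimuth_deg):
            # Toy stand-in for spatial rendering: place the separated single-object
            # track at the determined azimuth with a constant-power stereo pan.
            pan = np.deg2rad(np.clip(azimuth_deg, -90.0, 90.0))   # -90 = hard left, +90 = hard right
            left_gain = np.cos((pan + np.pi / 2) / 2)
            right_gain = np.sin((pan + np.pi / 2) / 2)
            return np.stack([left_gain * mono_track, right_gain * mono_track])

        # Example: a separated single-object track placed 30 degrees to the right.
        first_single_object_track = np.random.randn(48000)        # 1 s of audio at 48 kHz
        rendered_track = spatial_render(first_single_object_track, azimuth_deg=30.0)
        print(rendered_track.shape)                               # (2, 48000)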
  • the media information in the above steps includes at least one of: text to be displayed in the multimedia file, an image to be displayed in the multimedia file, a music feature of the music to be played in the multimedia file, and a sound source type corresponding to the first sound-emitting object.
  • in this way, the rendering device can set the orientation and motion of the extracted sound-emitting object according to the music features, so that the 3D rendering of the corresponding audio track is more natural and better reflects the artistic intent. If the media information includes text or images, the 3D immersive effect is rendered for headphones or external loudspeakers so that the sound follows the picture, giving the user a realistic sound experience. In addition, if the media information includes video, tracking a sound-emitting object through the video and rendering its audio track over the whole video can also be used in professional mixing post-production to improve the mixing engineer's efficiency.
  • the reference location information in the above steps includes first location information of the sensor or second location information selected by the user.
  • the reference position information includes the first position information of the sensor
  • the user can perform real-time or later dynamic rendering on the selected sound-emitting object through the orientation or position provided by the sensor.
  • the control can give specific spatial orientation and motion to the sounding object, realize the interactive creation between the user and the audio, and provide the user with a new experience.
  • the reference position information includes the second position information selected by the user
  • the user can control the selected sound-emitting object by dragging on the interface and perform real-time or later dynamic rendering, giving the object a specific spatial orientation and motion; this enables interactive creation between the user and the audio and provides the user with a new experience.
  • the above steps further include: determining a type of the playback device, where the playback device is used to play a target audio track and the target audio track is obtained based on the rendered first single-object audio track; and performing spatial rendering on the first single-object audio track based on the first sound source position includes: performing spatial rendering on the first single-object audio track based on the first sound source position and the type of the playback device.
  • the type of playback device is considered when spatially rendering the audio track.
  • Different playback device types can correspond to different spatial rendering formulas, so that the spatial effect of the rendered first single-object audio track by the playback device in the later stage is more realistic and accurate.
  • the reference information in the above steps includes media information, and when the media information includes an image and the image includes the first sound-emitting object, determining the first sound source position of the first sound-emitting object based on the reference information includes: determining third position information of the first sound-emitting object in the image, where the third position information includes the two-dimensional coordinates and the depth of the first sound-emitting object in the image; and obtaining the first sound source position based on the third position information.
  • in this way, the 3D immersive effect is rendered for headphones or external loudspeakers and the sound follows the picture, allowing the user to obtain a realistic sound experience.
  • after a sound-emitting object is selected, the technique of tracking the object through the entire video and rendering its audio accordingly can also be applied in professional mixing post-production, improving the mixing engineer's efficiency (one way to map the image position to a sound source position is sketched below).
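  • The mapping from two-dimensional image coordinates plus depth to a three-dimensional sound source position is not specified in detail here; one common way to realize it, shown purely as an assumption-laden sketch, is pinhole back-projection with assumed camera intrinsics (fx, fy, cx, cy are hypothetical values).

        import numpy as np

        def image_position_to_source_position(u, v, depth, fx=1000.0, fy=1000.0, cx=960.0, cy=540.0):
            # Back-project the object's pixel coordinates (u, v) and depth (metres along
            # the optical axis) into a 3-D position in the camera/listener frame.
            x = (u - cx) * depth / fx
            y = (v - cy) * depth / fy
            return np.array([x, y, depth])

        # Example: the first sound-emitting object detected at pixel (1200, 500), 3 m away.
        print(image_position_to_source_position(1200, 500, 3.0))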
  • the reference information in the above steps includes media information, and when the media information includes a music feature of the music to be played in the multimedia file, determining the first sound source position of the first sound-emitting object based on the reference information includes: determining the first sound source position based on an association relationship and the music feature, where the association relationship represents the association between the music feature and the first sound source position.
  • in this way, the orientation and motion of the extracted sound-emitting object are set according to the musical characteristics of the music, so that the 3D rendering is more natural and the artistry is better reflected (a hypothetical association table is sketched below).
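  • The application does not reproduce the association relationship itself here; the following Python fragment is only a hypothetical illustration of how such a relationship might map identified music features (structure, emotion) to a sound source placement. The table entries and field names are invented for the example.

        # Hypothetical association between music features and sound source placement.
        association = {
            ("chorus", "excited"): {"azimuth_deg": 0.0,   "elevation_deg": 20.0,  "distance_m": 1.0},
            ("verse",  "calm"):    {"azimuth_deg": -30.0, "elevation_deg": 0.0,   "distance_m": 2.0},
            ("bridge", "sad"):     {"azimuth_deg": 30.0,  "elevation_deg": -10.0, "distance_m": 2.5},
        }

        def position_from_music_feature(structure, emotion):
            # Fall back to a neutral frontal position if the feature pair is unknown.
            return association.get((structure, emotion),
                                   {"azimuth_deg": 0.0, "elevation_deg": 0.0, "distance_m": 1.5})

        print(position_from_music_feature("chorus", "excited"))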
  • the reference information in the above steps includes media information, and when the media information includes text to be displayed in the multimedia file and the text contains position-related text, determining the first sound source position of the first sound-emitting object based on the reference information includes: identifying the position text; and determining the first sound source position based on the position text.
  • the reference information in the above steps includes reference position information, and when the reference position information includes the first position information, the method further includes: acquiring the first position information, where the first position information includes the first attitude angle of the sensor and the distance between the sensor and the playback device; and determining the first sound source position of the first sound-emitting object based on the reference information includes: converting the first position information into the first sound source position.
  • in this way, the user can perform real-time or later dynamic rendering of the selected sound-emitting object through the orientation (that is, the first attitude angle) provided by the sensor.
  • the sensor acts like a laser pointer: the direction in which it points indicates the position of the sound source.
  • this control can give a specific spatial orientation and motion to the sounding object, realize interactive creation between the user and the audio, and provide the user with a new experience (a conversion sketch is given below).
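  • Converting the first attitude angle and the sensor-to-playback-device distance into a sound source position amounts to a spherical-to-Cartesian conversion; the following sketch assumes the attitude angle is given as an azimuth and a pitch in degrees (an assumption made for illustration).

        import numpy as np

        def sensor_pointing_to_source_position(azimuth_deg, pitch_deg, distance_m):
            # Treat the sensor like a laser pointer: its attitude angle gives the direction,
            # and the distance to the playback device gives the range of the sound source.
            az, pitch = np.deg2rad(azimuth_deg), np.deg2rad(pitch_deg)
            return distance_m * np.array([np.cos(pitch) * np.cos(az),
                                          np.cos(pitch) * np.sin(az),
                                          np.sin(pitch)])

        # Example: sensor pointing 45 degrees left of front, 10 degrees upward, 2 m away.
        print(sensor_pointing_to_source_position(45.0, 10.0, 2.0))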
  • the reference information in the above steps includes reference position information, and when the reference position information includes the first position information, the method further includes: acquiring the first position information, where the first position information includes the second attitude angle of the sensor and the acceleration of the sensor; and determining the first sound source position of the first sound-emitting object based on the reference information includes: converting the first position information into the first sound source position.
  • in this way, the user can use the actual position information of the sensor as the sound source position to control the sound-emitting object and perform real-time or later dynamic rendering, so that the movement trajectory of the sound-emitting object can be fully controlled by the user, which increases editing flexibility (a rough dead-reckoning sketch is given below).
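  • One way to turn an attitude angle plus acceleration into a position trajectory is simple dead reckoning, sketched below under strong simplifying assumptions (yaw-only rotation, no gravity compensation or drift correction); it is meant only to illustrate the idea, not the application's conversion.

        import numpy as np

        def integrate_sensor_track(accelerations, yaw_angles_deg, dt=0.01):
            # Rotate each body-frame acceleration sample into the world frame using the
            # sensor's attitude (yaw only, for brevity), then integrate twice to obtain
            # a position trajectory used as the moving sound source position.
            positions, velocity, position = [], np.zeros(3), np.zeros(3)
            for acc, yaw_deg in zip(accelerations, yaw_angles_deg):
                yaw = np.deg2rad(yaw_deg)
                rot = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                                [np.sin(yaw),  np.cos(yaw), 0.0],
                                [0.0,          0.0,         1.0]])
                velocity = velocity + rot @ np.asarray(acc, dtype=float) * dt
                position = position + velocity * dt
                positions.append(position.copy())
            return np.array(positions)

        # Example: constant forward acceleration for 1 s while the sensor is yawed 15 degrees.
        trajectory = integrate_sensor_track([[0.1, 0.0, 0.0]] * 100, [15.0] * 100)
        print(trajectory[-1])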
  • the reference information in the above steps includes reference position information, and when the reference position information includes the second position information, the method further includes: providing a spherical view for the user to select from, where the center of the spherical view is the position of the user and the radius of the spherical view is the distance between the user's position and the playback device; and acquiring the second position information selected by the user in the spherical view; determining the first sound source position of the first sound-emitting object based on the reference information includes: converting the second position information into the first sound source position.
  • in this way, the user can select the second position information through the spherical view (for example, by clicking, dragging, or sliding) to control the selected sound-emitting object and perform real-time or later dynamic rendering, giving it a specific spatial orientation and motion; this enables interactive creation between the user and the audio and provides the user with a new experience.
  • in the above steps, obtaining the first single-object audio track based on the multimedia file includes: separating the first single-object audio track from the original audio track in the multimedia file, where the original audio track is obtained by synthesizing at least the first single-object audio track and a second single-object audio track, and the second single-object audio track corresponds to a second sounding object.
  • because the original audio track is composed of at least the first single-object audio track and the second single-object audio track, performing spatial rendering on a specific sound-emitting object enhances the user's ability to edit the audio; this can be applied to object-based production of music and of film and television works, increasing the user's control over, and the playability of, the music.
  • the above step of separating the first single-object audio track from the original audio track in the multimedia file includes: separating the first single-object audio track from the original audio track through a trained separation network.
  • in this way, the first single-object audio track is separated through the separation network, and the specific sound-emitting object can then be spatially rendered, which enhances the user's ability to edit the audio and can be applied to object-based production of music and of film and television works, increasing the user's control over, and the playability of, the music.
  • the trained separation network in the above steps is obtained by training the separation network with the training data as the input of the separation network and with the goal of making the value of the loss function less than a first threshold; the training data includes a training audio track.
  • the training audio track is obtained by synthesizing at least an initial third single-object audio track and an initial fourth single-object audio track; the initial third single-object audio track corresponds to a third sounding object, the initial fourth single-object audio track corresponds to a fourth sounding object, the third sounding object is of the same type as the first sounding object, and the second sounding object is of the same type as the fourth sounding object; the output of the separation network includes the separated third single-object audio track; and the loss function is used to indicate the difference between the separated third single-object audio track and the initial third single-object audio track.
  • in this way, the separation network is trained with the goal of reducing the value of the loss function, that is, the difference between the third single-object audio track output by the separation network and the initial third single-object audio track is continuously reduced, which makes the single-object audio tracks separated by the separation network more accurate (a toy training sketch is given below).
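  • As a toy illustration of this training objective only: the sketch below mixes two synthetic single-object tracks, trains a small mask-based network to recover one of them, and stops once the loss falls below a threshold. The network architecture, the MSE loss, and all hyperparameters are assumptions for the example; the application does not specify them here.

        import torch
        import torch.nn as nn

        class ToySeparationNet(nn.Module):
            # Tiny mask estimator; the real separation network architecture is not specified here.
            def __init__(self):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(),
                    nn.Conv1d(16, 1, kernel_size=9, padding=4), nn.Sigmoid())

            def forward(self, mixture):
                return mixture * self.net(mixture)   # estimated third single-object track

        model, loss_fn = ToySeparationNet(), nn.MSELoss()
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
        first_threshold = 1e-3

        # Training track = initial third single-object track + initial fourth single-object track.
        third = 0.5 * torch.randn(8, 1, 4096)
        fourth = 0.5 * torch.randn(8, 1, 4096)
        mixture = third + fourth

        for step in range(1000):
            optimizer.zero_grad()
            loss = loss_fn(model(mixture), third)    # difference between separated and initial third track
            loss.backward()
            optimizer.step()
            if loss.item() < first_threshold:        # stop once the loss is below the first threshold
                break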
  • in the above steps, performing spatial rendering on the first single-object audio track based on the first sound source position and the type of the playback device includes: if the playback device is an earphone, obtaining the rendered first single-object audio track through the following formula;
  • or, if the playback device includes N external playback devices, obtaining the rendered first single-object audio track through the following formula;
  • where θ_i is the azimuth angle obtained by the calibrator calibrating the i-th external device, φ_i is the inclination angle obtained by the calibrator calibrating the i-th external device, r_i is the distance between the i-th external device and the calibrator, N is a positive integer, i is a positive integer, i ≤ N, and the first sound source position is within the tetrahedron formed by the N external devices (a position-based panning sketch is given below).
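  • The application's rendering formulas are not reproduced above; as a hedged illustration of the loudspeaker case only, the sketch below converts each calibrated (azimuth, inclination, distance) triple to Cartesian coordinates and computes position-based gains as the barycentric coordinates of a source inside the tetrahedron formed by four external devices. This panning law is an assumption for illustration, not the formula claimed in the application.

        import numpy as np

        def spherical_to_cartesian(azimuth_deg, inclination_deg, distance):
            az, inc = np.deg2rad(azimuth_deg), np.deg2rad(inclination_deg)
            return distance * np.array([np.cos(inc) * np.cos(az),
                                        np.cos(inc) * np.sin(az),
                                        np.sin(inc)])

        def panning_gains(speaker_positions, source_position):
            # speaker_positions: (4, 3) array of external device positions forming a tetrahedron;
            # source_position: (3,) point inside it. Gains are barycentric weights, then normalized.
            A = np.vstack([np.asarray(speaker_positions).T, np.ones(4)])
            b = np.append(source_position, 1.0)
            g = np.linalg.solve(A, b)
            return g / np.linalg.norm(g)

        speakers = np.array([spherical_to_cartesian(az, inc, r)
                             for az, inc, r in [(45, 0, 2.0), (135, 0, 2.0), (-90, 10, 2.0), (0, 60, 2.0)]])
        source = speakers.mean(axis=0)               # centroid, guaranteed inside the tetrahedron
        print(panning_gains(speakers, source))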
  • the above steps further include: obtaining the target audio track based on the rendered first single-object audio track, the original audio track in the multimedia file, and the type of the playback device; and sending the target audio track to the playback device, where the playback device is used to play the target audio track.
  • the target audio track can be obtained, which facilitates saving the rendered audio track, facilitates subsequent playback, and reduces repeated rendering operations.
  • in the above steps, obtaining the target audio track based on the rendered first single-object audio track, the original audio track in the multimedia file, and the type of the playback device includes: if the type of the playback device is headphones, obtaining the target audio track through the following formula:
  • where i indicates the left or right channel; X_i(t) is the original audio track at time t; a_s(t) is the adjustment coefficient of the first sounding object at time t; h_{i,s}(t) is the head related transfer function (HRTF) filter coefficient of the left or right channel corresponding to the first sounding object at time t, and is related to the first sound source position; o_s(t) is the first single-object audio track at time t; the formula also includes an integral term; S_1 is the set of sounding objects in the original audio track that need to be replaced (an empty set if no sounding object in the original audio track is replaced); and S_2 is the set of sounding objects that the target audio track adds relative to the original audio track (an empty set if no sounding object is added).
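  • The headphone mixing formula itself does not appear in this text; based on the variables listed above, one plausible form (an inference, not necessarily the application's exact equation) is:

        Y_i(t) = X_i(t) - \sum_{s \in S_1} a_s(t) \int h_{i,s}(t-\tau)\, o_s(\tau)\, d\tau
                        + \sum_{s \in S_2} a_s(t) \int h_{i,s}(t-\tau)\, o_s(\tau)\, d\tau

    where Y_i(t) denotes the i-th (left or right) channel of the target audio track, the integral term is the convolution of the HRTF filter with the single-object track, objects in S_1 have their contribution removed from the original track, and objects in S_2 have their rendered contribution added.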
  • in this way, when the playback device is an earphone, the technical problem of how to obtain the target audio track is solved; it is convenient to save the rendered audio track, which facilitates subsequent playback and reduces repeated rendering operations.
  • in the above steps, obtaining the target audio track based on the rendered first single-object audio track, the original audio track in the multimedia file, and the type of the playback device includes: if the type of the playback device is N external devices, obtaining the target audio track through the following formula:
  • where i indicates the i-th channel among the multiple channels; X_i(t) is the original audio track at time t; a_s(t) is the adjustment coefficient of the first sound-emitting object at time t; g_s(t) is the panning (translation) coefficient of the first sound-emitting object at time t, and g_{i,s}(t) is the i-th row of g_s(t); o_s(t) is the first single-object audio track at time t; S_1 is the set of sounding objects in the original audio track that need to be replaced (an empty set if no sounding object in the original audio track is replaced); S_2 is the set of sounding objects that the target audio track adds relative to the original audio track (an empty set if no sounding object is added); S_1 and/or S_2 are sound-emitting objects of the multimedia file and include the first sound-emitting object; θ_i is the azimuth angle obtained by the calibrator calibrating the i-th external device, φ_i is the inclination angle obtained by the calibrator calibrating the i-th external device, and r_i is the distance between the i-th external device and the calibrator; N is a positive integer, i is a positive integer, i ≤ N, and the first sound source position is within the tetrahedron formed by the N external devices (a plausible form of this formula is sketched below).
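  • As with the headphone case, the exact loudspeaker formula is not present in this text; from the listed variables, a plausible form (again an inference, not the application's verbatim equation) is:

        Y_i(t) = X_i(t) - \sum_{s \in S_1} a_s(t)\, g_{i,s}(t)\, o_s(t)
                        + \sum_{s \in S_2} a_s(t)\, g_{i,s}(t)\, o_s(t)

    where Y_i(t) is the signal sent to the i-th channel (external device), and g_{i,s}(t) would be derived from the calibrated (θ_i, φ_i, r_i) of the external devices and the first sound source position, for example with a position-based panning law such as the one sketched earlier.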
  • in this way, when the playback device is an external playback device, the technical problem of how to obtain the target audio track is solved; it is convenient to save the rendered audio track, which facilitates subsequent playback and reduces repeated rendering operations.
  • the music features in the above steps include: at least one of music structure, music emotion, and singing mode.
  • the above steps further include: separating a second single-object audio track from the multimedia file; determining a second sound source position of the second sound-emitting object based on the reference information; and spatially rendering the second single-object audio track based on the second sound source position to obtain a rendered second single-object audio track.
  • in this way, at least two single-object audio tracks can be separated from the multimedia file and spatially rendered accordingly, which enhances the user's ability to edit specific sound-emitting objects in the audio and can be applied to object-based production of music and of film and television works, increasing the user's control over, and the playability of, the music.
  • a second aspect of the embodiments of the present application provides a rendering method, which can be applied to scenarios such as music production and film and television production; the method may be executed by a rendering device, or by a component of the rendering device (such as a processor, a chip, or a chip system).
  • the method includes: obtaining a multimedia file; obtaining a first single-object audio track based on the multimedia file, where the first single-object audio track corresponds to a first sounding object; displaying a user interface, where the user interface includes rendering mode options; in response to a first operation of the user, determining an automatic rendering mode or an interactive rendering mode from the rendering mode options; when the automatic rendering mode is determined, obtaining the rendered first single-object audio track based on a preset manner; or, when the interactive rendering mode is determined, obtaining reference position information in response to a second operation of the user, determining a first sound source position of the first sound-emitting object based on the reference position information, and rendering the first single-object audio track based on the first sound source position to obtain the rendered first single-object audio track.
  • the rendering device determines the automatic rendering mode or the interactive rendering mode from the rendering mode options according to the first operation of the user.
  • the rendering device may automatically acquire the rendered first single-object audio track based on the user's first operation.
  • the spatial rendering of the audio track corresponding to the first sound-emitting object in the multimedia file can be realized through the interaction between the rendering device and the user, so as to provide the user with an immersive stereo sound effect.
  • the preset manner in the above steps includes: acquiring media information of the multimedia file; determining the first sound source position of the first sounding object based on the media information; and rendering the first single-object audio track based on the first sound source position to obtain the rendered first single-object audio track.
  • the media information in the above steps includes at least one of: text to be displayed in the multimedia file, an image to be displayed in the multimedia file, a music feature of the music to be played in the multimedia file, and a sound source type corresponding to the first sound-emitting object.
  • the rendering device determines the multimedia file to be processed through interaction with the user, thereby increasing the user's controllability and playability of the music in the multimedia file.
  • the reference location information in the above steps includes first location information of the sensor or second location information selected by the user.
  • the type of the playback device is determined through the user's operation. Different playback device types can correspond to different spatial rendering formulas, so that the spatial effect of the rendered audio track played by the playback device in the later stage is more realistic and accurate.
  • in the above steps, when the media information includes an image and the image includes the first sound-emitting object, determining the first sound source position of the first sound-emitting object based on the media information includes: presenting the image; determining third position information of the first sound-emitting object in the image, where the third position information includes the two-dimensional coordinates and the depth of the first sound-emitting object in the image; and obtaining the first sound source position based on the third position information.
  • the rendering device may automatically present the image and determine the sound-emitting object in the image, obtain third position information of the sound-emitting object, and then obtain the position of the first sound source.
  • the rendering device can automatically identify the multimedia file, and when the multimedia file includes an image and the image includes the first sound-emitting object, the rendering device can automatically acquire the rendered first single-object audio track.
  • the 3D immersion is rendered in the headset or external environment, and the real sound moves with the picture, allowing users to obtain the best sound effect experience.
  • determining the third position information of the first sound-emitting object in the image includes: determining the third position information of the first sound-emitting object in response to a third operation of the user on the image.
  • in this way, the user may select the first sound-emitting object from the plurality of sound-emitting objects in the presented image, that is, the user may select which first single-object audio track corresponding to the first sound-emitting object is rendered.
  • the coordinates of the sounding object and its single-object audio track are extracted, and the 3D immersive effect is rendered for headphones or external loudspeakers so that the sound moves with the picture, allowing the user to obtain a realistic sound experience.
  • when the media information includes the music feature of the music to be played in the multimedia file, determining the first sound source position of the first sounding object based on the media information includes: identifying the music feature; and determining the first sound source position based on an association relationship and the music feature, where the association relationship represents the association between the music feature and the first sound source position.
  • the orientation and dynamics of the extracted specific sound-emitting objects are set according to the musical characteristics of the music, so that the 3D rendering is more natural and the artistry is better reflected.
  • when the media information includes text containing position-related text, determining the first sound source position of the first sounding object based on the media information includes: identifying the position text; and determining the first sound source position based on the position text.
  • obtaining the reference position information in response to a second operation of the user includes: acquiring first position information in response to the user's second operation on the sensor, where the first position information includes the first attitude angle of the sensor and the distance between the sensor and the playback device; and determining the first sound source position of the first sound-emitting object based on the reference position information includes: converting the first position information into the first sound source position.
  • the user can perform real-time or later dynamic rendering of the selected sound-emitting object through the orientation (ie, the first attitude angle) provided by the sensor.
  • the sensor is like a laser pointer, and the direction of the laser is the position of the sound source.
  • the control can give specific spatial orientation and motion to the sounding object, realize the interactive creation between the user and the audio, and provide the user with a new experience.
  • obtaining the reference position information in response to a second operation of the user includes: acquiring first position information in response to the user's second operation on the sensor, where the first position information includes the second attitude angle of the sensor and the acceleration of the sensor; and determining the first sound source position of the first sounding object based on the reference position information includes: converting the first position information into the first sound source position.
  • in this way, the user can use the actual position information of the sensor as the sound source position to control the sound-emitting object and perform real-time or later dynamic rendering, so that the movement trajectory of the sound-emitting object can be fully controlled by the user, which increases editing flexibility.
  • obtaining the reference position information in response to the second operation of the user includes: presenting a spherical view, where the center of the spherical view is the position of the user and the radius of the spherical view is the distance between the user's position and the playback device; and determining the second position information in the spherical view in response to the user's second operation; determining the first sound source position of the first sounding object based on the reference position information includes: converting the second position information into the first sound source position.
  • the user can select the second position information (such as clicking, dragging, sliding, etc.) through the spherical view to control the selected sound-emitting object and perform real-time or later dynamic rendering to give it a specific space Orientation and motion can realize interactive creation between users and audio, providing users with a new experience.
  • the above step: acquiring a multimedia file includes: determining a multimedia file from at least one stored multimedia file in response to a fourth operation of the user.
  • in this way, the multimedia file may be determined from at least one stored multimedia file based on the user's selection, and the first single-object audio track corresponding to the first sound-emitting object in the multimedia file selected by the user can then be rendered and produced, improving the user experience.
  • the above-mentioned user interface further includes a playback device type option; the method further includes: in response to a fifth operation of the user, determining the type of the playback device from the playback device type option; and rendering the first single-object audio track based on the first sound source position to obtain the rendered first single-object audio track includes: rendering the first single-object audio track based on the first sound source position and the type, to obtain the rendered first single-object audio track.
  • a rendering mode suitable for the playback device being used by the user is selected, thereby improving the rendering effect of the playback device and making the 3D rendering more natural.
  • in the above steps, obtaining the first single-object audio track based on the multimedia file includes: separating the first single-object audio track from the original audio track in the multimedia file, where the original audio track is obtained by synthesizing at least the first single-object audio track and the second single-object audio track, and the second single-object audio track corresponds to the second sounding object.
  • because the original audio track is composed of at least the first single-object audio track and the second single-object audio track, performing spatial rendering on a specific sound-emitting object enhances the user's ability to edit the audio; this can be applied to object-based production of music and of film and television works, increasing the user's control over, and the playability of, the music.
  • the first single-object audio track can be separated from the multimedia file, so as to render the single-object audio track corresponding to the specific sound-emitting object in the multimedia file, so as to improve the user's audio creation and improve the user's experience.
  • the user can perform real-time or later dynamic rendering of the selected sound-emitting object through the orientation or position provided by the sensor.
  • the control can give specific spatial orientation and motion to the sounding object, realize the interactive creation between the user and the audio, and provide the user with a new experience.
  • the user can perform real-time or later dynamic rendering of the selected sound-emitting object through the orientation (ie, the first attitude angle) provided by the sensor.
  • the sensor is like a laser pointer, and the direction of the laser is the position of the sound source.
  • the control can give specific spatial orientation and motion to the sounding object, realize the interactive creation between the user and the audio, and provide the user with a new experience.
  • in this way, the actual position information of the sensor is used as the sound source position to control the sound-emitting object and perform real-time or later dynamic rendering, so that the movement trajectory of the sound-emitting object can be simply and completely controlled by the user, which greatly increases editing flexibility.
  • the user can control the selected sound-emitting object by dragging on the interface and perform real-time or later dynamic rendering, giving it a specific spatial orientation and motion; this enables interactive creation between the user and the audio and provides the user with a new experience.
  • the rendering device can set the orientation and motion of the extracted sound-emitting object according to the musical characteristics of the music, so that the 3D rendering of the corresponding audio track is more natural and the artistry is better reflected.
  • the 3D immersion is rendered in the earphone or external environment, so that the real sound moves with the picture, so that the user can obtain the best sound effect experience.
  • the rendering device can automatically track the sound-emitting object in the video after determining the sound-emitting object, and render the audio track corresponding to the sound-emitting object in the entire video, which can also be used in professional sound mixing post-production. Improve mixer productivity.
  • the rendering device may determine the sound-emitting object in the image according to the fourth operation of the user, track the sound-emitting object in the image, and render the audio track corresponding to the sound-emitting object.
  • the control can give specific spatial orientation and motion to the sounding object, realize the interactive creation between the user and the audio, and provide the user with a new experience.
  • the music features in the above steps include: at least one of music structure, music emotion, and singing mode.
  • the above steps further include: separating a second single-object audio track from the original audio track; determining the second sound source position of the second sounding object based on the reference information; and performing spatial rendering on the second single-object audio track based on the second sound source position to obtain a rendered second single-object audio track.
  • in this way, at least two single-object audio tracks can be separated from the original audio track and spatially rendered accordingly, which enhances the user's ability to edit specific sound-emitting objects in the audio and can be applied to object-based production of music and of film and television works, increasing the user's control over, and the playability of, the music.
  • a third aspect of the present application provides a rendering device, which can be applied to scenes such as music, film and television production, and the like, and the rendering device includes:
  • an acquisition unit configured to acquire a first single-object audio track based on the multimedia file, and the first single-object audio track corresponds to the first sounding object;
  • a determining unit configured to determine the first sound source position of the first sound-emitting object based on reference information, the reference information includes reference position information and/or media information of a multimedia file, and the reference position information is used to indicate the first sound source position;
  • the rendering unit is configured to perform spatial rendering on the first single-object audio track based on the position of the first sound source, so as to obtain the rendered first single-object audio track.
  • the above-mentioned media information includes at least one of: text to be displayed in the multimedia file, an image to be displayed in the multimedia file, a music feature of the music to be played in the multimedia file, and a sound source type corresponding to the first sound-emitting object.
  • the above-mentioned reference location information includes first location information of the sensor or second location information selected by the user.
  • the above-mentioned determining unit is further configured to determine the type of the playback device, where the playback device is used to play the target audio track and the target audio track is obtained based on the rendered first single-object audio track; and the rendering unit is specifically configured to perform spatial rendering on the first single-object audio track based on the first sound source position and the type of the playback device.
  • the above-mentioned reference information includes media information, and when the media information includes an image and the image includes the first sounding object, the determining unit is specifically configured to determine third position information of the first sound-emitting object in the image, where the third position information includes the two-dimensional coordinates and the depth of the first sound-emitting object in the image, and to acquire the first sound source position based on the third position information.
  • the above-mentioned reference information includes media information, and when the media information includes the music feature of the music to be played in the multimedia file, the determining unit is specifically configured to determine the first sound source position based on an association relationship and the music feature, where the association relationship represents the association between the music feature and the first sound source position.
  • the above-mentioned reference information includes media information, and when the media information includes text to be displayed in the multimedia file and the text contains position-related text, the determining unit is specifically configured to identify the position text and to determine the first sound source position based on the position text.
  • the above-mentioned reference information includes reference position information, and when the reference position information includes the first position information, the obtaining unit is further configured to obtain the first position information, where the first position information includes the first attitude angle of the sensor and the distance between the sensor and the playback device; and the determining unit is specifically configured to convert the first position information into the first sound source position.
  • the above-mentioned reference information includes reference position information, and when the reference position information includes the first position information, the obtaining unit is further configured to obtain the first position information, where the first position information includes the second attitude angle of the sensor and the acceleration of the sensor; and the determining unit is specifically configured to convert the first position information into the first sound source position.
  • the above-mentioned reference information includes reference position information, and when the reference position information includes the second position information, the rendering device further includes a providing unit, configured to provide a spherical view for the user to select from, where the center of the spherical view is the position of the user and the radius of the spherical view is the distance between the user's position and the playback device; the acquiring unit is further configured to acquire the second position information selected by the user in the spherical view; and the determining unit is specifically configured to convert the second position information into the first sound source position.
  • the above-mentioned acquisition unit is specifically configured to separate the first single-object audio track from the original audio track in the multimedia file, where the original audio track is obtained by synthesizing at least the first single-object audio track and the second single-object audio track, and the second single-object audio track corresponds to the second sounding object.
  • the above obtaining unit is specifically configured to separate the first single-object audio track from the original audio track by using a trained separation network.
  • the above-mentioned trained separation network is obtained by training the separation network with the training data as the input of the separation network and with the goal of making the value of the loss function less than the first threshold; the training data includes a training audio track.
  • the training audio track is obtained by synthesizing at least the initial third single-object audio track and the initial fourth single-object audio track; the initial third single-object audio track corresponds to the third sounding object, the initial fourth single-object audio track corresponds to the fourth sounding object, the third sounding object is of the same type as the first sounding object, and the second sounding object is of the same type as the fourth sounding object; the output of the separation network includes the separated third single-object audio track; and the loss function is used to indicate the difference between the separated third single-object audio track and the initial third single-object audio track.
  • if the playback device is an earphone, the obtaining unit is specifically configured to obtain the rendered first single-object audio track through the following formula;
  • or, if the playback device includes N external playback devices, the obtaining unit is specifically configured to obtain the rendered first single-object audio track through the following formula;
  • where θ_i is the azimuth angle obtained by the calibrator calibrating the i-th external device, φ_i is the inclination angle obtained by the calibrator calibrating the i-th external device, r_i is the distance between the i-th external device and the calibrator, N is a positive integer, i is a positive integer, i ≤ N, and the first sound source position is within the tetrahedron formed by the N external devices.
  • the above-mentioned obtaining unit is further configured to obtain the target audio track based on the rendered first single-object audio track and the original audio track in the multimedia file;
  • the device further includes: a sending unit, used for sending the target audio track to the playback device, and the playback device is used for playing the target audio track.
  • the obtaining unit is specifically configured to obtain the target audio track by the following formula:
  • where i indicates the left or right channel; X_i(t) is the original audio track at time t; a_s(t) is the adjustment coefficient of the first sounding object at time t; h_{i,s}(t) is the HRTF filter coefficient of the left or right channel corresponding to the first sounding object at time t, and is related to the first sound source position; o_s(t) is the first single-object audio track at time t; the formula also includes an integral term; S_1 is the set of sounding objects in the original audio track that need to be replaced (an empty set if no sounding object in the original audio track is replaced); and S_2 is the set of sounding objects that the target audio track adds relative to the original audio track (an empty set if no sounding object is added).
  • the obtaining unit is specifically configured to obtain the target audio track by the following formula:
  • where i indicates the i-th channel among the multiple channels; X_i(t) is the original audio track at time t; a_s(t) is the adjustment coefficient of the first sound-emitting object at time t; g_s(t) is the panning (translation) coefficient of the first sound-emitting object at time t, and g_{i,s}(t) is the i-th row of g_s(t); o_s(t) is the first single-object audio track at time t; S_1 is the set of sounding objects in the original audio track that need to be replaced (an empty set if no sounding object in the original audio track is replaced); S_2 is the set of sounding objects that the target audio track adds relative to the original audio track (an empty set if no sounding object is added); S_1 and/or S_2 are sound-emitting objects of the multimedia file and include the first sound-emitting object; θ_i is the azimuth angle obtained by the calibrator calibrating the i-th external device, φ_i is the inclination angle obtained by the calibrator calibrating the i-th external device, and r_i is the distance between the i-th external device and the calibrator; N is a positive integer, i is a positive integer, i ≤ N, and the first sound source position is within the tetrahedron formed by the N external devices.
  • a fourth aspect of the present application provides a rendering device, which can be applied to scenes such as music, film and television production, and the like, and the rendering device includes:
  • the obtaining unit is also used to obtain the first single-object soundtrack based on the multimedia file, and the first single-object soundtrack corresponds to the first sounding object;
  • a display unit for displaying a user interface, the user interface including rendering mode options
  • a determination unit used for determining the automatic rendering mode or the interactive rendering mode from the rendering mode options in response to the first operation of the user on the user interface;
  • the obtaining unit is further configured to obtain the rendered first single-object audio track based on the preset mode when the automatic rendering mode is determined by the determining unit; or
  • the obtaining unit is further configured to, when the determining unit determines the interactive rendering mode, obtain the reference position information in response to the second operation of the user, determine the first sound source position of the first sound-emitting object based on the reference position information, and render the first single-object audio track based on the first sound source position to obtain the rendered first single-object audio track.
  • the above-mentioned preset manner includes: the acquiring unit is further configured to acquire media information of the multimedia file; the determining unit is further configured to determine the first sound source position of the first sound-emitting object based on the media information; and the obtaining unit is further configured to render the first single-object audio track based on the first sound source position to obtain the rendered first single-object audio track.
  • the above-mentioned media information includes at least one of: text to be displayed in the multimedia file, an image to be displayed in the multimedia file, a music feature of the music to be played in the multimedia file, and a sound source type corresponding to the first sound-emitting object.
  • the above-mentioned reference location information includes first location information of the sensor or second location information selected by the user.
  • when the above-mentioned media information includes an image and the image includes the first sound-emitting object, the determining unit is specifically configured to present the image and to determine third position information of the first sound-emitting object in the image, where the third position information includes the two-dimensional coordinates and the depth of the first sound-emitting object in the image; the determining unit is specifically configured to obtain the first sound source position based on the third position information.
  • the above-mentioned determining unit is specifically configured to determine the third position information of the first sound-emitting object in response to a third operation of the user on the image.
  • the determining unit is specifically used to identify the music feature
  • the determining unit is specifically configured to determine the position of the first sound source based on the association relationship and the music feature, and the association relationship is used to represent the association between the music feature and the position of the first sound source.
  • the determining unit is specifically used to identify the position text; the determining unit is specifically configured to determine the first sound source position based on the position text.
  • when the above-mentioned reference position information includes the first position information, the determining unit is specifically configured to obtain the first position information in response to the second operation of the user on the sensor.
  • the first position information includes the first attitude angle of the sensor and the distance between the sensor and the playback device; the determining unit is specifically configured to convert the first position information into the position of the first sound source.
  • when the above-mentioned reference position information includes the first position information, the determining unit is specifically configured to obtain the first position information in response to the second operation of the user on the sensor.
  • the first position information includes the second attitude angle of the sensor and the acceleration of the sensor; the determining unit is specifically configured to convert the first position information into the position of the first sound source.
  • when the above-mentioned reference position information includes the second position information, the determining unit is specifically configured to present a spherical view, where the center of the spherical view is the user's position and the radius of the spherical view is the distance between the user's position and the playback device; the determining unit is specifically configured to determine the second position information in the spherical view in response to the second operation of the user, and to convert the second position information into the first sound source position.
  • the above obtaining unit is specifically configured to determine a multimedia file from at least one stored multimedia file in response to a fourth operation of the user.
  • the above-mentioned user interface further includes a playback device type option; the determining unit is further configured to determine the type of the playback device from the playback device type option in response to the fifth operation of the user; the obtaining unit is specifically configured to render the first single-object audio track based on the first sound source position and the type of the playback device, so as to obtain the rendered first single-object audio track.
  • the above-mentioned obtaining unit is specifically used to separate the first single-object audio track from the original audio track in the multimedia file, where the original audio track is obtained by synthesizing at least the first single-object audio track and a second single-object audio track, and the second single-object audio track corresponds to a second sounding object.
  • the above-mentioned music features include: at least one of music structure, music emotion, and singing mode.
  • the above-mentioned obtaining unit is further configured to separate a second single-object audio track from the multimedia file, determine the second sound source position of the second sounding object, and spatially render the second single-object audio track based on the second sound source position to obtain the rendered second single-object audio track.
  • a fifth aspect of the present application provides a rendering device, where the rendering device executes the method in the foregoing first aspect or any possible implementation manner of the first aspect, or executes the method in the foregoing second aspect or any possible implementation manner of the second aspect.
  • a sixth aspect of the present application provides a rendering device, including: a processor, where the processor is coupled to a memory, and the memory is used to store programs or instructions; when the programs or instructions are executed by the processor, the rendering device is caused to implement the method in the foregoing first aspect or any possible implementation manner of the first aspect, or to implement the method in the foregoing second aspect or any possible implementation manner of the second aspect.
  • a seventh aspect of the present application provides a computer-readable medium on which a computer program or instruction is stored; when the computer program or instruction is run on a computer, the computer is caused to execute the method in the foregoing first aspect or any possible implementation manner of the first aspect, or to execute the method in the foregoing second aspect or any possible implementation manner of the second aspect.
  • an eighth aspect of the present application provides a computer program product which, when executed on a computer, causes the computer to execute the method in the foregoing first aspect or any possible implementation manner of the first aspect, or the method in the foregoing second aspect or any possible implementation manner of the second aspect.
  • the embodiments of the present application have the following advantages: a first single-object audio track is obtained based on a multimedia file, where the first single-object audio track corresponds to the first sounding object; the first sound source position of the first sounding object is determined based on reference information; and the first single-object audio track is spatially rendered based on the first sound source position to obtain the rendered first single-object audio track.
  • in this way, the spatial stereoscopic sense of the first single-object audio track corresponding to the first sound-emitting object in the multimedia file can be improved, providing the user with an immersive stereo sound effect.
  • FIG. 1 is a schematic structural diagram of a system architecture provided by the application.
  • FIG. 2 is a schematic structural diagram of a convolutional neural network provided by the application.
  • FIG. 3 is a schematic diagram of another convolutional neural network structure provided by the application.
  • FIG. 4 is a schematic diagram of a chip hardware structure provided by the application.
  • FIG. 5 is a schematic flowchart of a method for training a separation network provided by the application
  • FIG. 6 is a schematic structural diagram of a separation network provided by the application.
  • FIG. 7 is a schematic structural diagram of another separation network provided by the application.
  • FIG. 9 is a schematic diagram of an application scenario provided by the present application.
  • FIG. 10 is a schematic flowchart of a rendering method provided by the application.
  • FIG. 11 is a schematic flowchart of a playback device calibration method provided by the application.
  • FIG. 18 is a schematic diagram of the orientation of a mobile phone provided by the application.
  • FIG. 19 is another schematic diagram of the display interface of the rendering device provided by the application.
  • FIG. 20 is a schematic diagram of determining the position of a sound source using a mobile phone provided by the present application.
  • FIGS. 21-47 are other schematic diagrams of the display interface of the rendering device provided by the present application.
  • FIG. 48 is a schematic structural diagram of the external device system provided by the application in a spherical coordinate system.
  • FIGS. 49-50 are several schematic diagrams of sharing rendering rules between users provided by the present application.
  • FIGS. 51-53 are other schematic diagrams of the display interface of the rendering device provided by the present application.
  • FIG. 54 is a schematic diagram of user interaction under the sound hunter game scene provided by this application.
  • FIGS. 55-57 are several schematic diagrams of user interaction in a multi-person interaction scenario provided by the present application.
  • FIGS. 58-61 are schematic diagrams of several structures of the rendering device provided by this application.
  • FIG. 62 is a schematic structural diagram of the sensor device provided by the application.
  • the embodiment of the present application provides a rendering method, which can improve the stereoscopic sense of space of the first single-object soundtrack corresponding to the first sound-emitting object in the multimedia file, and provide users with immersive stereoscopic sound effects.
  • a neural network can be composed of neural units, and a neural unit can refer to an operation unit that takes Xs and an intercept 1 as input; the output of the operation unit can be f(∑s Ws·Xs + b), where:
  • W s is the weight of X s
  • b is the bias of the neural unit.
  • f is an activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer.
  • the activation function can be a sigmoid function.
  • a neural network is a network formed by connecting many of the above single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neural units.
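  • as an illustration only (not part of this application), the neural unit described above, with a sigmoid activation, can be sketched in Python as follows:

```python
import numpy as np

def sigmoid(z):
    # Sigmoid activation function f, which introduces nonlinearity.
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(x, w, b):
    # Output of one neural unit: f(sum_s W_s * X_s + b).
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs X_s
w = np.array([0.8, 0.1, -0.4])   # weights W_s
b = 0.2                          # bias b of the neural unit
print(neural_unit(x, w, b))
```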
  • a deep neural network (DNN), also known as a multi-layer neural network, can be understood as a neural network with many hidden layers; there is no special criterion for how many layers count as "many" here. According to the positions of the different layers, the layers inside a DNN can be divided into three categories: input layer, hidden layers, and output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the middle layers are all hidden layers. The layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the (i+1)-th layer. Of course, the deep neural network may also not include hidden layers, which is not limited here.
  • the work of each layer in a deep neural network can be described by the mathematical expression y = a(W·x + b): from a physical level, the work of each layer can be understood as completing the transformation from the input space to the output space (that is, from the row space to the column space of the matrix) through five operations on the input space (the set of input vectors). These five operations include: 1. dimension raising/lowering; 2. enlarging/reducing; 3. rotation; 4. translation; 5. "bending". The operations 1, 2 and 3 are completed by W·x, the operation 4 is completed by +b, and the operation 5 is implemented by a().
  • W is the weight vector, and each value in the vector represents the weight value of a neuron in the neural network of this layer.
  • This vector W determines the space transformation from the input space to the output space described above, that is, the weight W of each layer controls how the space is transformed.
  • the purpose of training the deep neural network is to finally obtain the weight matrix of all layers of the trained neural network (the weight matrix formed by the vectors W of many layers). Therefore, the training process of the neural network is essentially learning the way to control the spatial transformation, and more specifically, learning the weight matrix.
  • Convolutional neural network is a deep neural network with a convolutional structure.
  • a convolutional neural network consists of a feature extractor consisting of convolutional and subsampling layers.
  • the feature extractor can be viewed as a filter, and the convolution process can be viewed as convolving the same trainable filter with an input image or a convolutional feature map.
  • the convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal.
  • a neuron can only be connected to some of its neighbors.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some neural units arranged in a rectangle.
  • Neural units in the same feature plane share weights, and the shared weights here are convolution kernels.
  • sharing weights can be understood as meaning that the way image information is extracted is independent of location. The underlying principle is that the statistics of one part of an image are the same as those of the other parts, which means that image information learned in one part can also be used in another part. So for all positions on the image, the same learned image information can be used.
  • multiple convolution kernels can be used to extract different image information. Generally, the more convolution kernels, the richer the image information reflected by the convolution operation.
  • the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can acquire reasonable weights by learning during the training process of the convolutional neural network.
  • the immediate benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • the separation network, the identification network, the detection network, the depth estimation network and other networks in the embodiments of the present application may all be CNNs.
  • a recurrent neural network is used to process sequence data in which the current output is also related to the previous output. The specific manifestation is that the network memorizes the previous information, saves it in the internal state of the network, and applies it to the calculation of the current output.
  • HRTF Head related transfer function
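  • for illustration, HRTF-based binaural rendering typically convolves a mono source signal with the left-ear and right-ear head-related impulse responses for the desired direction; the sketch below uses random placeholder impulse responses and is not the rendering procedure of this application:

```python
import numpy as np

def binaural_render(mono, hrir_left, hrir_right):
    # Convolve the mono source with the left/right head-related impulse
    # responses to obtain a two-channel (binaural) signal.
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right])

mono = np.random.randn(48000)          # placeholder mono single-object track
hrir_l = np.random.randn(256) * 0.01   # placeholder HRIRs; real ones are
hrir_r = np.random.randn(256) * 0.01   # measured per sound source direction
print(binaural_render(mono, hrir_l, hrir_r).shape)  # (2, 48255)
```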
  • An audio track is a track for recording audio data, and each audio track has one or more attribute parameters, and the attribute parameters include audio format, bit rate, dubbing language, sound effect, number of channels, volume and so on.
  • Tracks can be single or multi-track (or called mixed tracks).
  • a single audio track may correspond to one or more sounding objects, and a multi-audio track includes at least two single audio tracks.
  • a single-object track corresponds to one sounding object.
  • STFT: short-time Fourier transform.
  • an embodiment of the present invention provides a system architecture 100 .
  • the data collection device 160 is used to collect training data.
  • the training data includes: a multimedia file, where the multimedia file includes an original audio track, and the original audio track corresponds to at least one sounding object.
  • the training data is stored in the database 130 , and the training device 120 trains and obtains the target model/rule 101 based on the training data maintained in the database 130 .
  • Embodiment 1 will be used to describe how the training device 120 obtains the target model/rule 101 based on the training data in more detail below.
  • the target model/rule 101 can be used to implement the rendering method provided by this embodiment of the present application, wherein the target model/rule 101 There are multiple situations.
  • in one case of the target model/rule 101 (when the target model/rule 101 is the first model), the multimedia file is input into the target model/rule 101, and the first single-object audio track corresponding to the first sounding object can be obtained.
  • in another case, when the target model/rule 101 is the second model, the multimedia file is input into the target model/rule 101 after relevant preprocessing, and the first single-object audio track corresponding to the first sounding object can be obtained.
  • the target model/rule 101 in this embodiment of the present application may specifically include a separation network, and may further include a recognition network, a detection network, a depth estimation network, and the like, which are not specifically limited here.
  • the separation network is obtained by training training data.
  • the training data maintained in the database 130 may not necessarily all come from the collection of the data collection device 160, and may also be received and acquired from other devices.
  • the training device 120 may not necessarily train the target model/rule 101 completely based on the training data maintained by the database 130, and may also obtain training data from the cloud or other places for model training. The above description should not be used as a reference to this application Limitations of Examples.
  • the target model/rule 101 obtained by training according to the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 1; the execution device 110 may be a terminal such as a laptop, an augmented reality (AR)/virtual reality (VR) device, or an in-vehicle terminal, and may also be a server or the cloud.
  • the execution device 110 is configured with an I/O interface 112, which is used for data interaction with external devices.
  • the user can input data to the I/O interface 112 through the client device 140; in this embodiment of the present application, the input data may include a multimedia file, which may be input by the user, uploaded by the user through an audio device, or come from a database, which is not limited here.
  • the preprocessing module 113 is configured to perform preprocessing according to the multimedia file received by the I/O interface 112.
  • the preprocessing module 113 may be configured to perform short-time Fourier transform processing on the audio track in the multimedia file to obtain the time spectrum.
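  • as a generic illustration of this preprocessing step (not necessarily the exact parameters used by the preprocessing module 113), the time spectrum of an audio track can be obtained with an STFT as follows:

```python
import numpy as np
from scipy.signal import stft

fs = 48000
audio = np.random.randn(fs)     # one second of a placeholder audio track

# Short-time Fourier transform: the magnitude of Zxx over time is the
# amplitude spectrum ("time spectrum") that can be fed to the separation network.
freqs, times, Zxx = stft(audio, fs=fs, nperseg=1024, noverlap=512)
amplitude_spectrum = np.abs(Zxx)
print(amplitude_spectrum.shape)  # (frequency bins, time frames)
```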
  • when the execution device 110 preprocesses the input data, or when the calculation module 111 of the execution device 110 performs calculation and other related processing, the execution device 110 can call the data, code, etc. in the data storage system 150 for corresponding processing, and the data, instructions, etc. obtained by the corresponding processing may also be stored in the data storage system 150.
  • the I/O interface 112 returns the processing result, such as the first single-object audio track corresponding to the first sounding object obtained above, to the client device 140, so as to be provided to the user.
  • the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or tasks, and the corresponding target models/rules 101 can be used to achieve the above goals or complete The above task, thus providing the user with the desired result.
  • the user can manually specify input data, which can be operated through the interface provided by the I/O interface 112 .
  • the client device 140 can automatically send the input data to the I/O interface 112 . If the user's authorization is required to request the client device 140 to automatically send the input data, the user can set the corresponding permission in the client device 140 .
  • the user can view the result output by the execution device 110 on the client device 140, and the specific presentation form can be a specific manner such as display, sound, and action.
  • the client device 140 can also be used as a data collection terminal to collect the input data of the input I/O interface 112 and the output result of the output I/O interface 112 as new sample data as shown in the figure, and store them in the database 130 .
  • alternatively, the I/O interface 112 directly stores the input data input into the I/O interface 112 and the output result of the I/O interface 112, as shown in the figure, in the database 130 as new sample data.
  • FIG. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the positional relationship between the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • the data storage system 150 is an external memory relative to the execution device 110 , and in other cases, the data storage system 150 may also be placed in the execution device 110 .
  • the target model/rule 101 is obtained by training according to the training device 120, and the target model/rule 101 may be the separation network in this embodiment of the present application.
  • the separation network may be a convolutional neural network or a recurrent neural network.
  • CNN is a very common neural network
  • a convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture; a deep learning architecture refers to learning at multiple levels of abstraction through machine learning algorithms.
  • CNN is a feed-forward artificial neural network in which individual neurons can respond to images fed into it.
  • a convolutional neural network (CNN) 100 may include an input layer 110 , a convolutional/pooling layer 120 , where the pooling layer is optional, and a neural network layer 130 .
  • the convolutional/pooling layer 120 may include layers 121-126 as examples.
  • in one implementation, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, layer 124 is a pooling layer, layer 125 is a convolutional layer, and layer 126 is a pooling layer; in another implementation, layers 121 and 122 are convolutional layers, layer 123 is a pooling layer, layers 124 and 125 are convolutional layers, and layer 126 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the convolution layer 121 may include many convolution operators, which are also called kernels, and their role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • the convolution operator can essentially be a weight matrix, which is usually predefined. In the process of convolving an image, the weight matrix is usually moved along the horizontal direction on the input image one pixel after another (or two pixels after two pixels, and so on, depending on the value of the stride), so as to complete the work of extracting a specific feature from the image.
  • the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image.
  • the weight matrix extends to the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolutional output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same dimensions are applied.
  • the output of each weight matrix is stacked to form the depth dimension of the convolutional image.
  • different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and yet another weight matrix is used to blur unwanted noise in the image, and so on.
  • the dimensions of the multiple weight matrices are the same, and the dimension of the feature maps extracted from the weight matrices with the same dimensions are also the same, and then the multiple extracted feature maps with the same dimensions are combined to form the output of the convolution operation .
  • weight values in these weight matrices need to be obtained through a lot of training in practical applications, and each weight matrix formed by the weight values obtained through training can extract information from the input image, thereby helping the convolutional neural network 100 to make correct predictions.
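  • purely as an illustration of the sliding-window operation described above (single channel, no padding; real convolutional layers also stack the outputs of multiple kernels along the depth dimension), a convolution with a given stride can be sketched as:

```python
import numpy as np

def conv2d_single(image, kernel, stride=1):
    # Slide one trainable kernel (weight matrix) over the input image,
    # one stride step at a time, producing a single-depth feature map.
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.random.randn(8, 8)
kernel = np.random.randn(3, 3)   # one weight matrix / convolution kernel
print(conv2d_single(image, kernel, stride=2).shape)  # (3, 3)
```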
  • compared with the initial convolutional layer (for example, layer 121), the features extracted by the later convolutional layers become more and more complex, such as high-level semantic features.
  • each of the layers 121-126 exemplified by 120 in FIG. 2 can be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers.
  • the pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain a smaller size image.
  • the average pooling operator can calculate the average value of the pixel values in the image within a certain range.
  • the max pooling operator can take the pixel with the largest value within a specific range as the result of max pooling. Also, just as the size of the weight matrix used in the convolutional layer should be related to the size of the image, the operators in the pooling layer should also be related to the size of the image.
  • the size of the output image after processing by the pooling layer can be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
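  • for illustration, 2×2 max pooling and average pooling over a single-channel input can be sketched as follows:

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    # Downsample the input: each output pixel is the max (or average) of the
    # corresponding size x size sub-region of the input.
    h, w = x.shape[0] // size, x.shape[1] // size
    blocks = x[:h * size, :w * size].reshape(h, size, w, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(x, 2, "max"))   # [[ 5.  7.] [13. 15.]]
print(pool2d(x, 2, "avg"))   # [[ 2.5  4.5] [10.5 12.5]]
```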
  • after being processed by the convolutional layer/pooling layer 120, the convolutional neural network 100 is not yet sufficient to output the required output information, because, as mentioned before, the convolutional layer/pooling layer 120 only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (the required class information or other related information), the convolutional neural network 100 needs to use the neural network layer 130 to generate one output or a set of outputs of the required number of classes. Therefore, the neural network layer 130 may include multiple hidden layers (131, 132 to 13n as shown in FIG. 2) and an output layer 140, and the parameters contained in the multiple hidden layers may be pre-trained based on relevant training data of a specific task type; for example, the task type can include multi-track separation, image recognition, image classification, image super-resolution reconstruction, and so on.
  • after the multiple hidden layers in the neural network layer 130, that is, as the last layer of the entire convolutional neural network 100, comes the output layer 140; the output layer 140 has a loss function similar to categorical cross-entropy, which is specifically used to calculate the prediction error.
  • once the forward propagation of the entire convolutional neural network 100 (the propagation from 110 to 140 in FIG. 2) is completed, back propagation (the propagation from 140 to 110 in FIG. 2) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 100 and the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.
  • the convolutional neural network 100 shown in FIG. 2 is only used as an example of a convolutional neural network.
  • in a specific application, the convolutional neural network may also exist in the form of other network models, for example, a network in which the multiple convolutional layers/pooling layers shown in FIG. 3 are in parallel and the respectively extracted features are all input to the neural network layer 130 for processing.
  • FIG. 4 is a hardware structure of a chip according to an embodiment of the present invention, where the chip includes a neural network processor 40 .
  • the chip can be set in the execution device 110 as shown in FIG. 1 to complete the calculation work of the calculation module 111 .
  • the chip can also be set in the training device 120 as shown in FIG. 1 to complete the training work of the training device 120 and output the target model/rule 101 .
  • the algorithms of each layer in the convolutional neural network shown in Figure 2 can be implemented in the chip shown in Figure 4.
  • the neural network processor 40 may be a neural-network processing unit (NPU), a tensor processing unit (TPU), or a graphics processor (graphics processing unit, GPU), etc., all suitable for large-scale applications.
  • NPU neural-network processing unit
  • TPU tensor processing unit
  • GPU graphics processor
  • NPU is mounted on the main central processing unit (CPU) (host CPU) as a co-processor, and tasks are allocated by the main CPU.
  • the core part of the NPU is the operation circuit 403, and the controller 404 controls the operation circuit 403 to extract the data in the memory (weight memory or input memory) and perform operations.
  • the arithmetic circuit 403 includes multiple processing units (process engines, PEs). In some implementations, arithmetic circuit 403 is a two-dimensional systolic array. The arithmetic circuit 403 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 403 is a general-purpose matrix processor.
  • the operation circuit fetches the data corresponding to the matrix B from the weight memory 402 and buffers it on each PE in the operation circuit.
  • the arithmetic circuit fetches the data of matrix A and matrix B from the input memory 401 to perform matrix operation, and stores the partial result or final result of the obtained matrix in the accumulator 408 .
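  • as a simplified software analogy of this tiled multiply-and-accumulate behaviour (not a model of the actual hardware), a matrix multiplication accumulated tile by tile can be written as:

```python
import numpy as np

def tiled_matmul(A, B, tile=2):
    # Multiply A (input data) by B (weights) tile by tile; partial results are
    # accumulated into C, analogous to partial sums collected in an accumulator.
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for k in range(0, K, tile):
        C += A[:, k:k + tile] @ B[k:k + tile, :]   # accumulate partial product
    return C

A = np.random.randn(4, 6)
B = np.random.randn(6, 3)
print(np.allclose(tiled_matmul(A, B), A @ B))  # True
```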
  • the vector calculation unit 407 can further process the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, and the like.
  • the vector calculation unit 407 can be used for network calculation of non-convolutional/non-FC layers in the neural network, such as pooling (Pooling), batch normalization (Batch Normalization), local response normalization (Local Response Normalization), etc. .
  • the vector computation unit 407 can store the processed output vectors to the unified buffer 406 .
  • the vector calculation unit 407 may apply a nonlinear function to the output of the arithmetic circuit 403, such as a vector of accumulated values, to generate activation values.
  • vector computation unit 407 generates normalized values, merged values, or both.
  • the vector of processed outputs can be used as an activation input to the arithmetic circuit 403, such as for use in subsequent layers in a neural network.
  • Unified memory 406 is used to store input data and output data.
  • the storage unit access controller 405 (direct memory access controller, DMAC) directly transfers the input data in the external memory to the input memory 401 and/or the unified memory 406, stores the weight data in the external memory into the weight memory 402, and stores the data in the unified memory 406 into the external memory.
  • DMAC direct memory access controller
  • the bus interface unit (bus interface unit, BIU) 410 is used to realize the interaction between the main CPU, the DMAC and the instruction fetch memory 409 through the bus.
  • An instruction fetch buffer 409 connected to the controller 404 is used to store the instructions used by the controller 404.
  • the controller 404 is used for invoking the instructions cached in the memory 409 to control the working process of the operation accelerator.
  • the unified memory 406, the input memory 401, the weight memory 402 and the instruction fetch memory 409 are all on-chip (On-Chip) memories, and the external memory is the memory outside the NPU, and the external memory can be double data rate synchronous dynamic random access Memory (double data rate synchronous dynamic random access memory, DDR SDRAM), high bandwidth memory (high bandwidth memory, HBM) or other readable and writable memory.
  • DDR SDRAM double data rate synchronous dynamic random access Memory
  • HBM high bandwidth memory
  • each layer in the convolutional neural network shown in FIG. 2 or FIG. 3 may be performed by the operation circuit 403 or the vector calculation unit 407 .
  • the training method of the separation network is introduced in detail with reference to FIG. 5 .
  • the method shown in FIG. 5 can be performed by a training apparatus of the separation network, which can be a cloud service device or a terminal device (for example, a computer, a server, etc.) whose computing power is sufficient to perform the training of the separation network, and may also be a system composed of a cloud service device and a terminal device.
  • the training method may be performed by the training device 120 in FIG. 1 and the neural network processor 40 in FIG. 4 .
  • the training method may be processed by the CPU, or may be jointly processed by the CPU and the GPU, or other processors suitable for neural network computation may be used without using the GPU, which is not limited in this application.
  • a separation network can be used to separate the original audio track to obtain at least one single-object audio track.
  • if the original audio track in the multimedia file corresponds to only one sounding object, the original audio track is a single-object audio track, and the separation network does not need to be used for separation.
  • the training method may include steps 501 and 502 . Steps 501 and 502 are described in detail below.
  • Step 501 Acquire training data.
  • the training data in the embodiment of the present application is obtained by synthesizing at least an initial third single-object audio track and an initial fourth single-object audio track; it can also be understood that the training data includes a multi-audio track synthesized from the single-object audio tracks corresponding to at least two sounding objects.
  • the initial third single-object audio track corresponds to the third vocal object
  • the initial fourth single-object audio track corresponds to the fourth vocal object.
  • the training data may also include images matching the original audio track; the training data may also be a multimedia file, the multimedia file includes the above-mentioned multiple audio tracks, and the multimedia file may include a video track or a text track (or a bullet-screen track) in addition to the audio track, which is not specifically limited here.
  • the audio tracks (the original audio track, the first single-object audio track, etc.) in the embodiments of the present application may include vocal tracks, musical instrument tracks (e.g., drum tracks, piano tracks, trumpet tracks, etc.), audio tracks generated by sounding objects such as airplanes, and so on; the specific sounding object corresponding to an audio track is not limited here.
  • the training data may be obtained by directly recording the sound of the sounding object, or by the user inputting audio information and video information, or by receiving the transmission from the acquisition device.
  • the specific way of obtaining training data is not limited here.
  • Step 502 take the training data as the input of the separation network, train the separation network with the value of the loss function less than the first threshold as the target, and obtain the trained separation network.
  • the separation network in the embodiments of the present application may be called a separation neural network, a separation model, a separation neural network model, or the like, which is not specifically limited here.
  • the loss function is used to indicate the difference between the third single-object audio track obtained by separation and the initial third single-object audio track.
  • the separation network is trained with the goal of reducing the value of the loss function, that is, the difference between the third single-object audio track output by the separation network and the initial third single-object audio track is continuously reduced.
  • This training process can be understood as a separation task.
  • the loss function can be understood as the loss function corresponding to the separation task.
  • the output (at least one single-object audio track) of the separation network is a single-object audio track corresponding to at least one sound-emitting object in the input (audio track).
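  • a simplified training loop matching step 502 (train until the value of the loss function falls below the first threshold) might look like the PyTorch-style sketch below; the model architecture, loss function, and data here are placeholders and not the actual separation network of this application:

```python
import torch
import torch.nn as nn

# Placeholder separation model: maps a mixed track to one single-object track.
model = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=15, padding=7), nn.ReLU(),
    nn.Conv1d(16, 1, kernel_size=15, padding=7),
)
loss_fn = nn.MSELoss()   # difference between the separated and the initial track
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

mixture = torch.randn(8, 1, 8000)    # synthesized multi-object training tracks
target = torch.randn(8, 1, 8000)     # initial third single-object track

first_threshold = 1e-3
for step in range(200):
    separated = model(mixture)
    loss = loss_fn(separated, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if loss.item() < first_threshold:   # stop once the loss < first threshold
        break
```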
  • the third sound-emitting object belongs to the same type as the first sound-emitting object, and the second sound-emitting object belongs to the same type as the fourth sound-emitting object.
  • for example, the first vocal object and the third vocal object are both human voices, but the first vocal object may be user A, and the third vocal object may be user B.
  • the third single-object audio track and the first single-object audio track are audio tracks corresponding to sounds uttered by different people.
  • the third sound-emitting object and the first sound-emitting object in this embodiment of the present application may be two sound-emitting objects of the same type, or may be one sound-emitting object of the same type, which is not specifically limited here.
  • the training data input to the separation network includes the original audio tracks corresponding to at least two sounding objects, and the separation network can output a single-object audio track corresponding to a certain sounding object among the at least two sounding objects, and can also output the single-object audio tracks respectively corresponding to the at least two sounding objects.
  • for example, the multimedia file includes an audio track corresponding to a human voice, an audio track corresponding to a piano, and an audio track corresponding to a car sound; the separation network may output one single-object audio track (for example, the single-object audio track corresponding to the human voice), two single-object audio tracks (for example, the single-object audio track corresponding to the human voice and the single-object audio track corresponding to the car sound), or three single-object audio tracks.
  • one structure of the separation network is shown in FIG. 6, and the separation network includes one-dimensional convolutions and a residual structure, where adding a residual structure can improve the gradient transfer efficiency. Of course, the separation network may also include activation, pooling, and other layers.
  • the specific structure of the separation network is not limited here.
  • the separation network shown in Figure 6 takes the signal source (that is, the signal corresponding to the audio track in the multimedia file) as the input, transforms it through multiple convolutions and deconvolutions, and outputs the object signal (a single audio track corresponding to a sounding object) .
  • the time series correlation can also be improved by adding a recurrent neural network module, and the connection between high and low dimensional features can be improved by connecting different output layers.
  • the separation network is shown in Figure 7.
  • the signal source can be preprocessed, for example, STFT mapping is performed on the signal source to obtain the time spectrum.
  • the amplitude spectrum in the time spectrum is transformed through two-dimensional convolution and deconvolution to obtain a masked spectrum (the screened spectrum), and the masked spectrum and the amplitude spectrum are combined to obtain the target amplitude spectrum.
  • iSTFT inverse short-time Fourier transform
  • the connection between high and low-dimensional features can also be improved by connecting different output layers
  • the gradient transfer efficiency can be improved by adding a residual structure
  • the time series correlation can be improved by adding a recurrent neural network module.
  • the input in FIG. 6 can also be understood as a one-dimensional time domain signal, and the input in FIG. 7 is a two-dimensional time spectrum signal.
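  • the flow of the second structure (STFT, mask estimation on the amplitude spectrum, masked spectrum, then iSTFT) can be summarised with the following sketch, where mask_network stands in for the trained two-dimensional convolution/deconvolution stack:

```python
import numpy as np
from scipy.signal import stft, istft

def separate_with_mask(mixture, mask_network, fs=48000, nperseg=1024):
    # 1. STFT of the signal source gives the (complex) time spectrum.
    _, _, Zxx = stft(mixture, fs=fs, nperseg=nperseg)
    amplitude, phase = np.abs(Zxx), np.angle(Zxx)

    # 2. The network predicts a mask over the amplitude spectrum.
    mask = mask_network(amplitude)

    # 3. The masked (screened) spectrum combined with the phase gives the
    #    target spectrum of the object signal.
    target_spectrum = mask * amplitude * np.exp(1j * phase)

    # 4. Inverse STFT converts the target spectrum back to a time-domain
    #    single-object track.
    _, object_track = istft(target_spectrum, fs=fs, nperseg=nperseg)
    return object_track

# Placeholder "network": keep only the strongest time-frequency bins.
dummy_mask_net = lambda amp: (amp > np.median(amp)).astype(float)
mixture = np.random.randn(48000)
print(separate_with_mask(mixture, dummy_mask_net).shape)
```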
  • the above two separation models are just examples, and in practical applications, there are other possible structures.
  • the input of the separation model can be a time-domain signal
  • the output can be a time-domain signal
  • the input of the separation model can be a time-frequency domain signal
  • the output is a time-frequency domain signal, etc.
  • the structure, input, or output of the separation model is not specifically limited here.
  • the multi-track in the multimedia file can also be identified through the recognition network, that is, the number of audio tracks and the object categories (for example, vocals, drum sounds, etc.) included in the multi-track are recognized, which can reduce the training time of the separation network.
  • the separation network may also include an identification sub-network for identifying multiple audio tracks, which is not specifically limited here.
  • the input of the recognition network can be a time domain signal
  • the output is a class probability. It is equivalent to inputting the time domain signal into the recognition network, obtaining the probability that the object is a certain category, and selecting the category whose probability exceeds the threshold as the category of classification.
  • the object here can also be understood as a sounding object.
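  • for example, selecting the categories whose probability exceeds a threshold could be sketched as follows (the labels and probabilities are made up for illustration):

```python
import numpy as np

def classify_objects(probabilities, labels, threshold=0.5):
    # Keep every sounding-object category whose predicted probability
    # exceeds the threshold.
    return [label for label, p in zip(labels, probabilities) if p > threshold]

labels = ["vocals", "drums", "piano", "car"]
probabilities = np.array([0.92, 0.10, 0.61, 0.05])   # output of the recognition network
print(classify_objects(probabilities, labels))        # ['vocals', 'piano']
```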
  • the input in the above-mentioned identification network is a multimedia file synthesized by audios corresponding to vehicle A and vehicle B, and the multimedia file is input into the identification network, and the identification network can output the category of vehicle.
  • the recognition network can also identify the type of specific car, which is equivalent to further fine-grained recognition.
  • the identification network is set according to actual needs, which is not limited here.
  • the training process may not adopt the aforementioned training method but adopt other training methods, which is not limited here.
  • the system architecture includes input module, function module, database module and output module. Each module is described in detail below:
  • the input module includes a database option sub-module, a sensor information acquisition sub-module, a user interface input sub-module and a file input sub-module.
  • the above four sub-modules can also be understood as four ways of input.
  • the database options submodule is used for spatial rendering according to the rendering method stored in the database selected by the user.
  • the sensor information acquisition sub-module is used to select the spatial position of a specific sound-emitting object according to the sensor (which may be a sensor in the rendering device, or another sensor device, which is not limited here).
  • the user interface input sub-module is used to determine the spatial position of the specific sounding object in response to the user's operation on the user interface.
  • the user can control the spatial position of the specific sounding object by clicking, dragging and so on.
  • the file input sub-module is used to track the specific sounding object according to the image information or text information (for example: lyrics, subtitles, etc.), and then determine the spatial position of the specific sounding object according to the tracked position of the specific sounding object.
  • the functional modules include a signal transmission submodule, an object identification submodule, a calibration submodule, an object tracking submodule, an orientation calculation submodule, an object separation submodule, and a rendering submodule.
  • the signal transmission sub-module is used for receiving and sending information. Specifically, it may receive input information from the input module, and output feedback information to other modules.
  • the feedback information includes information such as position transformation information of a specific sounding object, a separated single-object audio track, and the like.
  • the signal transmission submodule may also be used to feed back the identified object information to the user through a user interface (user interface, UI), etc., which is not specifically limited here.
  • the object recognition sub-module is used to identify all the object information of the multi-track information sent by the input module to the signal transmission sub-module.
  • the object here refers to the sounding object (or called the sounding object), such as human voice, drum sound, and aircraft sound. Wait.
  • the object identification sub-module may be the identification network described in the embodiment shown in FIG. 5 or the identification sub-network in the separation network.
  • the calibration sub-module is used to calibrate the initial state of the playback device.
  • the calibration sub-module is used for earphone calibration
  • the calibration sub-module is used for external device calibration.
  • the initial state of the sensor (the relationship between the sensor device and the playback device will be described in FIG. 9) can be defaulted to be the front, and subsequent corrections are made relative to the front. It is also possible to obtain the real position of the sensor on the user, to ensure that the front of the sound image is directly in front of the earphone.
  • the object tracking sub-module is used to track the motion trajectory of a specific sounding object.
  • the specific sounding object may be a sounding object in the text or image displayed in a multimodal file (for example, audio information and the video information corresponding to the audio information, or audio information and the text information corresponding to the audio information, etc.).
  • the object tracking sub-module may also be used to render motion trajectories on the audio side.
  • the object tracking sub-module may further include a target recognition network and a depth estimation network; the target recognition network is used to identify the specific vocal object to be tracked, and the depth estimation network is used to obtain the relative coordinates of the specific vocal object in the image (this will be described in detail in subsequent embodiments), so that the object tracking sub-module renders the azimuth and motion trajectory of the audio corresponding to the specific sounding object according to the relative coordinates.
  • the orientation calculation sub-module is used to convert the information obtained by the input module (for example: sensor information, input information of the UI interface, file information, etc.) into orientation information (also referred to as sound source position).
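  • as an illustration of converting orientation information (azimuth, tilt/elevation, and distance) into a sound source position, a standard spherical-to-Cartesian conversion could be used; the coordinate conventions below are assumptions for illustration and not necessarily those of the orientation calculation sub-module:

```python
import numpy as np

def orientation_to_position(azimuth_deg, tilt_deg, distance):
    # Convert azimuth (horizontal angle), tilt (elevation angle) and distance
    # into Cartesian coordinates relative to the listener/calibrator.
    az, el = np.radians(azimuth_deg), np.radians(tilt_deg)
    x = distance * np.cos(el) * np.cos(az)
    y = distance * np.cos(el) * np.sin(az)
    z = distance * np.sin(el)
    return np.array([x, y, z])

print(orientation_to_position(azimuth_deg=30.0, tilt_deg=10.0, distance=2.0))
```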
  • the object separation sub-module is used to separate the multimedia file (or called multimedia information) or multi-track information into at least one single-object track. For example: extracting individual vocal tracks (i.e. vocal-only audio files) from a song.
  • the object separation sub-module may be the separation network in the embodiment shown in FIG. 5 above. Further, the structure of the object separation sub-module may be as shown in FIG. 6 or FIG. 7 , which is not specifically limited here.
  • the rendering submodule is used to obtain the sound source position obtained by the bearing calculation submodule, and perform spatial rendering on the sound source position. Further, the corresponding rendering method may be determined according to the playback device selected by the input information of the UI in the input module. Different playback devices have different rendering methods, and the rendering process will be described in detail in subsequent embodiments.
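  • as a minimal sketch of per-channel spatial rendering for an external playback device (an assumption for illustration, not the rendering rule of this application), each playback channel can be obtained by applying a time-varying gain to the single-object audio track:

```python
import numpy as np

num_channels = 4          # e.g. four external playback devices
num_samples = 48000       # one second at 48 kHz

object_track = np.random.randn(num_samples)          # single-object track o_s(t)
gains = np.random.rand(num_channels, num_samples)    # per-channel gains, e.g. g_{i,s}(t)

# Channel i of the rendered output is the single-object track scaled by gain row i.
rendered = gains * object_track[np.newaxis, :]
print(rendered.shape)     # (4, 48000)
```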
  • the database module includes a database selection submodule, a rendering rule editing submodule, and a rendering rule sharing submodule.
  • the database selection submodule is used to store rendering rules.
  • the rendering rule may be a rendering rule for converting a default two-channel/multi-channel audio track into a three-dimensional (three dimensional, 3D) spatial sense when the system is initialized, or a rendering rule saved by the user.
  • different objects may correspond to the same rendering rule, or different objects may correspond to different rendering rules.
  • the rendering rule editing submodule is used to re-edit the saved rendering rules.
  • the saved rendering rule may be a rendering rule stored in the database selection sub-module, or may be a newly input rendering rule, which is not specifically limited here.
  • the rendering rules sharing submodule is used to upload rendering rules to the cloud, and/or to download specific rendering rules from the rendering rules database in the cloud.
  • the rendering rule sharing module can upload user-defined rendering rules to the cloud and share them with other users. Users can select the rendering rules shared by other users that match the multi-track information to be played from the rendering rules database stored in the cloud, and download them to the database on the terminal side as the data files for the audio 3D rendering rules.
  • the output module is used to play the rendered single-object audio track or the target audio track (obtained from the original audio track and the rendered single-object audio track) through the playback device.
  • the application scenario includes a control device 901 , a sensor device 902 and a playback device 903 .
  • the playback device 903 in this embodiment of the present application may be an external device, an earphone (such as an in-ear earphone or a headphone), or a large screen (such as a projection screen), etc., which is not specifically limited here.
  • the connection between the control device 901 and the sensor device 902, and between the sensor device 902 and the playback device 903, may be a wired connection, wireless fidelity (WIFI), a mobile data network, or another connection method, which is not specifically limited here.
  • WIFI wireless fidelity
  • the control device 901 in this embodiment of the present application is a terminal device used to serve users, and the terminal device may include a head mount display (HMD), which may be a combination of a virtual reality (VR) box and a terminal, a VR all-in-one machine, a personal computer (PC) VR, an augmented reality (AR) device, a mixed reality (MR) device, or the like.
  • the terminal device may also include a cellular phone, a smart phone, a personal digital assistant (PDA), a tablet computer, a laptop computer, a personal computer (PC), a vehicle-mounted terminal, etc.
  • the sensor device 902 in this embodiment of the present application is a device for sensing orientation and/or position, and may be a laser pointer, a mobile phone, a smart watch, a smart bracelet, a device with an inertial measurement unit (IMU), a device with a simultaneous localization and mapping (SLAM) sensor, etc., which is not limited here.
  • IMU inertial measurement unit
  • the playback device 903 in this embodiment of the present application is a device used to play audio or video, and may be an external playback device (for example, a sound box or a terminal device with the function of playing audio or video) or an internal playback device (for example, in-ear headphones, headsets, AR devices, VR devices, etc.), which is not limited here.
  • each device in the application scenario shown in FIG. 9 may be one or more, for example, there may be multiple external devices, and the number of each device is not specifically limited here.
  • control device, sensor device, and playback device in the embodiments of the present application may be three devices, two devices, or one device, which is not specifically limited here.
  • control device and the sensor device in the application scenario shown in FIG. 9 are the same device.
  • the control device and the sensor device are the same mobile phone, and the playback device is the headset.
  • the control device and the sensor device are the same mobile phone, and the playback device is an external device (also called an external device system, and the external device system includes one or more external devices).
  • control device and the playback device in the application scenario shown in FIG. 9 are the same device.
  • the control device and the playback device are the same computer.
  • the control device and the playback device are the same large screen.
  • control device, the sensor device, and the playback device in the application scenario shown in FIG. 9 are the same device.
  • the control device, sensor device, and playback device are the same tablet.
  • an embodiment of the rendering method provided by the embodiment of the present application may be executed by a rendering device, or may be executed by a component of the rendering device (for example, a processor, a chip, or a chip system, etc.).
  • the embodiment includes: Steps 1001 to 1004.
  • the rendering device may have the function of the control device, the function of the sensor device, and/or the function of the playback device as shown in FIG. 9 , which is not specifically limited here.
  • the rendering method is described below by taking the rendering device as a control device (such as a notebook), the sensor device as a device with an IMU (such as a mobile phone), and the playback device as an external device (such as a speaker).
  • the sensor described in the embodiments of the present application may refer to a sensor in a rendering device, or may refer to a sensor in a device other than the rendering device (eg, the aforementioned sensor device), which is not specifically limited here.
  • Step 1001 calibrating the playback device. This step is optional.
  • the playback device may be calibrated before the playback device plays the rendered audio track, and the purpose of the calibration is to improve the authenticity of the spatial effect of the rendered audio track.
  • a method for calibrating a playback device includes steps 1 to 5 .
  • the mobile phone held by the user establishes a connection with the external playback device.
  • the connection method is similar to the connection between the sensor device and the playback device in the embodiment shown in FIG. 9 , and details are not described here.
  • Step 1 Determine the playback device type.
  • the rendering device may determine the playback device type through a user's operation, may adaptively detect the playback device type, may determine the playback device type through default settings, or may determine the playback device type in other ways, which is not specifically limited here.
  • the rendering device may display an interface as shown in FIG. 12 , where the interface includes an icon for selecting a playback device type.
  • the interface may also include an icon for selecting an input file, an icon for selecting a rendering method (ie, reference information option), a calibration icon, a sound hunter icon, an object bar, volume, time progress, and a sphere view (or 3D view).
  • the user can click on the “Select Playback Device Type Icon” 101 .
  • the rendering device displays a drop-down menu, and the drop-down menu may include "external device options" and "headphone options". Further, the user can click the "external playback device option" 102 to determine that the type of the playback device is an external playback device. As shown in FIG. 15, in the interface displayed by the rendering device, "select the playback device type" may be replaced by "external playback device" to prompt the user that the current playback device type is the external playback device. It can also be understood that the rendering device displays the interface shown in FIG. 12, the rendering device receives the fifth operation of the user (ie, the click operation shown in FIG. 13 and FIG. 14), and, in response to the fifth operation, the rendering device selects the external playback device as the playback device type from the playback device type options.
  • this method is for calibrating the playback device. As shown in FIG. 16, the user can also click the "calibration icon" 103; as shown in FIG. 17, the rendering device responds to the click operation and displays a drop-down menu, which may include "default options" and "manual calibration options". Further, the user can click the "manual calibration option" 104, and then determine that the calibration method is manual calibration, where manual calibration can be understood as the user calibrating the playback device using a mobile phone (ie, a sensor device).
  • the drop-down menu of "Select Playback Device Type Icon” includes "External Device Options" and "Headphone Options" as an example.
  • the drop-down menu may also include options for specific types of headphones, such as headsets, earbuds, in-ear earphones, wired earphones, and Bluetooth earphones, which are not limited here.
  • the pull-down menu of “calibration icon” includes “default options” and “manual calibration options” as an example. In practical applications, the pull-down menu may also include other types of options, which are not limited here.
  • Step 2. Determine the test audio.
  • the test audio in this embodiment of the present application may be a test signal set by default (for example, pink noise), may be the single-object audio track corresponding to the human voice separated from a song (that is, the multimedia file is a song) through the separation network in the above-mentioned embodiment shown in FIG. 5, may be audio corresponding to another single-object audio track in the song, or may be audio including only a single-object audio track, etc., which is not specifically limited here.
  • the user may click the "select input file icon" on the interface displayed by the rendering device to select the test audio.
  • Step 3 Obtain the attitude angle of the mobile phone and the distance between the sensor and the external device.
  • the external playback devices play the test audio in sequence, and the user holds the sensor device (eg, a mobile phone) and points it at the external playback device that is currently playing the test audio.
  • After the mobile phone is stabilized, the current orientation of the mobile phone and the received signal energy of the test audio are recorded, and the distance between the mobile phone and the external playback device is calculated according to Formula 1 below.
  • The mobile phone being stable can be understood as the variance of the mobile phone's orientation being less than a threshold (for example, 5 degrees) within a period of time (for example, 200 milliseconds), as illustrated by the sketch below.
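  • As a non-limiting illustration of this stability check, the following Python sketch treats the phone as stable once the variance of its recent orientation samples stays below the threshold; the sampling rate and the `orientation_samples` format are assumptions, while the 200 ms window and 5-degree threshold follow the example values above.

```python
import numpy as np

def is_phone_stable(orientation_samples, window_ms=200, sample_rate_hz=100, threshold_deg=5.0):
    """Return True if the phone orientation has been stable for the last window.

    orientation_samples: list of (azimuth, tilt) tuples in degrees, newest last.
    The 200 ms window and the 5-degree threshold follow the example values in
    the text; the sampling rate and data layout are illustrative assumptions.
    """
    window = int(window_ms / 1000 * sample_rate_hz)
    if len(orientation_samples) < window:
        return False
    recent = np.asarray(orientation_samples[-window:], dtype=float)
    # Stable if the variance of every attitude component is below the threshold.
    return bool(np.all(np.var(recent, axis=0) < threshold_deg))
```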
  • the first external playback device plays the test audio first, and the user holds the mobile phone and points to the first external playback device. After the calibration of the first external device is completed, the user holds the mobile phone and points to the second external device for calibration.
  • the orientation of the mobile phone in the embodiments of the present application may refer to the attitude angle of the mobile phone, and the attitude angle may include an azimuth angle and a tilt angle (also called an inclination angle), or the attitude angle may include an azimuth angle, a tilt angle, and a pitch angle.
  • the azimuth angle represents the angle around the z-axis
  • the tilt angle represents the angle around the y-axis
  • the pitch angle represents the angle around the x-axis.
  • the relationship between the orientation of the mobile phone and the x-axis, y-axis, and z-axis can be shown in Figure 18.
  • for example, if the playback devices are two external playback devices, the first external playback device plays the test audio first, and the user holds the mobile phone and points it at the first external playback device, recording the current orientation of the mobile phone and the signal energy of the received test audio.
  • the second external playback device plays the test audio, and the user holds the mobile phone and points to the second speaker to record the current orientation of the mobile phone and the signal energy of the test audio received.
  • the rendering device can display the interface shown in FIG. 19, where the right side of the interface is a spherical view, and the external playback devices that have been calibrated and the external playback device that is being calibrated can be displayed in the spherical view.
  • an uncalibrated external device (not shown in the figure) may also be displayed, which is not specifically limited here.
  • the center of the spherical view is the position of the user (it can also be understood as the position where the user holds the mobile phone, since the position of the mobile phone is approximately the position of the user), and the radius may be the distance between the user's position (or the mobile phone's position) and the external playback device, or may be a default value (for example, 1 meter), etc., which is not specifically limited here.
  • FIG. 20 shows an example effect diagram of a user holding a mobile phone and pointing it toward an external playback device.
  • the number of external playback devices is N, and N is a positive integer.
  • the i-th external playback device refers to a certain external playback device among the N external playback devices, where i is a positive integer and i ≤ N.
  • the formulas in the embodiments of the present application are all calculated by taking the i-th external device as an example, and the calculation of other external devices is similar to the calculation of the i-th external device.
  • Formula 1 used to calibrate the i-th external device can be described as follows:
  • x(t) is the energy of the test signal received by the mobile phone at time t
  • X(t) is the energy of the test signal played by the external device at time t
  • t is a positive number
  • ri is the distance between the mobile phone and the i-th external device (since the user holds the mobile phone, it can also be understood as the distance between the user and the i-th external device)
  • rs is the normalized distance, which can be understood as a coefficient used to convert the ratio of x(t) to X(t) into a distance.
  • the coefficient can be set according to the actual external device.
  • the specific value of rs is not limited here.
  • the test signals are played in sequence, the user points the mobile phone toward each external playback device in turn, and the distance to each device is obtained by Formula 1, as sketched below.
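  • Because the exact expression of Formula 1 is not reproduced above, the sketch below only illustrates one plausible reading of it: the played and received test-signal energies are compared, and the normalized distance rs scales the resulting ratio into a distance. The inverse-square propagation model and the function name are assumptions for illustration.

```python
import numpy as np

def estimate_distance(received_energy, played_energy, r_s=1.0):
    """Estimate the distance r_i between the phone and the i-th external playback device.

    received_energy: x(t), energy of the test signal measured at the phone.
    played_energy:   X(t), energy of the test signal emitted by the device.
    r_s:             normalized distance coefficient converting the energy
                     ratio into a distance (device dependent, as stated above).

    The inverse-square model used here is an assumption; the text only states
    that the ratio of x(t) to X(t) is converted into a distance by r_s.
    """
    ratio = max(received_energy, 1e-12) / max(played_energy, 1e-12)
    return r_s / np.sqrt(ratio)
```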
  • Step 4 Determine the position information of the external device based on the attitude angle and the distance.
  • in step 3, the mobile phone has recorded its attitude angle when pointing toward each external playback device and has calculated the distance between the mobile phone and each external playback device by Formula 1.
  • the mobile phone can also send the measured orientation and signal energy to the rendering device, and the rendering device calculates the distance between the mobile phone and each external playback device through Formula 1, which is not specifically limited here.
  • After the rendering device obtains the attitude angle of the mobile phone and the distance between the mobile phone and the external playback device, the attitude angle and the distance can be converted by Formula 2 into the position information of the external playback device in the spherical coordinate system, where the position information includes the azimuth angle, the tilt angle, and the distance (ie, the distance between the sensor device and the playback device).
  • θ(t) is the azimuth angle of the i-th external device in the spherical coordinate system at time t
  • δ(t) is the inclination angle of the i-th external device in the spherical coordinate system at time t
  • d(t) is the distance between the mobile phone and the i-th external device
  • φ(t)[0] is the azimuth angle of the mobile phone at time t (that is, the rotation angle of the mobile phone around the z-axis)
  • φ(t)[1] is the pitch angle of the mobile phone at time t (that is, the rotation angle of the mobile phone around the x-axis)
  • ri is the distance calculated by Formula 1; sign represents the positive or negative value: if φ(t)[1] is positive, sign is positive, and if φ(t)[1] is negative, sign is negative; %360 is used to adjust the angle range to 0°-360°.
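  • Since Formula 2 itself is not reproduced above, the following sketch only illustrates the mapping it describes: the phone azimuth is wrapped into the 0°-360° range, the phone pitch (with its sign preserved) gives the inclination, and the distance from Formula 1 completes the spherical position. The exact sign and wrapping conventions are assumptions.

```python
def attitude_to_speaker_position(phi, r_i):
    """Convert the phone attitude and the calibrated distance into the i-th
    external device's spherical position.

    phi: (azimuth, pitch) of the phone in degrees.
    r_i: distance obtained from Formula 1.

    Assumed reading of Formula 2: the device azimuth is the phone azimuth
    wrapped into 0-360 degrees, the device inclination keeps the sign of the
    phone pitch, and the distance is r_i.
    """
    theta = phi[0] % 360.0   # azimuth adjusted to the 0-360 degree range
    delta = phi[1]           # inclination; keeps the sign of the pitch angle
    return theta, delta, r_i
```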
  • the rendering device may display an interface as shown in FIG. 21 , the interface displays a “calibrated icon”, and the position of the calibrated playback device may be displayed in the right spherical view.
  • the problem of calibrating irregularly placed external playback devices can be solved, allowing the user to obtain the spatial positioning of each external playback device in subsequent operations, so as to accurately render the position required for the single-object audio track and improve the authenticity of the spatial effect of the rendered audio track.
  • Step 1002 Acquire a first single-object audio track based on the multimedia file.
  • the rendering device may obtain the multimedia file by directly recording the sound of the first sound-emitting object, or the multimedia file may be sent by another device, for example, received from a collection device (for example, a camera, a recorder, a mobile phone, etc.).
  • the specific obtaining methods of multimedia files are not limited here.
  • the multimedia file in this embodiment of the present application may specifically be audio information, such as stereo audio information, multi-channel audio information, and the like.
  • the multimedia file may also be multi-modal information, for example, the multi-modal information is video information, image information corresponding to audio information, text information, and the like.
  • the multimedia file may include, in addition to the audio track, a video track or a text track (or called a bullet screen track), etc., which is not specifically limited here.
  • the multimedia file may include a first single-object audio track, or an original audio track, where the original audio track is synthesized by at least two single-object audio tracks, which is not specifically limited here.
  • the original audio track may be a single audio track or multiple audio tracks, which is not specifically limited here.
  • the original audio track can include vocal tracks, musical instrument tracks (for example, drum tracks, piano tracks, trumpet tracks, etc.), airplane sounds, and other audio tracks generated by sound-emitting objects (also called sounding objects).
  • the specific type of the sounding object is not limited here.
  • depending on the audio track in the multimedia file, the processing method of this step may differ, as described below:
  • the audio track in the multimedia file is a single-object audio track.
  • the rendering device can directly acquire the first single-object audio track from the multimedia file.
  • the audio track in the multimedia file is a multi-object audio track.
  • the original audio track in the multimedia file corresponds to multiple sounding objects.
  • the original audio track also corresponds to the second sound-emitting object. That is, the original audio track is obtained by synthesizing at least the first single-object audio track and the second single-object audio track, where the first single-object audio track corresponds to the first sounding object and the second single-object audio track corresponds to the second sounding object.
  • the rendering device can separate the first single-object audio track from the original audio track, or can separate both the first single-object audio track and the second single-object audio track from the original audio track, which is not specifically limited here.
  • the rendering device may separate the first single-object audio track from the original audio track through the separation network in the aforementioned embodiment shown in FIG. 5 .
  • the rendering device can also separate the first single-object audio track and the second single-object audio track from the original audio track through the separation network, which is not limited here. Reference may be made to the description in the embodiment shown in FIG. 5 , which will not be repeated here.
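  • As a rough illustration of this separation step (the actual separation network is the one described with reference to FIG. 5 and is not reproduced here), the sketch below shows how a generic pretrained source-separation model could be applied to an original audio track to obtain single-object audio tracks; `model` and the returned stem names are hypothetical placeholders.

```python
import soundfile as sf

def separate_single_object_tracks(original_track_path, model):
    """Split an original audio track into single-object audio tracks.

    `model` stands in for the separation network of this application; its
    `separate` method returning {object_name: waveform} is a hypothetical
    interface used only for illustration.
    """
    mixture, sample_rate = sf.read(original_track_path)
    stems = model.separate(mixture, sample_rate)   # e.g. {"vocals": ..., "piano": ...}
    return stems, sample_rate

# Usage sketch: pick the vocal stem as the first single-object audio track.
# stems, sr = separate_single_object_tracks("Dream it possible.wav", model)
# first_single_object_track = stems["vocals"]
```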
  • the rendering device can identify the sound-emitting object of the original audio track in the multimedia file through the identification network or the separation network.
  • the sound-emitting object contained in the original sound track includes the first sound-emitting object and the second sound-emitting object.
  • The rendering device may randomly select one of the sound-emitting objects as the first sound-emitting object, or may determine the first sound-emitting object through the user's selection.
  • After the rendering device determines the first sound-emitting object, the first single-object audio track can be acquired through the separation network.
  • the rendering device may first obtain the sound-emitting object through the identification network, and then obtain the single-object audio track of the sound-emitting object through the separation network.
  • the sounding object included in the multimedia file and the single-object audio track corresponding to the sounding object may also be obtained directly through the identification network and/or the separation network, which is not specifically limited here.
  • the rendering device may display the interface shown in FIG. 21 or the interface shown in FIG. 22 .
  • the user can select a multimedia file by clicking the "select input file icon" 105, and the multimedia file here is "Dream it possible.wav” as an example.
  • the rendering device receives the fourth operation of the user, and in response to the fourth operation, the rendering device selects "Dream it possible.wav" (that is, the target file) as the multimedia file from at least one multimedia file stored in the storage area.
  • the storage area may be a storage area in the rendering device, or may be a storage area in an external device (such as a U disk, etc.), which is not specifically limited here.
  • the rendering device can display the interface as shown in Figure 23. In this interface, "select input file” can be replaced by "Dream it possible.wav” to prompt the user that the current multimedia file is: Dream it possible.wav.
  • the rendering device may use the identification network and/or the separation network in the embodiment shown in FIG. 4 to identify the sounding objects in "Dream it possible.wav” and separate the single-object audio track corresponding to each sounding object.
  • the rendering device recognizes that the sound-emitting objects included in "Dream it possible.wav” are people, pianos, violins, and guitars.
  • the interface displayed by the rendering device may also include an object bar, and icons such as the "voice icon", "piano icon", "violin icon", and "guitar icon" may be displayed in the object bar for the user to select the sounding object to be rendered.
  • a “coupling icon” may also be displayed in the object bar, and the user can stop the selection of the sounding object by clicking the "coupling icon”.
  • the user can click on the “voice icon” 106 to determine that the audio track to be rendered is a single-object audio track corresponding to a human voice.
  • It can also be understood that after the rendering device recognizes "Dream it possible.wav", the rendering device displays the interface shown in FIG. 24; the rendering device receives the user's click operation and, in response to the click operation, selects the first icon (ie, the "voice icon" 106) from the interface, thereby determining that the first single-object audio track is the one corresponding to the human voice.
  • the playback device types shown in Figures 22 to 24 are only examples of external playback devices.
  • in practical applications, the user can also select headphones as the playback device type; here, only the case in which the playback device type selected by the user during calibration is an external playback device is taken as an example for schematic illustration.
  • the rendering device can also copy one or several single-object audio tracks in the original audio tracks.
  • for example, the user can copy the "voice icon" in the object bar to obtain a "voice 2 icon"; the single-object audio track corresponding to vocal 2 is the same as the single-object audio track corresponding to the vocal.
  • the method of copying may be that the user double-clicks the "voice icon”, or double-clicks the human voice on the ball view, which is not specifically limited here.
  • after the user copies and obtains the "voice 2 icon", by default the user stops controlling the vocal and starts controlling vocal 2.
  • the position of the first sound source of the vocal can also be displayed in the spherical view.
  • the user can not only copy the vocal object, but also delete the vocal object.
  • Step 1003 Determine the position of the first sound source of the first sound-emitting object based on the reference information.
  • the sound source position of one sound-emitting object may be determined based on the reference information, or the sound source positions corresponding to multiple sound-emitting objects may be determined, which is not specifically limited here.
  • After the rendering device determines that the first sounding object is a human voice, the rendering device can display the interface shown in FIG. 26, and the user can click the "select rendering mode icon" 107 to select reference information, which is used to determine the position of the first sound source of the first sound-emitting object.
  • the rendering device may display a pull-down menu, and the pull-down menu may include "automatic rendering options" and "interactive rendering options".
  • the "interactive rendering option” corresponds to the reference position information
  • the “automatic rendering option” corresponds to the media information.
  • the rendering mode includes an automatic rendering mode and an interactive rendering mode
  • the automatic rendering mode means that the rendering device automatically obtains the rendered first single-object audio track according to the media information in the multimedia file.
  • the interactive rendering method refers to obtaining the rendered first single-object audio track through the interaction between the user and the rendering device.
  • the rendering device can obtain the rendered first single-object audio track based on the preset mode; or, when the interactive rendering mode is determined, reference position information is obtained in response to the user's second operation, the first sound source position of the first sound-emitting object is determined based on the reference position information, and the first single-object audio track is rendered based on the first sound source position to obtain the rendered first single-object audio track.
  • obtaining the rendered first single-object audio track based on the preset method includes: acquiring media information of the multimedia file; determining the first sound source position of the first sound-emitting object based on the media information; and rendering the first single-object audio track based on the first sound source position to obtain the rendered first single-object audio track.
  • the sound source positions in the embodiments of the present application may be fixed positions at a certain moment, or may be multiple positions within a certain period of time (for example, a motion trajectory), which is not specifically limited here.
  • the reference information includes reference location information.
  • the reference position information in this embodiment of the present application is used to indicate the position of the sound source of the first sounding object, and the reference position information may be the first position information of the sensor device, or may be the second position information selected by the user, etc., which is not specifically limited here.
  • the reference location information in the embodiments of the present application has various situations, which are described below:
  • the reference position information is the first position information of the sensor device (hereinafter referred to as the sensor).
  • the user may click the “interactive rendering option” 108 to determine that the rendering mode is interactive rendering.
  • the rendering device may display a drop-down menu in response to the click operation, the drop-down menu may include "Orientation Control Options", “Position Control Options”, and "Interface Control Options".
  • the first location information in the embodiment of the present application has various situations, which are described below:
  • the first position information includes the first attitude angle of the sensor.
  • the user can adjust the orientation of the handheld sensor device (such as a mobile phone) through a second operation (such as panning up, down, left, and right) to determine the position of the first sound source of the first single-object track.
  • the rendering device can receive the first attitude angle of the mobile phone and use the following Formula 3 to obtain the first sound source position of the first single-object audio track, where the first sound source position includes the azimuth angle, the inclination angle, and the distance between the external playback device and the mobile phone.
  • the user can further determine the position of the second sound source of the second single-object audio track by adjusting the orientation of the handheld mobile phone.
  • the rendering device can receive the first attitude angle (including the azimuth angle and the inclination angle) of the mobile phone and use the following Formula 3 to obtain the second sound source position of the second single-object audio track, where the second sound source position includes the azimuth angle, the tilt angle, and the distance between the external playback device and the mobile phone.
  • the rendering device may send reminder information to the user, where the reminder information is used to remind the user to connect the mobile phone and the rendering device.
  • the mobile phone and the rendering device may also be the same mobile phone, and in this case, no reminder information needs to be sent.
  • θ(t) is the azimuth angle of the i-th external device in the spherical coordinate system at time t
  • δ(t) is the inclination angle of the i-th external device in the spherical coordinate system at time t
  • φ(t)[0] is the azimuth angle of the mobile phone at time t (that is, the rotation angle of the mobile phone around the z-axis)
  • φ(t)[1] is the tilt angle of the mobile phone at time t (that is, the rotation angle of the mobile phone around the y-axis)
  • d(t) is the distance between the mobile phone and the i-th external device at time t, which can be calculated by Formula 1 during calibration.
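  • A minimal sketch of the orientation-control mapping described by Formula 3, under the assumption that the sound source azimuth and tilt follow the phone's first attitude angle at each moment while the distance is the calibrated d(t); collecting the result per sample yields a motion trajectory for dynamic rendering. The wrap-around handling is illustrative.

```python
def record_orientation_trajectory(attitude_stream, d_t):
    """Record a sound source trajectory while the user 'points' with the phone.

    attitude_stream: iterable of (timestamp, azimuth, tilt) samples in degrees.
    d_t: distance to the i-th external playback device from calibration.

    Assumed reading of Formula 3: at each time t the sound source position is
    (azimuth % 360, tilt, d_t).
    """
    trajectory = []
    for t, azimuth, tilt in attitude_stream:
        trajectory.append((t, azimuth % 360.0, tilt, d_t))
    return trajectory
```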
  • the rendering device may display a pull-down menu, and the pull-down menu may include “orientation control options”, “position control options” and “interface control options”.
  • the user can click the "Orientation Control Option” 109, and then determine that the rendering mode is the orientation control in the interactive rendering.
  • the rendering device can display the interface as shown in Figure 29. In this interface, "select rendering mode" can be replaced by "orientation control rendering” to prompt the user that the current rendering mode is the orientation control. At this time, the user can adjust the orientation of the mobile phone.
  • the spherical view in the display interface of the rendering device can display a dotted line, which is used to indicate the current orientation of the mobile phone, so that the user can intuitively see the orientation of the mobile phone in the spherical view, thereby facilitating the user to determine the position of the first sound source.
  • After the orientation of the mobile phone is stable (refer to the above explanation about the mobile phone being stable, which is not repeated here), the current first attitude angle of the mobile phone is determined, and the first sound source position is then obtained through the above Formula 3.
  • the distance between the mobile phone and the external device obtained during calibration can be used as d(t) in the above formula 3.
  • the user determines the first sound source position of the first sound-emitting object based on the first attitude angle, or it is understood as the first sound source position of the first single-object audio track.
  • the rendering device may display an interface as shown in FIG. 31 , the spherical view of the interface includes the first position information of the sensor corresponding to the position of the first sound source (ie, the first position information of the mobile phone).
  • the above example describes an example of determining the position of the first sound source.
  • the user can also determine the position of the second sound source of the second single-object audio track.
  • the user may determine that the second sounding object is a violin by clicking on the “violin icon” 110 .
  • the rendering device monitors the attitude angle of the mobile phone, and determines the position of the second sound source by formula 3.
  • the spherical view in the display interface of the rendering device can display the currently determined first sound source position of the first sound-emitting object (person) and the second sound source position of the second sound-emitting object (violin).
  • the user can perform real-time or later dynamic rendering of the selected sound-emitting object through the orientation provided by the sensor (ie, the first attitude angle).
  • the sensor is like a laser pointer, and the direction of the laser is the position of the sound source.
  • the control can give specific spatial orientation and motion to the sounding object, realize the interactive creation between the user and the audio, and provide the user with a new experience.
  • the first position information includes the second attitude angle and acceleration of the sensor.
  • the user can control the position of the sensor device (eg, mobile phone) through the second operation, so as to determine the position of the first sound source.
  • the rendering device can receive the second attitude angle (including the azimuth angle, the inclination angle, and the pitch angle) and the acceleration of the mobile phone, and use the following Formulas 4 and 5 to obtain the position of the first sound source, where the first sound source position includes the azimuth angle, the tilt angle, and the distance between the external playback device and the mobile phone. That is, the second attitude angle and the acceleration of the mobile phone are first converted by Formula 4 into the coordinates of the mobile phone in the spatial rectangular coordinate system, and the coordinates of the mobile phone in the spatial rectangular coordinate system are then converted by Formula 5 into the coordinates of the mobile phone in the spherical coordinate system, that is, the position of the first sound source.
  • x(t), y(t), z(t) are the position information of the mobile phone in the space rectangular coordinate system at time t
  • g is the acceleration of gravity
  • a(t) is the acceleration of the mobile phone at time t
  • φ(t)[0] is the azimuth angle of the mobile phone at time t (that is, the rotation angle of the mobile phone around the z-axis)
  • φ(t)[1] is the pitch angle of the mobile phone at time t (that is, the rotation angle of the mobile phone around the x-axis)
  • φ(t)[2] is the tilt angle of the mobile phone at time t (that is, the rotation angle of the mobile phone around the y-axis)
  • θ(t) is the azimuth angle of the i-th external device at time t, δ(t) is the inclination angle of the i-th external device at time t, and d(t) is the distance between the i-th external device and the mobile phone at time t.
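  • Formulas 4 and 5 are not reproduced above, so the sketch below only illustrates the two stages they describe: removing gravity from the measured acceleration, rotating it with the second attitude angle, and integrating it into rectangular coordinates (a stand-in for Formula 4), then converting those coordinates into azimuth, tilt, and distance (a stand-in for Formula 5). The rotation order, gravity handling, and integration scheme are assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def rectangular_position(accel_samples, attitudes, dt, g=9.81):
    """Rough stand-in for Formula 4: integrate the phone acceleration.

    accel_samples: Nx3 array of accelerometer readings in the phone frame.
    attitudes:     Nx3 array of (azimuth, pitch, tilt) angles in degrees,
                   i.e. rotations around the z-, x-, and y-axes.
    dt:            sampling interval in seconds.
    The gravity removal and double integration are assumptions; the text only
    states that attitude and acceleration are converted into x, y, z.
    """
    velocity = np.zeros(3)
    position = np.zeros(3)
    for a, ang in zip(accel_samples, attitudes):
        world_a = R.from_euler("zxy", ang, degrees=True).apply(a)
        world_a[2] -= g                      # remove gravity from the vertical axis
        velocity += world_a * dt
        position += velocity * dt
    return position                           # (x, y, z)

def spherical_position(x, y, z):
    """Stand-in for Formula 5: rectangular to spherical coordinates."""
    d = np.sqrt(x * x + y * y + z * z)
    theta = np.degrees(np.arctan2(y, x)) % 360.0            # azimuth
    delta = np.degrees(np.arcsin(z / d)) if d > 0 else 0.0   # tilt
    return theta, delta, d
```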
  • the rendering device displays the interface shown in FIG. 27
  • the rendering device may display a drop-down menu that may include "Orientation Control Options," “Position Control Options,” and “Interface Control Options.”
  • instead of clicking the "orientation control option", the user can, as shown in FIG. 33, click the "position control option" 111 to determine that the rendering mode is the position control in the interactive rendering.
  • "select rendering mode" can be replaced by "position control rendering” to prompt the user that the current rendering mode is position control.
  • the user can adjust the position of the mobile phone, and the second attitude angle and the acceleration of the mobile phone are determined after the position of the mobile phone is stable (refer to the above explanation about the mobile phone position being stable, which is not repeated here). Further, the position of the first sound source is obtained through the above Formula 4 and Formula 5; in other words, the user determines the first sound source position of the first sound-emitting object (or, equivalently, the first sound source position of the first single-object audio track) based on the second attitude angle and the acceleration. Further, while the user is adjusting the mobile phone, or after the position of the mobile phone is stable, the rendering device can display an interface whose spherical view includes the first position information of the sensor corresponding to the position of the first sound source (that is, the first position information of the mobile phone).
  • the user can intuitively see the position of the mobile phone in the spherical view, thereby facilitating the user to determine the position of the first sound source. If the rendering device displays the first position information in the spherical view while the user is adjusting the position of the mobile phone, the displayed position can change in real time according to the position change of the mobile phone.
  • the above example describes an example of determining the position of the first sound source.
  • the user can also determine the position of the second sound source of the second single-object audio track, and the method of determining the second sound source position is similar to that of determining the first sound source position, which is not repeated here.
  • the above situations of the first location information are only examples; in practical applications, the first location information may also have other situations, which are not specifically limited here.
  • the reference location information is the second location information selected by the user.
  • the rendering device may provide a spherical view for the user to select the second position information, the center of the spherical view is the position of the user, and the radius of the spherical view is the distance between the user's position and the external device.
  • the rendering device acquires the second position information selected by the user in the spherical view, and converts the second position information into the position of the first sound source. It can also be understood that the rendering device obtains the second position information of a certain point selected by the user in the spherical view, and converts the second position information of the point into the position of the first sound source.
  • the second position information includes the two-dimensional coordinates of the point selected by the user on the tangent plane in the spherical view and the depth (ie, the distance between the tangent plane and the center of the sphere).
  • the rendering device displays the interface shown in FIG. 27 , after the user determines that the rendering mode is interactive rendering.
  • the rendering device may display a drop-down menu that may include "Orientation Control Options,” “Position Control Options,” and “Interface Control Options.”
  • instead of clicking the "orientation control option", the user can, as shown in FIG. 35, click the "interface control option" 112 to determine that the rendering mode is the interface control in the interactive rendering.
  • the rendering device can display the interface shown in Figure 36. In this interface, "select rendering mode" can be replaced by "interface control rendering” to prompt the user that the current rendering mode is interface control.
  • the second location information in the embodiments of the present application has various situations, which are described below:
  • the second position information is obtained according to the user's selection on the vertical section.
  • the rendering device obtains the two-dimensional coordinates of the point selected by the user on the vertical section and the distance between the vertical section where the point is located and the center of the sphere (hereinafter referred to as the depth), and uses the following Formula 6 to convert the two-dimensional coordinates and the depth into the position of the first sound source, where the position of the first sound source includes the azimuth angle, the inclination angle, and the distance between the external playback device and the mobile phone.
  • the right side of the interface of the rendering device can display the spherical view, the vertical slice, and the depth control bar.
  • the depth control bar is used to adjust the distance between the vertical section and the center of the sphere.
  • the user can click on a point (x, y) on the horizontal plane (shown as 114), and the corresponding spherical view in the upper right corner will display the position of the point in the spherical coordinate system.
  • the horizontal section is displayed by default; the user can click a meridian in the ball view (as shown by 113 in FIG. 37), and the interface will then display the vertical section shown in FIG. 37.
  • the user can also adjust the distance between the vertical section and the center of the sphere through a sliding operation (as shown by 115 in FIG. 37 ).
  • the second position information includes the two-dimensional coordinates (x, y) of the point and the depth r, and Formula 6 is used to obtain the position of the first sound source, as sketched below.
  • where x is the abscissa of the point selected by the user on the vertical section, y is the ordinate of the point selected by the user on the vertical section, r is the depth, θ is the azimuth angle of the i-th external device, δ is the inclination angle of the i-th external device, and d is the distance between the i-th external device and the mobile phone (it can also be understood as the distance between the i-th external device and the user).
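  • Since Formula 6 is not reproduced, the sketch below shows one plausible conversion of the point (x, y) selected on the vertical section and the depth r into an azimuth, tilt, and distance; the horizontal-section case of Formula 7 differs only in which screen coordinate drives the tilt angle. The geometric convention (origin at the section centre) is an assumption.

```python
import math

def vertical_section_to_position(x, y, r):
    """Plausible reading of Formula 6 (vertical section, interface control).

    x, y: coordinates of the selected point on the vertical section, with the
          origin assumed to be at the section centre.
    r:    depth, i.e. distance between the section and the sphere centre.
    Returns (azimuth, tilt, distance); the exact geometry is an assumption.
    """
    distance = math.sqrt(x * x + y * y + r * r)
    azimuth = math.degrees(math.atan2(x, r)) % 360.0        # left/right on the section
    tilt = math.degrees(math.asin(y / distance)) if distance > 0 else 0.0
    return azimuth, tilt, distance
```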
  • the second position information is obtained according to the user's selection on the horizontal section.
  • the rendering device obtains the two-dimensional coordinates of the point selected by the user on the horizontal section and the distance between the horizontal section where the point is located and the center of the sphere (hereinafter referred to as the depth), and uses the following Formula 7 to convert the two-dimensional coordinates and the depth into the position of the first sound source, where the position of the first sound source includes the azimuth angle, the inclination angle, and the distance between the external playback device and the mobile phone.
  • a spherical view, a horizontal slice, and a depth control bar can be displayed on the right side of the interface of the rendering device.
  • the depth control bar is used to adjust the distance between the horizontal slice and the center of the sphere.
  • the user can click a point (x, y) on the horizontal plane (as shown in 117), and the corresponding spherical view in the upper right corner will display the position of the point in the spherical coordinate system.
  • the second position information includes the two-dimensional coordinates (x, y) of the point and the depth r, and Formula 7 is used to obtain the position of the first sound source.
  • where x is the abscissa of the point selected by the user on the horizontal section, y is the ordinate of the point selected by the user on the horizontal section, r is the depth, θ is the azimuth angle of the i-th external device, δ is the inclination angle of the i-th external device, and d is the distance between the i-th external device and the mobile phone (it can also be understood as the distance between the i-th external device and the user).
  • the above two manners of referring to the location information are just examples, and in practical applications, the reference location information may also have other situations, which are not specifically limited here.
  • the user can select the second position information through the spherical view (by a second operation such as clicking, dragging, or sliding) to control the selected sound-emitting object and perform real-time or later dynamic rendering, giving it a specific spatial orientation and motion; in this way, the user can create interactively with the audio, which provides the user with a new experience.
  • the reference information includes media information of the multimedia file.
  • the media information in the embodiment of the present application includes at least one of the text to be displayed in the multimedia file, the image to be displayed in the multimedia file, the musical feature of the music in the multimedia file, the sound source type corresponding to the first sounding object, and the like, There is no specific limitation here.
  • determining the position of the first sound source of the first sound-emitting object based on the music characteristics of the music in the multimedia file or the sound source type corresponding to the first sound-emitting object may be understood as automatic 3D remixing. Determining the position of the first sound source of the first sound-emitting object based on the position text to be displayed in the multimedia file or the image to be displayed in the multimedia file can be understood as multimodal remixing. They are described below:
  • After the rendering device determines that the first sounding object is a human voice, the rendering device can display the interface shown in FIG. 26, and the user can click the "select rendering mode icon" 107 to select the rendering mode, which is used to determine the position of the first sound source of the first sound-emitting object. As shown in FIG. 39, in response to the click operation, the rendering device may display a drop-down menu, which may include "automatic rendering options" and "interactive rendering options", where the "interactive rendering option" corresponds to the reference position information and the "automatic rendering option" corresponds to the media information. Further, as shown in FIG. 39, the user can click "auto rendering options" 119 to determine that the rendering mode is automatic rendering.
  • the user may click on “Auto Rendering Options” 119, and the rendering device may, in response to the click operation, display a drop-down menu as shown in FIG. 40, the drop-down menu may include “Auto 3D Remix Options” and "Multimodal Remix Option".
  • the user can click the "auto 3D remix option" 120; after the user selects automatic 3D remixing, the rendering device can display the interface shown in FIG. 41, in which "select rendering mode" can be replaced by "auto 3D remix" to prompt the user that the current rendering mode is automatic 3D remixing.
  • the media information includes the music characteristics of the music in the multimedia file.
  • the music feature in the embodiments of the present application may refer to at least one of music structure, music emotion, singing mode, and the like.
  • the music structure may include prelude, prelude vocals, verse, transition section, or chorus, etc.; music emotion includes cheerfulness, sadness, or panic, etc.; singing mode includes solo, chorus, or accompaniment, etc.
  • After the rendering device determines the multimedia file, it can analyze the music features in the audio track (which can also be understood as the audio, song, etc.) of the multimedia file.
  • the music feature can also be identified by an artificial method or a neural network method, which is not specifically limited here.
  • the position of the first sound source corresponding to the music feature may be determined according to a preset association relationship, where the association relationship is the relationship between the music feature and the position of the first sound source.
  • for example, the rendering device determines that the position of the first sound source is a surround trajectory, and the rendering device can display a corresponding interface.
  • the musical structure may generally include at least one of an intro, intro vocals, verse, transition, or chorus.
  • the following is a schematic illustration of analyzing the structure of a song as an example.
  • the vocal and musical instrument sounds are separated for the song, which can be separated manually or through a neural network, which is not specifically limited here.
  • the song can be segmented by judging the mute paragraph of the human voice and the variance of the pitch.
  • the specific steps include: if the vocal silence lasts longer than a certain threshold (for example, 2 seconds), the paragraph is considered to be over, and the song is thereby divided into large paragraphs; if there is no human voice in the first large paragraph, that large paragraph is determined to be an instrumental intro; and a large paragraph with silence in its middle is determined to be a transitional paragraph.
  • for the large paragraphs that include vocals (hereinafter referred to as large vocal paragraphs), the center frequency of each large vocal paragraph is calculated by the following Formula 8, the variance of the center frequency over all moments in the large vocal paragraph is calculated, and the large vocal paragraphs are sorted according to the variance; the large vocal paragraphs whose variance is in the top 50% are marked as the chorus, and the remaining 50% are marked as the verse.
  • in this way, the musical characteristics of the song are determined by the fluctuation of the frequency; in the subsequent rendering, for different large paragraphs, the sound source position or the motion trajectory of the sound source position can be determined through the preset association relationship, and the different large paragraphs of the song are then rendered accordingly.
  • if the music feature is a prelude, the first sound source position is a circle above the user (or understood as surround); the multi-channel audio is first down-mixed to mono (for example, by averaging), and in the whole prelude stage the vocal is set to circle around the head, where the speed at each moment is determined according to the vocal energy (represented by RMS or variance): the higher the energy, the faster the rotation speed.
  • if the music emotion is panic, the position of the first sound source is determined to flicker between the right and the left.
  • if the music feature is the chorus, the vocals of the left and right channels can be expanded and widened by increasing the delay; the number of instruments in each time period is determined, and if there is a solo instrument, the instrument is made to circle according to its energy during the solo period.
  • where fc is the center frequency of the large vocal paragraph calculated every 1 second, N is the number of large paragraphs (N is a positive integer), f(n) is the frequency domain obtained by the Fourier transform of the time-domain waveform corresponding to the large paragraph, and x(n) is the energy corresponding to a certain frequency.
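  • The sketch below illustrates the statistics described above: a spectral-centroid-style center frequency is computed per second of vocal audio (one plausible reading of Formula 8, fc = Σ f(n)·x(n) / Σ x(n)), and large vocal paragraphs are marked as chorus or verse by the variance of that center frequency. The one-second framing and the centroid formula are assumptions.

```python
import numpy as np

def center_frequency(frame, sample_rate):
    """Spectral-centroid-style center frequency of a 1-second vocal frame.

    Plausible reading of Formula 8: f_c = sum(f(n) * x(n)) / sum(x(n)),
    where x(n) is the energy of frequency bin f(n).
    """
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    energy = spectrum ** 2
    return float(np.sum(freqs * energy) / (np.sum(energy) + 1e-12))

def mark_chorus_and_verse(vocal_paragraphs, sample_rate):
    """Mark each large vocal paragraph as 'chorus' or 'verse' by centroid variance."""
    variances = []
    for para in vocal_paragraphs:                       # each para: 1-D waveform array
        frames = [para[i:i + sample_rate]
                  for i in range(0, len(para) - sample_rate, sample_rate)]
        centroids = [center_frequency(f, sample_rate) for f in frames]
        variances.append(np.var(centroids) if centroids else 0.0)
    order = np.argsort(variances)[::-1]                 # highest variance first
    labels = ["verse"] * len(vocal_paragraphs)
    for rank, idx in enumerate(order):
        labels[idx] = "chorus" if rank < len(order) / 2 else "verse"
    return labels
```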
  • the orientation and dynamics of the extracted specific sound-emitting objects are set according to the musical characteristics of the music, so that the 3D rendering is more natural and the artistry is better reflected.
  • the media information includes the sound source type corresponding to the first sounding object.
  • the sound source types in the embodiments of the present application may be people, musical instruments, drum sounds, piano sounds, etc. In practical applications, they may be divided according to needs, which is not specifically limited here.
  • the rendering device may identify the type of the sound source through an artificial method or a neural network method, which is not specifically limited here.
  • the first sound source position corresponding to the sound source type can be determined according to a preset association relationship, which is the relationship between the sound source type and the first sound source position (similar to the aforementioned music features, and not repeated here).
  • the user may select a multimedia file by clicking the “Select Input File Icon” 121 , and the multimedia file here is “car.mkv” as an example.
  • the rendering device receives the fourth operation of the user, and in response to the fourth operation, the rendering device selects "car.mkv” (ie, the target file) from the storage area as a multimedia file.
  • the storage area may be a storage area in the rendering device, or may be a storage area in an external device (such as a U disk, etc.), which is not specifically limited here.
  • the rendering device can display an interface in which "select input file" can be replaced by "car.mkv" to prompt the user that the current multimedia file is car.mkv.
  • the rendering device may use the identification network and/or the separation network in the embodiment shown in FIG. 4 to identify the sounding objects in "car.mkv” and separate the single-object audio track corresponding to each sounding object. For example, the rendering device recognizes that the sound-emitting objects included in "car.mkv” are people, cars, and wind sounds.
  • the interface displayed by the rendering device may also include an object bar, and icons such as the "voice icon", "car icon", and "wind sound icon" may be displayed in the object bar for the user to select the sounding object to be rendered.
  • the media information includes images to be displayed in the multimedia file.
  • after the rendering device obtains the multimedia file (an audio track with an accompanying image, or a video), the video can be split into frame images (one or more), the third position information of the first sound-emitting object is obtained based on the frame images, and the position of the first sound source is obtained based on the third position information, where the third position information includes the two-dimensional coordinates and the depth of the first sound-emitting object in the image.
  • the specific step of obtaining the position of the first sound source based on the third position information may include: inputting the frame image to the detection network to obtain the tracking frame information (x0, y0, w0, h0) corresponding to the first sounding object in the frame image; of course, the frame image and the first sounding object can also both be used as the input of the detection network, and the detection network then outputs the tracking frame information of the first sounding object.
  • the tracking frame information includes the two-dimensional coordinates (x0, y0) of a corner point of the tracking frame, and the height h0 and width w0 of the tracking frame.
  • the rendering device uses Formula 9 to calculate the coordinates (xc, yc) of the center point of the tracking frame from the tracking frame information (x0, y0, w0, h0); the center point coordinates (xc, yc) are then input to the depth estimation network to obtain the relative depth of each point in the tracking frame, and Formula 10 is then used to calculate, from the relative depth of each point in the tracking frame, the average depth zc of all points within the tracking frame.
  • δi = ynorm * θy_max ;
  • where (x0, y0) are the two-dimensional coordinates of a corner point of the tracking frame (for example, the lower-left corner point), h0 is the height of the tracking frame, and w0 is the width of the tracking frame; h1 is the height of the image and w1 is the width of the image; the relative depth is defined for each point in the tracking frame, and zc is the average depth of all points in the tracking frame; θx_max is the maximum horizontal angle of the playback device (if the playback device consists of N external devices, the playback device information of the N external devices is the same), θy_max is the maximum vertical angle of the playback device, and dy_max is the maximum depth of the playback device; θi is the azimuth angle of the i-th external device, δi is the inclination angle of the i-th external device, and ri is the distance between the i-th external device and the user.
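  • Formulas 9 to 12 are not reproduced above; the sketch below follows their description: the tracking-frame centre is computed from (x0, y0, w0, h0), the relative depths inside the frame are averaged, and the normalized centre coordinates and depth are scaled by the playback device's maximum horizontal angle, maximum vertical angle, and maximum depth. The exact normalization (including whether the angles are centred and how the relative depth is scaled) is an assumption.

```python
import numpy as np

def tracking_box_to_sound_source(x0, y0, w0, h0, relative_depth, w1, h1,
                                 theta_x_max, theta_y_max, d_y_max):
    """Plausible reading of Formulas 9-12.

    (x0, y0, w0, h0): tracking frame corner, width, and height in pixels.
    relative_depth:   2-D array of relative depths inside the tracking frame
                      (output of the depth estimation network).
    (w1, h1):         image width and height in pixels.
    theta_x_max, theta_y_max, d_y_max: playback device information (maximum
                      horizontal angle, maximum vertical angle, maximum depth).
    """
    # Formula 9: centre point of the tracking frame.
    x_c = x0 + w0 / 2.0
    y_c = y0 + h0 / 2.0
    # Formula 10: average depth of all points inside the tracking frame.
    z_c = float(np.mean(relative_depth))
    # Formulas 11-12 (assumed): normalized image coordinates scaled by the
    # playback device's maximum angles; the depth scaled by the maximum depth.
    theta_i = (x_c / w1) * theta_x_max           # azimuth angle
    delta_i = (y_c / h1) * theta_y_max           # inclination angle
    r_i = z_c * d_y_max                          # distance (relative depth in [0, 1] assumed)
    return theta_i, delta_i, r_i
```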
  • the user can click on the “multi-modal remix option” 122 , and the rendering device can respond to the click operation to display the interface as shown in FIG. 43 , the right side of the interface includes “car.mkv” A certain frame (for example, the first frame) image and playback device information, where the playback device information includes the maximum horizontal angle, the maximum vertical angle, and the maximum depth.
  • if the playback device is an earphone, the user can input the playback device information.
  • if the playback device is an external playback device, the user can input the playback device information or directly use the calibration information obtained in the calibration phase as the playback device information, which is not specifically limited here.
  • the rendering device may display an interface as shown in Figure 44, in which the "select rendering method" may be replaced by "multi-modal remix” to prompt the user of the current The rendering method is multimodal remixing.
  • the media information includes an image to be displayed in the multimedia file
  • there are several ways of determining the first sound-emitting object in the image, which are described below:
  • the first sounding object is determined by the user's click in the object column.
  • the rendering device may determine the first sound-emitting object based on the user's click in the object bar.
  • the user may determine that the sound-emitting object to be rendered is a car by clicking on the “car icon” 123 .
  • the rendering device displays the tracking frame of the car in the image on the right side of "car.mkv", and then obtains the third position information, and converts the third position information into the first sound source position through Formula 9 to Formula 12.
  • the interface also includes the corner point coordinates (x0, y0) of the lower left corner of the tracking frame and the center point coordinates (xc, yc).
  • the maximum horizontal angle in the external device information is 120 degrees
  • the maximum vertical angle is 60 degrees
  • the maximum depth is 10 (units may be meters, decimeters, etc., which are not specifically limited here).
  • the first sounding object is determined by the user's click on the image.
  • the rendering device may use the sound-emitting object determined by the user's third operation (eg, clicking) in the image as the first sound-emitting object.
  • the user may determine the first sounding object by clicking on the sounding object (as shown in 124 ) in the image.
  • the first sounding object is determined according to the default setting.
  • the rendering device may identify the sound-emitting object through the audio track corresponding to the image, may track the default sound-emitting object or all sound-emitting objects in the image, and determine the third position information.
  • the third position information includes the two-dimensional coordinates of the sounding object in the image and the depth of the sounding object in the image.
  • the rendering device may select "Close” by default in the object column, that is, all sound-emitting objects in the image are tracked, and the third position information of the first sound-emitting object is determined respectively.
  • the 3D immersion is rendered in the earphone or external playback environment, so that the sound realistically moves with the picture, allowing the user to obtain the best sound experience.
  • the technology of tracking and rendering the audio of the object in the entire video after selecting the sounding object can also be applied in professional mixing post-production, improving the work efficiency of the mixer.
  • the media information includes the position text to be displayed in the multimedia file.
  • the rendering device may determine the position of the first sound source based on the position text to be displayed in the multimedia file, where the position text is used to indicate the position of the first sound source.
  • the position text can be understood as a text with meanings such as position and orientation, for example: wind blows north, heaven, hell, front, back, left, right, etc., which are not specifically limited here.
  • the position text may specifically be lyrics, subtitles, advertisement slogans, etc., which is not specifically limited here.
  • the semantics of the displayed location characters can be recognized based on reinforcement learning or a neural network, and then the location of the first sound source is determined according to the semantics.
  • in step 1003, various situations of determining the position of the first sound source based on the reference information are described.
  • the position of the first sound source can also be determined in a combined manner. For example, after the position of the first sound source is determined by the sensor orientation, the motion trajectory of the first sound source position is determined by the music feature.
  • based on the first attitude angle of the sensor, the rendering device has determined that the position of the sound source of the human voice is as shown on the right side of the interface in FIG. 46. Further, the user can determine the movement trajectory of the human voice by clicking the "circle option" 125 in the menu on the right side of the "voice icon".
  • the position of the first sound source at a certain moment is determined by the orientation of the sensor first, and the motion trajectory of the position of the first sound source is determined by using music features or preset rules as a circle.
  • the interface of the rendering device can display the movement track of the generated object.
  • the user can control the distance in the position of the first sound source by controlling the volume key of the mobile phone or clicking, dragging, sliding, etc. on the spherical view.
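  • The combined approach (sensor orientation for the starting position, a preset "circle" rule for the trajectory) could look roughly like the sketch below; the period, radius, and function name circular_trajectory are illustrative assumptions:

```python
def circular_trajectory(base_azimuth_deg, elevation_deg, radius, period_s, t):
    """Hypothetical 'circle option': starting from the azimuth obtained from the
    sensor's first attitude angle, sweep the source around the listener once
    every period_s seconds, keeping elevation and distance fixed."""
    azimuth = (base_azimuth_deg + 360.0 * t / period_s) % 360.0
    return azimuth, elevation_deg, radius

# Sample the vocal position every second over an 8-second circle.
positions = [circular_trajectory(30.0, 0.0, 2.0, 8.0, t) for t in range(8)]
print(positions[:3])
```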
  • Step 1004 Perform spatial rendering on the first single-object audio track based on the position of the first sound source.
  • the rendering device may perform spatial rendering on the first single-object soundtrack, and obtain the rendered first single-object soundtrack.
  • the rendering device performs spatial rendering on the first single-object audio track based on the position of the first sound source, and obtains the rendered first single-object audio track.
  • the rendering device can also spatially render the first single-object audio track based on the first sound source position and render the second single-object audio track based on the second sound source position, to obtain the rendered first single-object audio track and the rendered second single-object audio track.
  • the method for determining the position of the sound source in this embodiment of the present application may apply the various methods in step 1003.
  • the combination is not specifically limited here.
  • the first sounding object is a person
  • the second sounding object is a violin
  • the position of the first sound source of the first single-object track corresponding to the first sounding object may adopt a certain method in interactive rendering.
  • the second sound source position of the second single-object audio track corresponding to the second sound-emitting object may adopt a certain method in automatic rendering.
  • the specific determination methods of the first sound source position and the second sound source position can be any two of the methods described in the foregoing step 1003, or the same method can be adopted for both, which is not specifically limited here.
  • the spherical view may also include a volume bar, and the user can control the volume of the first single-object audio track by performing operations such as finger sliding, mouse dragging, or using the mouse wheel on the volume bar, improving the real-time responsiveness of rendering the audio track.
  • the user can adjust the volume bar 126 to adjust the volume of the single-object track corresponding to the guitar.
  • the rendering method in this step may differ depending on the type of playback device. It can also be understood that the way the rendering device spatially renders the original audio track or the first single-object audio track, based on the location of the first sound source and the type of playback device, varies according to the type of playback device, as described below:
  • the playback device type is headphones.
  • the audio track can be rendered based on Formula 13 and the HRTF filter coefficient table.
  • the audio track may be a first single-object audio track, a second single-object audio track, or a first single-object audio track and a second single-object audio track, which is not specifically limited here.
  • the HRTF filter coefficient table is used to represent the relationship between the sound source position and the coefficient, and it can also be understood that one sound source position corresponds to one HRTF filter coefficient.
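  • A toy illustration of such a table lookup is sketched below; the grid points and coefficient values are made up, and a real implementation would use a dense measured HRTF/HRIR set with interpolation rather than nearest-neighbour selection:

```python
import numpy as np

# Hypothetical HRTF coefficient table: one (azimuth, elevation) grid point per
# entry, each holding a left/right filter pair.
HRTF_TABLE = {
    (0, 0):   (np.array([1.0, 0.3]), np.array([1.0, 0.3])),
    (90, 0):  (np.array([1.2, 0.4]), np.array([0.5, 0.1])),
    (-90, 0): (np.array([0.5, 0.1]), np.array([1.2, 0.4])),
}

def lookup_hrtf(azimuth_deg, elevation_deg):
    """Return the left/right filter pair for the grid point nearest to the
    requested sound source position (one position -> one coefficient pair)."""
    key = min(HRTF_TABLE,
              key=lambda k: (k[0] - azimuth_deg) ** 2 + (k[1] - elevation_deg) ** 2)
    return HRTF_TABLE[key]

hrir_left, hrir_right = lookup_hrtf(80, 5)   # selects the (90, 0) entry
```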
  • a s (t) is the adjustment coefficient of the first sounding object at time t.
  • h i,s (t) is the head-related transfer function (HRTF) filter coefficient of the left channel or right channel corresponding to the first sounding object at time t; the left-channel HRTF filter coefficient at time t is generally different from the right-channel HRTF filter coefficient at time t, and both are related to the position of the first sound source.
  • o s (t) is the first single-object audio track at time t, and τ is the integration variable.
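  • The headphone branch can be pictured as the following minimal sketch: the single-object track is convolved with the left/right impulse responses selected for the current first sound source position and scaled by the adjustment coefficient. This stands in for Formula 13 under the simplifying assumption that the filters are fixed over the excerpt; the impulse responses in the example are placeholders:

```python
import numpy as np

def render_binaural(single_object_track, hrir_left, hrir_right, gain=1.0):
    """Convolve the separated object o_s with the left/right head-related
    impulse responses chosen for its sound source position, then scale by the
    adjustment coefficient a_s. Real systems update the filters per frame as
    the source moves; here the position is assumed fixed for the excerpt."""
    left = gain * np.convolve(single_object_track, hrir_left)
    right = gain * np.convolve(single_object_track, hrir_right)
    return np.stack([left, right], axis=0)

# Toy example with made-up impulse responses.
track = np.random.randn(48000)          # 1 s of a separated object at 48 kHz
hl = np.array([1.0, 0.5, 0.25])         # placeholder left HRIR
hr = np.array([0.8, 0.4, 0.2])          # placeholder right HRIR
binaural = render_binaural(track, hl, hr, gain=0.9)
```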
  • the playback device type is an external playback device.
  • the audio track can be rendered based on Formula Fourteen.
  • the audio track may be a first single-object audio track, a second single-object audio track, or a first single-object audio track and a second single-object audio track, which is not specifically limited here.
  • the number of external playback devices may be N; the result of Formula 14 is the rendered first single-object audio track, where i indicates the i-th channel among the multiple channels, S is the set of sounding objects of the multimedia file and includes the first sounding object, a s (t) is the adjustment coefficient of the first sounding object at time t, g s (t) represents the panning (translation) coefficient of the first sounding object at time t, and o s (t) is the first single-object audio track at time t.
  • λ i is the azimuth angle obtained by the calibrator (such as the aforementioned sensor device) calibrating the i-th external device, Φ i is the inclination angle obtained by the calibrator calibrating the i-th external device, and r i is the distance between the i-th external device and the calibrator; N is a positive integer, i is a positive integer and i ≤ N, and the position of the first sound source is within the tetrahedron formed by the N external devices.
  • the single-object audio track corresponding to a certain sound-emitting object in the original audio track may be rendered and replaced, for example, S 1 in the above formula. It may also be a rendering of a single-object audio track corresponding to a certain sound-emitting object in the original audio track after duplication and addition, for example, S 2 in the above formula. Of course, it can also be a combination of the above S1 and S2 .
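  • For the loudspeaker branch, the sketch below converts each calibrated device's azimuth, inclination, and distance into a direction and derives per-channel gains. The actual panning coefficients g s (t) of Formula 14 are not reproduced here, so the direction-similarity panner is only an assumed stand-in that shows the shape of the computation:

```python
import numpy as np

def direction(azimuth_deg, inclination_deg):
    """Unit direction vector from calibrated azimuth/inclination angles."""
    az, inc = np.radians(azimuth_deg), np.radians(inclination_deg)
    return np.array([np.cos(inc) * np.cos(az),
                     np.cos(inc) * np.sin(az),
                     np.sin(inc)])

def pan_gains(source_dir, speaker_dirs):
    """Hypothetical stand-in for the panning coefficients: each calibrated
    loudspeaker gets a gain proportional to how well its direction matches the
    source direction, normalised so the gains preserve power."""
    sims = np.array([max(0.0, float(np.dot(source_dir, d))) for d in speaker_dirs])
    norm = np.linalg.norm(sims)
    if norm == 0:
        return np.full(len(speaker_dirs), 1.0 / np.sqrt(len(speaker_dirs)))
    return sims / norm

# Four speakers calibrated by the sensor device: (azimuth, inclination, distance).
speakers = [(45, 0, 2.0), (-45, 0, 2.0), (135, 0, 2.0), (-135, 0, 2.0)]
speaker_dirs = [direction(a, p) for a, p, _ in speakers]
gains = pan_gains(direction(30, 10), speaker_dirs)
# Channel i of the rendered object is then gains[i] * a_s(t) * o_s(t).
```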
  • Step 1005 Acquire a target audio track based on the rendered first single-object audio track.
  • the target audio track obtained in this step may be different, and it can also be understood as the method used by the rendering device to obtain the target audio track, which varies according to the type of playback device, as described below:
  • the playback device type is headphones.
  • the rendering device may acquire the target audio track based on Formula 15 and the rendered audio track.
  • the audio track may be a first single-object audio track, a second single-object audio track, or a first single-object audio track and a second single-object audio track, which is not specifically limited here.
  • i indicates the left channel or the right channel.
  • X i (t) is the original audio track at time t.
  • a s (t) is the adjustment coefficient of the first sounding object at time t.
  • h i,s (t) is the HRTF filter coefficient of the left channel or right channel corresponding to the first sounding object at time t, and is related to the position of the first sound source.
  • o s (t) is the first single-object audio track at time t, and τ is the integration variable.
  • S 1 is the set of sounding objects in the original audio track that need to be replaced; if the first sounding object is to replace a sounding object in the original audio track, then S 1 is an empty set.
  • S 2 is the set of sounding objects that the target audio track adds compared with the original audio track; if the first sounding object is a copy of a sounding object in the original audio track, then S 2 is an empty set.
  • It can be understood that, when the spatial rendering replaces a sound-emitting object, the single-object audio track corresponding to that sound-emitting object is spatially rendered and the rendered single-object audio track is used to replace the original single-object audio track in the multimedia file. In other words, compared with the multimedia file, the target audio track does not contain an additional single-object audio track for that sound-emitting object; the original single-object audio track in the multimedia file is simply replaced with the rendered single-object audio track.
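  • The remix step for headphones can be paraphrased as: subtract the un-rendered single-object tracks of the replaced objects from the original track, then add the rendered single-object tracks. The sketch below does exactly that; the function name make_target_track and the toy arrays are illustrative:

```python
import numpy as np

def make_target_track(original, unrendered_replaced, rendered_objects):
    """Paraphrase of the headphone remix: remove the un-rendered single-object
    tracks of the objects being replaced (S1) from the original stereo track,
    then add the rendered single-object tracks (S1 and/or S2). If an object is
    only duplicated, nothing is subtracted for it."""
    target = original.copy()
    for track in unrendered_replaced:   # objects in S1, as they sit in the mix
        target -= track
    for track in rendered_objects:      # rendered objects in S1 and/or S2
        target += track
    return target

stereo = np.zeros((2, 48000))                    # toy original stereo track
vocal_unrendered = np.zeros((2, 48000))          # the vocal as it sits in the mix
vocal_rendered = np.random.randn(2, 48000) * 0.1 # the spatially rendered vocal
target = make_target_track(stereo, [vocal_unrendered], [vocal_rendered])
```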
  • the playback device type is an external playback device.
  • the rendering device may acquire the target audio track based on Formula 16 and the rendered audio track.
  • the audio track may be a first single-object audio track, a second single-object audio track, or a first single-object audio track and a second single-object audio track, which is not specifically limited here.
  • the number of external playback devices may be N; the result of Formula 16 is the target audio track at time t, where i indicates the i-th channel among the multiple channels, X i (t) is the original audio track at time t, and the target audio track is obtained from the un-rendered first single-object audio track and the rendered first single-object audio track.
  • a s (t) is the adjustment coefficient of the first sounding object at time t, g s (t) represents the panning (translation) coefficient of the first sounding object at time t, g i,s (t) represents the i-th row of g s (t), and o s (t) is the first single-object audio track at time t.
  • S 1 is the set of sounding objects in the original audio track that need to be replaced; if the first sounding object is to replace a sounding object in the original audio track, then S 1 is an empty set.
  • S 2 is the set of sounding objects that the target audio track adds compared with the original audio track; if the first sounding object is a copy of a sounding object in the original audio track, then S 2 is an empty set.
  • S 1 and/or S 2 are sounding objects of the multimedia file and include the first sounding object.
  • λ i is the azimuth angle obtained by the calibrator calibrating the i-th external device, Φ i is the inclination angle obtained by the calibrator calibrating the i-th external device, and r i is the distance between the i-th external device and the calibrator; N is a positive integer, i is a positive integer and i ≤ N, and the position of the first sound source is within the tetrahedron formed by the N external devices.
  • a new multimedia file can also be generated according to the multimedia file and the target audio track, which is not specifically limited here.
  • the user can upload the setting method of the sound source position during the rendering process to the database module corresponding to the aforementioned Figure 8, so as to facilitate other users to use this setting method to render other audio tracks.
  • the user can also download the setting method from the database module and modify it to facilitate the spatial rendering of the audio track.
  • the modification of rendering rules and sharing between different users are added.
  • in the multimodal mode, repeated object recognition and tracking of the same file can be avoided, and the overhead on the end side can be reduced; on the other hand, the user's free creation in the interactive mode can be shared with other users to further enhance the interactivity of the application.
  • the user may choose to synchronize the rendering rule file stored in the local database to other devices of the user.
  • the user can choose to upload the rendering rule file stored in the local database to the cloud for sharing with other users, and other users can choose to download the corresponding rendering rule file from the cloud database to the terminal.
  • the metadata file stored in the database is mainly used to render the sound-emitting object separated by the system or the object specified by the user in automatic mode, or the sound-emitting object specified by the user that needs to be automatically rendered according to the stored rendering rules in mixed mode. render.
  • the metadata files stored in the database can be prefabricated by the system, such as serial numbers 1 and 2 in Table 1; they can also be created by the user when using the interactive mode of the present invention, such as serial numbers 3-6 in Table 1; or they can be stored after the system automatically identifies the sounding object and specifies its motion trajectory in the multimodal mode, such as serial number 7 in Table 1.
  • the metadata file can be strongly related to the audio content of the multimedia file or the multimodal file content: for example, serial number 3 in Table 1 is the metadata file corresponding to audio file A1, and serial number 4 is the metadata file corresponding to audio file A2. It can also be decoupled from the multimedia file: the user performs an interactive operation on object X of audio file A in the interactive mode and saves the motion trajectory of object X as the corresponding metadata file (for example, the "free hovering and rising" state of serial number 5 in Table 1); when using automatic rendering next time, the user can select the "free hovering and rising" metadata file from the database module to render object Y of audio file B.
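  • One possible on-disk shape for such a metadata (rendering rule) record is sketched below; the field names and JSON layout are assumptions, not the patent's format:

```python
import json

# Hypothetical layout for one rendering-rule (metadata) record, in the spirit
# of Table 1: it may be prefabricated, created interactively, or saved from
# multimodal tracking, and may or may not be bound to a specific media file.
rule = {
    "id": 5,
    "name": "free hovering and rising",
    "bound_file": None,                  # None = decoupled from any media file
    "object_type": "vocal",
    "trajectory": [                      # (t_seconds, azimuth, elevation, distance)
        (0.0, 0.0, 0.0, 1.0),
        (2.0, 90.0, 20.0, 1.5),
        (4.0, 180.0, 40.0, 2.0),
    ],
}

def save_rule(record, path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False, indent=2)

def load_rule(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)

save_rule(rule, "rule_free_hovering_rising.json")
print(load_rule("rule_free_hovering_rising.json")["name"])
```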
  • the rendering method provided by this embodiment of the present application includes steps 1001 to 1005 .
  • the rendering method provided by this embodiment of the present application includes steps 1002 to 1005.
  • the rendering method provided by this embodiment of the present application includes steps 1001 to 1004 .
  • the rendering method provided by this embodiment of the present application includes steps 1002 to 1004 .
  • each step shown in FIG. 10 in the embodiment of the present application does not limit the timing relationship.
  • step 1001 in the above method may also be performed after step 1002, that is, after the audio track is acquired, and the playback device is calibrated.
  • the user can control the sound image, quantity, and volume of a specific sound-emitting object through the mobile phone sensor, drag and drop the sound image, quantity, and volume of the specific sound-emitting object through the mobile phone interface, and control them through automated rules applied to the music.
  • the spatial rendering of specific sound-emitting objects improves spatiality; together with the automatic rendering of sound source positions through multimodal recognition and the rendering method for single sound-emitting objects, it provides a sound effect experience completely different from the traditional music and movie interaction mode and offers a new interactive way to appreciate music.
  • automated 3D re-production enhances the sense of space of binaural music and brings listening to a new level.
  • the separately designed interactive method enhances the user's ability to edit audio; it can be applied to the production of sound-emitting objects in music and film and television works, allowing the motion information of specific sound-emitting objects to be edited simply. It also increases the user's control over the music and its playability, allowing users to experience the fun of producing audio themselves and the ability to control specific sound-emitting objects.
  • the present application also provides two specific application scenarios for applying the above-mentioned rendering method, which are described below:
  • the first is the "Sound Hunter" game scene.
  • This scenario can also be understood as follows: the user points to the perceived sound source position, the system judges whether the user's pointing is consistent with the actual sound source position, and the user's operation is scored, improving the user's entertainment experience.
  • the device can display the interface shown in Figure 51.
  • the user can click the play button at the bottom of the interface to confirm the start of the game, and the playing device will play at least one single-object track in a certain order and at any position.
  • the playback device plays the single-object track of the piano, the user judges the position of the sound source by hearing, and holds the mobile phone to point to the position of the sound source judged by the user. If it is consistent (or the error is within a certain range), the rendering device can display the prompt "hit the first instrument, it took 5.45 seconds, and beat 99.33% of the people in the universe" as shown in the right interface of Figure 51.
  • the corresponding sounding object in the object bar can change from red to green.
  • if the user's pointing is inconsistent with the actual sound source position, a failure may be displayed and the corresponding sounding object in the object bar can remain red. After the preset time period (the time interval T in FIG. 54), the next single-object track is played to continue the game, as shown in FIG. 52 and FIG. 53.
  • the rendering device can display the interface shown in FIG. 53 .
  • the orientation of the object's audio is rendered in the playback system in real time, and the game is designed so that the user can obtain the ultimate "locating by listening" experience; it can be applied to home entertainment, AR and VR games, etc. Compared with the prior art related to "listening to the position", which only works on a whole song, the present application provides a game that is played after the vocals and instruments of a song have been separated.
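  • A hit/miss judgement of the kind described above could be sketched as follows; the 15-degree tolerance and 30-second timeout are made-up values standing in for the preset error range and the time interval T:

```python
import math
import time

def angular_error_deg(az1, el1, az2, el2):
    """Great-circle angle between the true source direction and the direction
    the user points the phone at (both given as azimuth/elevation in degrees)."""
    a1, e1, a2, e2 = map(math.radians, (az1, el1, az2, el2))
    cos_angle = (math.sin(e1) * math.sin(e2)
                 + math.cos(e1) * math.cos(e2) * math.cos(a1 - a2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))

def judge_round(true_dir, pointed_dir, started_at,
                tolerance_deg=15.0, timeout_s=30.0):
    """Hypothetical scoring rule: a hit if the pointing error is within the
    tolerance before the preset time interval expires, otherwise a miss."""
    elapsed = time.time() - started_at
    hit = (elapsed <= timeout_s
           and angular_error_deg(*true_dir, *pointed_dir) <= tolerance_deg)
    return hit, elapsed

start = time.time()
print(judge_round((30.0, 0.0), (25.0, 5.0), start))   # -> (True, ~0.0)
```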
  • the second is a multi-person interactive scene.
  • This scene can be understood as multiple users controlling the sound source position of a specific sounding object respectively, so as to realize the rendering of the sound track by multiple people, and increase the entertainment and communication among the multiple users.
  • the interactive scene may specifically be an online multi-person band or an online host controlling a symphony, etc.
  • the multimedia file is music composed of multiple musical instruments.
  • User A can select a multi-person interaction mode and invite user B to complete the creation together.
  • the rendering tracks given by the corresponding users are rendered and remixed respectively, and then the remixed audio files are sent to each participating user.
  • the interaction modes selected by different users may be different, which is not specifically limited here. For example, as shown in Figure 55, user A selects the interactive mode and controls the position of object A by changing the orientation of the mobile phone he uses, while user B selects the interactive mode and controls the position of object B by changing the orientation of the mobile phone he uses.
  • the system can send the remixed audio file to each user participating in the multi-person interactive application.
  • the position of object A and the position of object B in the audio file correspond to the manipulations of user A and user B, respectively.
  • user A selects an input multimedia file
  • the system identifies the object information in the input file, and feeds back to user A through the UI interface.
  • User A selects the mode. If user A selects the multi-person interaction mode, user A sends a multi-person interaction request to the system, and sends the information of the designated invitee to the system.
  • the system sends an interaction request to user B selected by user A. If user B accepts the request, it sends an acceptance instruction to the system to join the multi-person interactive application created by user A.
  • User A and User B respectively select the sound-emitting object to be operated, and use the above-mentioned rendering mode to control the selected sound-emitting object, and file the corresponding rendering rule.
  • the system separates the single-object audio tracks through the separation network, and renders the separated single-object audio tracks according to the rendering track provided by the user corresponding to the sounding object, and then remixes the rendered single-object audio tracks to obtain the target audio. track, and then send the target audio track to each participating user.
  • the multiplayer interaction mode can be the real-time online multiplayer interaction described in the above example, or the multiplayer interaction in an offline situation.
  • the multimedia file selected by user A is duet music, including singer A and singer B.
  • user A can select the interactive mode to control the rendering effect of singer A, and share the re-rendered target audio track to user B; user B can use the received target audio track shared by user A as Input the file to control the rendering effect of singer B.
  • the interaction modes selected by different users may be the same or different, which are not specifically limited here.
  • real-time and non-real-time interactive rendering control with multiple people is supported, and users can invite other users to re-render and create different sound-emitting objects in multimedia files, enhancing the interactive experience and the fun of the application.
  • through the above-mentioned method, multiple people cooperatively control the sound image of different objects, thereby realizing multi-person rendering of the multimedia file.
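  • The multi-person flow can be summarised by the sketch below, in which separate, render_object, and remix are stand-ins for the separation network and the rendering and remixing steps described earlier; the assignment format is hypothetical:

```python
def collaborative_render(multimedia_file, assignments, separate, render_object, remix):
    """assignments: {user_id: (object_name, rendering_rule)}.
    Each participant controls one sounding object; its separated track is
    rendered with that participant's rule, everything is remixed, and the same
    target track is returned to every participant."""
    object_tracks = separate(multimedia_file)             # {object_name: track}
    rendered = {}
    for user_id, (obj, rule) in assignments.items():
        rendered[obj] = render_object(object_tracks[obj], rule)
    target = remix(multimedia_file, object_tracks, rendered)
    return {user_id: target for user_id in assignments}

# Toy demo with stand-in callables.
demo = collaborative_render(
    "duet.mp3",
    {"userA": ("singer_A", "orbit_left"), "userB": ("singer_B", "orbit_right")},
    separate=lambda f: {"singer_A": [0.1] * 4, "singer_B": [0.2] * 4},
    render_object=lambda track, rule: [x * 2 for x in track],
    remix=lambda f, tracks, rendered: [sum(v) for v in zip(*rendered.values())],
)
print(demo["userA"])
```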
  • An embodiment of the rendering device in the embodiment of the present application includes:
  • Obtaining unit 5801 is used to obtain the first single-object soundtrack based on the multimedia file, and the first single-object soundtrack corresponds to the first sounding object;
  • a determining unit 5802 configured to determine the first sound source position of the first sounding object based on reference information, the reference information includes reference position information and/or media information of a multimedia file, and the reference position information is used to indicate the first sound source position;
  • the rendering unit 5803 is configured to perform spatial rendering on the first single-object audio track based on the first sound source position, so as to obtain the rendered first single-object audio track.
  • each unit in the rendering device is similar to those described in the foregoing embodiments shown in FIG. 5 to FIG. 11 , and details are not repeated here.
  • the obtaining unit 5801 obtains the first single-object soundtrack based on the multimedia file, and the first single-object soundtrack corresponds to the first sounding object; the determining unit 5802 determines the first sound source position of the first sounding object based on the reference information, The rendering unit 5803 spatially renders the first single-object audio track based on the position of the first sound source to obtain the rendered first single-object audio track.
  • the stereoscopic sense of space of the first single-object soundtrack corresponding to the first sound-emitting object in the multimedia file can be improved, and the user can be provided with immersive stereoscopic sound effects.
  • another embodiment of the rendering device in the embodiment of the present application includes:
  • Obtaining unit 5901 is used to obtain the first single-object soundtrack based on the multimedia file, and the first single-object soundtrack corresponds to the first sounding object;
  • a determining unit 5902 configured to determine the first sound source position of the first sounding object based on reference information, the reference information includes reference position information and/or media information of a multimedia file, and the reference position information is used to indicate the first sound source position;
  • the rendering unit 5903 is configured to perform spatial rendering on the first single-object audio track based on the position of the first sound source, so as to obtain the rendered first single-object audio track.
  • the providing unit 5904 is used to provide a spherical view for the user to select, the center of the spherical view is the position of the user, and the radius of the spherical view is the distance between the user's position and the playback device;
  • the sending unit 5905 is used for sending the target audio track to the playback device, and the playback device is used for playing the target audio track.
  • each unit in the rendering device is similar to those described in the foregoing embodiments shown in FIG. 5 to FIG. 11 , and details are not repeated here.
  • the obtaining unit 5901 obtains the first single-object soundtrack based on the multimedia file, and the first single-object soundtrack corresponds to the first sounding object; the determining unit 5902 determines the first sound source position of the first sounding object based on the reference information, The rendering unit 5903 spatially renders the first single-object audio track based on the first sound source position, so as to obtain the rendered first single-object audio track.
  • the stereoscopic sense of space of the first single-object soundtrack corresponding to the first sound-emitting object in the multimedia file can be improved, and the user can be provided with immersive stereoscopic sound effects. In addition, it provides a sound effect experience that is completely different from the traditional music movie interaction mode. It provides a new interactive way for music appreciation.
  • automated 3D re-production enhances the sense of space of binaural music and brings listening to a new level.
  • the interactive method designed by us has been added separately to enhance the user's ability to edit audio, which can be applied to the production of sound-emitting objects in music and film and television works, and simply edit the motion information of specific sound-emitting objects. It also increases the user's control and playability of music, allowing users to experience the fun of making audio by themselves and the ability to control specific sound-emitting objects.
  • an embodiment of the rendering device in the embodiment of the present application includes:
  • the obtaining unit 6001 is further configured to obtain the first single-object soundtrack based on the multimedia file, and the first single-object soundtrack corresponds to the first sounding object;
  • a display unit 6002 configured to display a user interface, where the user interface includes rendering mode options;
  • a determining unit 6003 configured to respond to the user's first operation on the user interface, and determine an automatic rendering mode or an interactive rendering mode from the rendering mode options;
  • the acquiring unit 6001 is further configured to acquire the rendered first single-object audio track based on the preset mode when the determination unit determines the automatic rendering mode; or
  • the obtaining unit 6001 is further configured to obtain reference position information in response to the second operation of the user when the determination unit determines the interactive rendering mode; determine the first sound source position of the first sound-emitting object based on the reference position information; The sound source position renders the first single-object audio track to obtain the rendered first single-object audio track.
  • each unit in the rendering device is similar to those described in the foregoing embodiments shown in FIG. 5 to FIG. 11 , and details are not repeated here.
  • the determining unit 6003 determines the automatic rendering mode or the interactive rendering mode from the rendering mode options according to the user's first operation, and the obtaining unit 6001 then obtains the rendered first single-object audio track accordingly.
  • the spatial rendering of the audio track corresponding to the first sound-emitting object in the multimedia file can be realized through the interaction between the rendering device and the user, so as to provide the user with an immersive stereo sound effect.
  • the rendering device may include a processor 6101 , a memory 6102 and a communication interface 6103 .
  • the processor 6101, the memory 6102 and the communication interface 6103 are interconnected by wires.
  • the memory 6102 stores program instructions and data.
  • the memory 6102 stores program instructions and data corresponding to the steps performed by the rendering device in the corresponding embodiments shown in FIG. 5 to FIG. 11 .
  • the processor 6101 is configured to perform the steps performed by the rendering device shown in any of the foregoing embodiments shown in FIG. 5 to FIG. 11 .
  • the communication interface 6103 may be used to receive and transmit data, and to perform the steps related to acquisition, transmission, and reception in any of the foregoing embodiments shown in FIG. 5 to FIG. 11 .
  • the rendering device may include more or less components relative to FIG. 61 , which are merely illustrative and not limited in this application.
  • This embodiment of the present application also provides a sensor device, as shown in FIG. 62 .
  • the sensor device can be any terminal device including a mobile phone, tablet computer, etc. Taking the sensor as a mobile phone as an example:
  • FIG. 62 is a block diagram showing a partial structure of a sensor device-mobile phone provided by an embodiment of the present application.
  • the mobile phone includes: a radio frequency (RF) circuit 6210, a memory 6220, an input unit 6230, a display unit 6240, a sensor 6250, an audio circuit 6260, a wireless fidelity (WiFi) module 6270, and a processor 6280 , and the power supply 6290 and other components.
  • The structure of the mobile phone shown in FIG. 62 does not constitute a limitation on the mobile phone; the mobile phone may include more or fewer components than shown, combine some components, or use a different arrangement of components.
  • the RF circuit 6210 can be used for receiving and sending signals during sending and receiving of information or during a call. In particular, after receiving the downlink information of the base station, it is processed by the processor 6280; in addition, it sends the designed uplink data to the base station.
  • the RF circuit 6210 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like.
  • the RF circuit 6210 can also communicate with networks and other devices via wireless communication.
  • the above-mentioned wireless communication can use any communication standard or protocol, including but not limited to the global system of mobile communication (global system of mobile communication, GSM), general packet radio service (general packet radio service, GPRS), code division multiple access (code division multiple access) multiple access, CDMA), wideband code division multiple access (WCDMA), long term evolution (long term evolution, LTE), email, short message service (short messaging service, SMS) and so on.
  • the memory 6220 can be used to store software programs and modules, and the processor 6280 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 6220 .
  • the memory 6220 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playback function, an image playback function, etc.), and the like; the data storage area may store data created according to the use of the mobile phone (such as audio data, a phone book, etc.), and the like.
  • memory 6220 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
  • the input unit 6230 can be used for receiving inputted numerical or character information, and generating key signal input related to user setting and function control of the mobile phone.
  • the input unit 6230 may include a touch panel 6231 and other input devices 6232 .
  • the touch panel 6231, also known as a touch screen, can collect the user's touch operations on or near it (such as operations performed by the user on or near the touch panel 6231 with a finger, a stylus, or any other suitable object or accessory), and drive the corresponding connection device according to a preset program.
  • the touch panel 6231 may include two parts, a touch detection device and a touch controller.
  • the touch detection device detects the user's touch orientation, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, and then sends them to the processor 6280.
  • the touch panel 6231 can be implemented in various types such as resistive, capacitive, infrared, and surface acoustic waves.
  • the input unit 6230 may further include other input devices 6232.
  • other input devices 6232 may include, but are not limited to, one or more of physical keyboards, function keys (such as volume control keys, switch keys, etc.), trackballs, mice, joysticks, and the like.
  • the display unit 6240 may be used to display information input by the user or information provided to the user and various menus of the mobile phone.
  • the display unit 6240 may include a display panel 6241.
  • the display panel 6241 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like.
  • the touch panel 6231 can cover the display panel 6241. When the touch panel 6231 detects a touch operation on or near it, it transmits the operation to the processor 6280 to determine the type of the touch event, and then the processor 6280 provides a corresponding visual output on the display panel 6241 according to the type of the touch event.
  • Although the touch panel 6231 and the display panel 6241 are shown as two independent components to implement the input and output functions of the mobile phone, in some embodiments the touch panel 6231 and the display panel 6241 can be integrated to implement the input and output functions of the mobile phone.
  • the cell phone may also include at least one sensor 6250, such as light sensors, motion sensors, and other sensors.
  • the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display panel 6241 according to the brightness of the ambient light, and the proximity sensor may turn off the display panel 6241 and/or when the mobile phone is moved to the ear. or backlight.
  • the accelerometer sensor can detect the magnitude of acceleration in all directions (usually three axes), and can detect the magnitude and direction of gravity when it is stationary.
  • the audio circuit 6260, the speaker 6262, and the microphone 6262 can provide the audio interface between the user and the mobile phone.
  • the audio circuit 6260 can convert received audio data into an electrical signal and transmit it to the speaker 6262, which converts it into a sound signal for output; on the other hand, the microphone 6262 converts a collected sound signal into an electrical signal, which is received by the audio circuit 6260 and converted into audio data; the audio data is then output to the processor 6280 for processing and sent, for example, to another mobile phone through the RF circuit 6210, or output to the memory 6220 for further processing.
  • WiFi is a short-distance wireless transmission technology.
  • the mobile phone can help users to send and receive emails, browse web pages and access streaming media through the WiFi module 6270. It provides users with wireless broadband Internet access.
  • FIG. 62 shows the WiFi module 6270, it can be understood that it is not a necessary component of the mobile phone.
  • the processor 6280 is the control center of the mobile phone. It uses various interfaces and lines to connect the various parts of the entire mobile phone, and performs various functions of the mobile phone and processes data by running or executing the software programs and/or modules stored in the memory 6220 and calling the data stored in the memory 6220.
  • the processor 6280 may include one or more processing units; preferably, the processor 6280 may integrate an application processor and a modem processor, wherein the application processor mainly processes the operating system, user interface, and application programs, etc. , the modem processor mainly deals with wireless communication. It can be understood that, the above-mentioned modulation and demodulation processor may not be integrated into the processor 6280.
  • the mobile phone also includes a power supply 6290 (such as a battery) for supplying power to various components.
  • the power supply can be logically connected to the processor 6280 through a power management system, so as to manage charging, discharging, and power consumption management functions through the power management system.
  • the mobile phone may also include a camera, a Bluetooth module, and the like, which will not be repeated here.
  • the processor 6280 included in the mobile phone may perform the functions in the foregoing embodiments shown in FIG. 5 to FIG. 11 , which will not be repeated here.
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of units is only a logical function division, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
  • the technical solutions of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.


Abstract

A rendering method and a related device, which can be applied to scenarios such as music and film/television production. The method may be executed by a rendering device, or by a component of the rendering device (for example, a processor, a chip, or a chip system). The method includes: obtaining a first single-object audio track based on a multimedia file, the first single-object audio track corresponding to a first sounding object; determining a first sound source position of the first sounding object based on reference information; and spatially rendering the first single-object audio track based on the first sound source position to obtain a rendered first single-object audio track. This can improve the stereo spatial sense of the first single-object audio track corresponding to the first sounding object in the multimedia file and provide the user with immersive stereo sound effects.

Description

一种渲染方法及相关设备
本申请要求于2021年4月29日提交中国专利局、申请号为CN202110477321.0、发明名称为“一种渲染方法及相关设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及音频应用领域,尤其涉及一种渲染方法及相关设备。
背景技术
随着音视频播放技术的不断成熟,人们对音视频播放设备的播放效果要求越来越高。
目前,为了使用户播放音视频时可以体验到逼真的立体环绕音效,音视频播放设备可以采用头相关传输函数(head related transfer function,HRTF)等处理技术对待播放的音视频数据进行处理。
但是,网上的大量音视频数据(例如:音乐、影视作品等)均为双声道/多声道的音轨。如何对音轨中的单发声对象进行空间渲染是亟待解决的问题。
发明内容
本申请实施例提供了一种渲染方法,可以提升多媒体文件中第一发声对象对应的第一单对象音轨的立体空间感,为用户提供身临其境的立体音效。
本申请实施例第一方面提供了一种渲染方法,可以应用于音乐、影视作品制作等场景,该方法可以由渲染设备执行,也可以由渲染设备的部件(例如处理器、芯片、或芯片***等)执行。该方法包括:基于多媒体文件获取第一单对象音轨,第一单对象音轨与第一发声对象对应;基于参考信息确定第一发声对象的第一声源位置,参考信息包括参考位置信息和/或多媒体文件的媒体信息,参考位置信息用于指示第一声源位置;基于第一声源位置对第一单对象音轨进行空间渲染,以获取渲染后的第一单对象音轨。
本申请实施例中,基于多媒体文件获取第一单对象音轨,第一单对象音轨与第一发声对象对应;基于参考信息确定第一发声对象的第一声源位置,基于第一声源位置对第一单对象音轨进行空间渲染,以获取渲染后的第一单对象音轨音轨。可以提升多媒体文件中第一发声对象对应的第一单对象音轨的立体空间感,为用户提供身临其境的立体音效。
可选地,在第一方面的一种可能的实现方式中,上述步骤中的媒体信息包括:多媒体文件中需要显示的文字、多媒体文件中需要显示的图像、多媒体文件中需要播放的音乐的音乐特征以及第一发声对象对应的声源类型中的至少一种。
该种可能的实现方式中,若媒体信息包括音乐特征,渲染设备可以根据音乐的音乐特征来对提取出来的特定发声对象进行方位和动态的设置,使得对该发声对象对应的音轨在3D渲染中更加自然,艺术性也有更好地体现。若媒体信息包括文字、图像等,在耳机或外放环境下渲染3D沉浸感,做到了真正的音随画动,使得用户获取最优的音效体验。此外,若媒体信息包括视频,追踪视频中的发声对象,并对整段视频里发声对象对应的音轨进行渲染,也可 以应用在专业的混音后期制作中,提升混音师的工作效率。
可选地,在第一方面的一种可能的实现方式中,上述步骤中的参考位置信息包括传感器的第一位置信息或用户选择的第二位置信息。
该种可能的实现方式中,若参考位置信息包括传感器的第一位置信息,用户可以通过传感器提供的朝向或位置对选定的发声对象进行实时或者后期的动态渲染。这样控制可以赋予发声对象具体的空间方位以及运动,实现用户与音频进行交互创作,提供用户新的体验。若参考位置信息包括用户选择的第二位置信息,用户可以通过界面拖拽的方法来来控制选定的发声对象并进行实时或者后期的动态渲染,赋予其具体的空间方位以及运动,可以实现用户与音频进行交互创作,提供用户新的体验。此外,还可以在用户没有任何传感器的情况下对发声对象进行声像的编辑。
可选地,在第一方面的一种可能的实现方式中,上述步骤还包括:确定播放设备的类型,播放设备用于播放目标音轨,目标音轨根据渲染后的第一单对象音轨获取;基于第一声源位置对第一单对象音轨进行空间渲染,包括:基于第一声源位置以及播放设备的类型对第一单对象音轨进行空间渲染。
该种可能的实现方式中,在对音轨进行空间渲染时,考虑播放设备的类型。不同的播放设备类型可以对应不同的空间渲染公式,实现后期再播放设备播放渲染后的第一单对象音轨的空间效果更加真实和准确。
可选地,在第一方面的一种可能的实现方式中,上述步骤中的参考信息包括媒体信息,当媒体信息包括图像且图像包括第一发声对象时,基于参考信息确定第一发声对象的第一声源位置,包括:确定图像内第一发声对象的第三位置信息,第三位置信息包括第一发声对象在图像内的二维坐标以及深度;基于第三位置信息获取第一声源位置。
该种可能的实现方式中,结合音频视频图像的多模态,提取发声对象的坐标以及单对象音轨后,在耳机或外放环境下渲染3D沉浸感,做到了真正的音随画动,使得用户获取最优的音效体验。此外,选择发声对象后在整段视频里进行跟踪渲染对象音频的技术也可应用在专业的混音后期制作中,提升混音师的工作效率。通过将视频中音频的单对象音轨分离出来,并通过对视频图像中的发声对象的分析和跟踪,获取发声对象的运动信息,来对选定的发声对象并进行实时或者后期的动态渲染。实现视频画面与音频声源方向的匹配,提升用户体验。
可选地,在第一方面的一种可能的实现方式中,上述步骤中的参考信息包括媒体信息,当媒体信息包括多媒体文件中需要播放的音乐的音乐特征时,基于参考信息确定第一发声对象的第一声源位置,包括:基于关联关系与音乐特征确定第一声源位置,关联关系用于表示音乐特征与第一声源位置的关联。
该种可能的实现方式中,根据音乐的音乐特征来对提取出来的特定发声对象进行方位和动态的设置,使得3D渲染更加自然,艺术性也有更好地体现。
可选地,在第一方面的一种可能的实现方式中,上述步骤中的媒体信息包括媒体信息,当媒体信息包括多媒体文件中需要显示的文字且文字包含有与位置相关的位置文字时,基于参考信息确定第一发声对象的第一声源位置,包括:识别位置文字;基于位置文字确定第一声源位置。
该种可能的实现方式中,通过识别与位置相关的位置文字,在耳机或外放环境下渲染3D 沉浸感,做到了与位置文字对应的空间感,使得用户获取最优的音效体验。
可选地,在第一方面的一种可能的实现方式中,上述步骤中的参考信息包括参考位置信息,当参考位置信息包括第一位置信息时,基于参考信息确定第一发声对象的第一声源位置前,方法还包括:获取第一位置信息,第一位置信息包括传感器的第一姿态角以及传感器与播放设备之间的距离;基于参考信息确定第一发声对象的第一声源位置,包括:将第一位置信息转化为第一声源位置。
该种可能的实现方式中,用户可以通过传感器提供的朝向(即第一姿态角)对选定的发声对象进行实时或者后期的动态渲染。此时传感器就类似一个激光笔,激光的指向就是声源位置。这样控制可以赋予发声对象具体的空间方位以及运动,实现用户与音频进行交互创作,提供用户新的体验。
可选地,在第一方面的一种可能的实现方式中,上述步骤中的参考信息包括参考位置信息,当参考位置信息包括第一位置信息时,基于参考信息确定第一发声对象的第一声源位置前,方法还包括:获取第一位置信息,第一位置信息包括传感器的第二姿态角以及传感器的加速度;基于参考信息确定第一发声对象的第一声源位置,包括:将第一位置信息转化为第一声源位置。
该种可能的实现方式中,用户可以通过传感器的实际位置信息作为声源位置来控制该发声对象对象并进行实时或者后期的动态渲染,发声对象的行动轨迹就可以简单地完全由用户控制,大大增加编辑的灵活性。
可选地,在第一方面的一种可能的实现方式中,上述步骤中的参考信息包括参考位置信息,当参考位置信息包括第二位置信息时,基于参考信息确定第一发声对象的第一声源位置前,方法还包括:提供球视图供用户选择,球视图的圆心为用户所在的位置,球视图的半径为用户的位置与播放设备的距离;获取用户在球视图中选择的第二位置信息;基于参考信息确定第一发声对象的第一声源位置,包括:将第二位置信息转化为第一声源位置。
该种可能的实现方式中,用户可以通过球视图选择第二位置信息(例如点击、拖拽、滑动等操作)来控制选定的发声对象并进行实时或者后期的动态渲染,赋予其具体的空间方位以及运动,可以实现用户与音频进行交互创作,提供用户新的体验。此外,还可以在用户没有任何传感器的情况下对发声对象进行声像的编辑。
可选地,在第一方面的一种可能的实现方式中,上述步骤:基于多媒体文件获取第一单对象音轨,包括:从多媒体文件中的原始音轨中分离出第一单对象音轨,原始音轨至少由第一单对象音轨以及第二单对象音轨合成获取,第二单对象音轨与第二发声对象对应。
该种可能的实现方式中,在原始音轨至少由第一单对象音轨以及第二单对象音轨合成的情况下,通过分离出第一单对象音轨,可以对音轨中的特定发声对象建空间渲染,增强用户对音频的编辑能力,可应用于音乐,影视作品的对象制作。增加用户对音乐的操控性与可玩性。
可选地,在第一方面的一种可能的实现方式中,上述步骤:从多媒体文件中的原始音轨中分离出第一单对象音轨,包括:通过训练好的分离网络从原始音轨中分离出第一单对象音轨。
该种可能的实现方式中,在原始音轨至少由第一单对象音轨以及第二单对象音轨合成的 情况下,通过分离网络分离出第一单对象音轨,可以对原始音轨中的特定发声对象进行空间渲染,增强用户对音频的编辑能力,可应用于音乐,影视作品的对象制作。增加用户对音乐的操控性与可玩性。
可选地,在第一方面的一种可能的实现方式中,上述步骤中的训练好的分离网络是通过以训练数据作为分离网络的输入,以损失函数的值小于第一阈值为目标对分离网络进行训练获取,训练数据包括训练音轨,训练音轨至少由初始第三单对象音轨以及初始第四单对象音轨合成获取,初始第三单对象音轨与第三发声对象对应,初始第四单对象音轨与第四发声对象对应,第三发声对象与第一发声对象的属于相同类型,第二发声对象与第四发声对象的属于相同类型,分离网络的输出包括分离获取的第三单对象音轨;损失函数用于指示分离获取的第三单对象音轨与初始第三单对象音轨之间的差异。
该种可能的实现方式中,以减小损失函数的值为目标对分离网络进行训练,也就是不断缩小分离网络输出第三单对象音轨与初始第三单对象音轨之间的差异。从而使得分离网络分离出来的单对象音轨更加准确。
可选地,在第一方面的一种可能的实现方式中,上述步骤基于第一声源位置以及播放设备的类型对第一单对象音轨进行空间渲染,包括:若播放设备为耳机,通过如下公式获取渲染后的第一单对象音轨;
Figure PCTCN2022087353-appb-000001
其中,
Figure PCTCN2022087353-appb-000002
为渲染后的第一单对象音轨,S为多媒体文件的发声对象且包括第一发声对象,i指示左声道或右声道,a s(t)为t时刻下第一发声对象的调节系数,h i,s(t)为t时刻下的第一发声对象对应的左声道或右声道的头相关传输函数HRTF滤波器系数,HRTF滤波器系数与第一声源位置相关,o s(t)为t时刻下的第一单对象音轨,τ为积分项。
该种可能的实现方式中,在播放设备为耳机时,解决如何获取渲染后的第一单对象音轨的技术问题。
可选地,在第一方面的一种可能的实现方式中,上述步骤基于第一声源位置以及播放设备的类型对第一单对象音轨进行空间渲染,包括:若播放设备为N个外放设备,通过如下公式获取渲染后的第一单对象音轨;
Figure PCTCN2022087353-appb-000003
其中,
Figure PCTCN2022087353-appb-000004
其中,
Figure PCTCN2022087353-appb-000005
其中,
Figure PCTCN2022087353-appb-000006
为渲染后的第一单对象音轨,i指示多声道中的第i个声道,S为多媒体文件的发声对象且包括第一发声对象,a s(t)为t时刻下第一发声对象的调节系数,g s(t)代表t时刻下第一发声对象的平移系数,o s(t)为t时刻下的第一单对象音轨,λ i为校准器校准第i个外放设备所获取的方位角,Φ i为校准器校准第i个外放设备所获取的倾斜角,r i为第i个外放设备与校准器的距离,N为正整数,i为正整数且i≤N,第一声源位置在N个外放设备构成的四面体内。
该种可能的实现方式中,在播放设备为外放设备时,解决如何获取渲染后的第一单对象音轨的技术问题。
可选地,在第一方面的一种可能的实现方式中,上述步骤还包括:基于渲染后的第一单对象音轨、多媒体文件中的原始音轨以及播放设备的类型,获取目标音轨;向播放设备发送目标音轨,播放设备用于播放目标音轨。
该种可能的实现方式中,可以获取目标音轨,便于对渲染后的音轨进行保存,便于后续播放,减少重复渲染操作。
可选地,在第一方面的一种可能的实现方式中,上述步骤基于渲染后的第一单对象音轨、多媒体文件中的原始音轨以及播放设备的类型,获取目标音轨包括:若播放设备的类型为耳机,通过如下公式获取目标音轨:
Figure PCTCN2022087353-appb-000007
其中,i指示左声道或右声道,
Figure PCTCN2022087353-appb-000008
为t时刻下的目标音轨,X i(t)为t时刻下的原始音轨,
Figure PCTCN2022087353-appb-000009
为t时刻下未被渲染的第一单对象音轨,
Figure PCTCN2022087353-appb-000010
为渲染后的第一单对象音轨,a s(t)为t时刻下第一发声对象的调节系数,h i,s(t)为t时刻下的第一发声对象对应的左声道或右声道头相关传输函数HRTF滤波器系数,HRTF滤波器系数与第一声源位置相关,o s(t)为t时刻下的第一单对象音轨,τ为积分项,S 1为原始音轨中需要被替换的发声对象,若第一发声对象是替换原始音轨中的发声对象,则S 1为空集;S 2为目标音轨相较于原始音轨增加的发声对象,若第一发声对象是复制的原始音轨中的发声对象,则S 2为空集;S 1和/或S 2为多媒体文件的发声对象且包括第一发声对象。
该种可能的实现方式中,在播放设备为耳机时,解决如何获取目标音轨的技术问题。便于对渲染后的音轨进行保存,便于后续播放,减少重复渲染操作。
可选地,在第一方面的一种可能的实现方式中,上述步骤基于渲染后的第一单对象音轨、多媒体文件中的原始音轨以及播放设备的类型,获取目标音轨,包括:若播放设备的类型为 N个外放设备,通过如下公式获取目标音轨:
Figure PCTCN2022087353-appb-000011
其中,
Figure PCTCN2022087353-appb-000012
Figure PCTCN2022087353-appb-000013
其中,i指示多声道中的第i个声道,
Figure PCTCN2022087353-appb-000014
为t时刻下的目标音轨,X i(t)为t时刻下的原始音轨,
Figure PCTCN2022087353-appb-000015
为t时刻下未被渲染的第一单音轨,
Figure PCTCN2022087353-appb-000016
为渲染后的第一单对象音轨,a s(t)为t时刻下第一发声对象的调节系数,g s(t)代表t时刻下第一发声对象的平移系数,g i,s(t)代表g s(t)中的第i行,o s(t)为t时刻下的第一单对象音轨,S 1为原始音轨中需要被替换的发声对象,若第一发声对象是替换原始音轨中的发声对象,则S 1为空集;S 2为目标音轨相较于原始音轨增加的发声对象,若第一发声对象是复制原始音轨中的发声对象,则S 2为空集;S 1和/或S 2为多媒体文件的发声对象且包括第一发声对象,λ i为校准器校准第i个外放设备所获取的方位角,Φ i为校准器校准第i个外放设备所获取倾斜角,r i为第i个外放设备与校准器的距离,N为正整数,i为正整数且i≤N,第一声源位置在N个外放设备构成的四面体内。
该种可能的实现方式中,在播放设备为外放设备时,解决如何获取目标音轨的技术问题。便于对渲染后的音轨进行保存,便于后续播放,减少重复渲染操作。
可选地,在第一方面的一种可能的实现方式中,上述步骤中的音乐特征包括:音乐结构、音乐情感和歌唱模式中的至少一种。
可选地,在第一方面的一种可能的实现方式中,上述步骤还包括:从多媒体文件中分离出第二单对象音轨;基于参考信息确定第二发声对象的第二声源位置,基于第二声源位置对第二单对象音轨进行空间渲染,以获取渲染后的第二单对象音轨。
该种可能的实现方式中,可以从多媒体文件中分离出至少两个单对象音轨,并进行相应的空间渲染,增强用户对音频中特定发声对象的编辑能力,可应用于音乐,影视作品的对象 制作。增加用户对音乐的操控性与可玩性。
本申请实施例第二方面提供了一种渲染方法,该方法可以由渲染设备执行,可以应用于音乐、影视作品制作等场景,也可以由渲染设备的部件(例如处理器、芯片、或芯片***等)执行。该方法包括:获取多媒体文件;基于多媒体文件获取第一单对象音轨,第一单对象音轨与第一发声对象对应;显示用户界面,用户界面包括渲染方式选项;响应用户在用户界面的第一操作,从渲染方式选项中确定自动渲染方式或互动渲染方式;当确定的是自动渲染方式时,基于预设方式获取渲染后的第一单对象音轨;或当确定的是互动渲染方式时,响应于用户的第二操作以获得参考位置信息;基于参考位置信息确定第一发声对象的第一声源位置;基于第一声源位置对第一单对象音轨进行渲染,以获取渲染后的第一单对象音轨。
本申请实施例中,渲染设备根据用户的第一操作,从渲染方式选项中确定自动渲染方式或互动渲染方式。一方面,渲染设备可以基于用户的第一操作自动获取渲染后的第一单对象音轨。另一方面,可以通过渲染设备与用户之间的交互,实现多媒体文件中第一发声对象对应的音轨的空间渲染,为用户提供身临其境的立体音效。
可选地,在第二方面的一种可能的实现方式中,上述步骤中的预设方式包括:获取多媒体文件的媒体信息;基于媒体信息确定第一发声对象的第一声源位置;基于第一声源位置对第一单对象音轨进行渲染,以获取渲染后的第一单对象音轨。
该种可能的实现方式中,
可选地,在第二方面的一种可能的实现方式中,上述步骤中的媒体信息包括:多媒体文件中需要显示的文字、多媒体文件中需要显示的图像、多媒体文件中需要播放的音乐的音乐特征以及第一发声对象对应的声源类型中的至少一种。
该种可能的实现方式中,渲染设备通过与用户的交互,确定待处理的多媒体文件,增加用户对多媒体文件中音乐的操控性与可玩性。
可选地,在第二方面的一种可能的实现方式中,上述步骤中的参考位置信息包括传感器的第一位置信息或用户选择的第二位置信息。
该种可能的实现方式中,在对音轨进行空间渲染时,通过用户的操作确定播放设备的类型。不同的播放设备类型可以对应不同的空间渲染公式,实现后期再播放设备播放渲染后的音轨的空间效果更加真实和准确。
可选地,在第二方面的一种可能的实现方式中,上述步骤:当媒体信息包括图像且图像包括第一发声对象时,基于媒体信息确定第一发声对象的第一声源位置,包括:呈现图像;确定图像内第一发声对象的第三位置信息,第三位置信息包括第一发声对象在图像内的二维坐标以及深度;基于第三位置信息获取第一声源位置。
该种可能的实现方式中,渲染设备可以自动呈现图像并确定图像中的发声对象,获取该发声对象的第三位置信息,进而获取第一声源位置。该种方式下渲染设备可以自动识别多媒体文件,当多媒体文件包括图像,且图像包括第一发声对象时,渲染设备可以自动获取渲染后的第一单对象音轨。自动提取发声对象的坐标以及单对象音轨后,在耳机或外放环境下渲染3D沉浸感,做到了真正的音随画动,使得用户获取最优的音效体验。
可选地,在第二方面的一种可能的实现方式中,上述步骤:确定图像内第一发声对象的第三位置信息,包括:响应用户对图像的第三操作,确定第一发声对象的第三位置信息。
该种可能的实现方式中,用户可以在呈现的图像从多个发声对象中选择第一发声对象,即可以选择渲染的第一发声对象对应的第一单对象音轨。根据用户的操作提取发声对象的坐标以及单对象音轨,在耳机或外放环境下渲染3D沉浸感,做到了真正的音随画动,使得用户获取最优的音效体验。
可选地,在第二方面的一种可能的实现方式中,当上述的媒体信息包括多媒体文件中需要播放的音乐的音乐特征时,基于媒体信息确定第一发声对象的第一声源位置,包括:识别音乐特征;基于关联关系与音乐特征确定第一声源位置,关联关系用于表示音乐特征与第一声源位置的关联。
该种可能的实现方式中,根据音乐的音乐特征来对提取出来的特定发声对象进行方位和动态的设置,使得3D渲染更加自然,艺术性也有更好地体现。
可选地,在第二方面的一种可能的实现方式中,当上述的媒体信息包括多媒体文件中需要显示的文字且文字包含有与位置相关的位置文字时,基于媒体信息确定第一发声对象的第一声源位置,包括:识别位置文字;基于位置文字确定第一声源位置。
该种可能的实现方式中,通过识别与位置相关的位置文字,在耳机或外放环境下渲染3D沉浸感,做到了与位置文字对应的空间感,使得用户获取最优的音效体验。
可选地,在第二方面的一种可能的实现方式中,当上述的参考位置信息包括第一位置信息时,响应于用户的第二操作以获得参考位置信息,包括:响应用户对传感器的第二操作,获取第一位置信息,第一位置信息包括传感器的第一姿态角以及传感器与播放设备之间的距离;基于参考位置信息确定第一发声对象的第一声源位置,包括:将第一位置信息转化为第一声源位置。
该种可能的实现方式中,用户可以通过传感器提供的朝向(即第一姿态角)对选定的发声对象进行实时或者后期的动态渲染。此时传感器就类似一个激光笔,激光的指向就是声源位置。这样控制可以赋予发声对象具体的空间方位以及运动,实现用户与音频进行交互创作,提供用户新的体验。
可选地,在第二方面的一种可能的实现方式中,当上述的参考位置信息包括第一位置信息时,响应于用户的第二操作以获得参考位置信息,包括:响应用户对传感器的第二操作,获取第一位置信息,第一位置信息包括传感器的第二姿态角以及传感器的加速度;基于参考位置信息确定第一发声对象的第一声源位置,包括:将第一位置信息转化为第一声源位置。
该种可能的实现方式中,用户可以通过传感器的实际位置信息作为声源位置来控制该发声对象对象并进行实时或者后期的动态渲染,发声对象的行动轨迹就可以简单地完全由用户控制,大大增加编辑的灵活性。
可选地,在第二方面的一种可能的实现方式中,当上述的参考位置信息包括第二位置信息时,响应于用户的第二操作以获得参考位置信息,包括:呈现球视图,球视图的圆心为用户所在的位置,球视图的半径为用户的位置与播放设备的距离;响应用户的第二操作,在球视图中确定第二位置信息;基于参考位置信息确定第一发声对象的第一声源位置,包括:将第二位置信息转化为第一声源位置。
该种可能的实现方式中,用户可以通过球视图选择第二位置信息(例如点击、拖拽、滑动等操作)来控制选定的发声对象并进行实时或者后期的动态渲染,赋予其具体的空间方位 以及运动,可以实现用户与音频进行交互创作,提供用户新的体验。此外,还可以在用户没有任何传感器的情况下对发声对象进行声像的编辑。
可选地,在第二方面的一种可能的实现方式中,上述步骤:获取多媒体文件,包括:响应用户的第四操作,从存储的至少一个多媒体文件中确定多媒体文件。
该种可能的实现方式中,可以基于用户的选择,从存储的至少一个多媒体文件中确定多媒体文件,进而实现对用户选择的多媒体文件中第一发声对象对应的第一单对象音轨渲染制作,提升用户体验。
可选地,在第二方面的一种可能的实现方式中,上述的用户界面还包括播放设备类型选项;方法还包括:响应用户的第五操作,从播放设备类型选项中确定播放设备的类型;基于第一声源位置对第一单对象音轨进行渲染,以获取渲染后的第一单对象音轨,包括:基于第一声源位置以及类型对第一单对象音轨进行渲染,以获取渲染后的第一单对象音轨。
该种可能的实现方式中,根据用户使用播放设备的类型,选择适合用户正在使用播放设备的渲染方式,进而提升播放设备的渲染效果,使得3D渲染更加自然。
可选地,在第二方面的一种可能的实现方式中,上述步骤:基于多媒体文件获取第一单对象音轨,包括:从多媒体文件中的原始音轨中分离出第一单对象音轨,原始音轨至少由第一单对象音轨以及第二单对象音轨合成获取,第二单对象音轨与第二发声对象对应。
该种可能的实现方式中,在原始音轨至少由第一单对象音轨以及第二单对象音轨合成的情况下,通过分离出第一单对象音轨,可以对音轨中的特定发声对象建空间渲染,增强用户对音频的编辑能力,可应用于音乐,影视作品的对象制作。增加用户对音乐的操控性与可玩性。
该种可能的实现方式中,可以从多媒体文件中分离出第一单对象音轨,实现用于对多媒体文件中特定发声对象对应的单对象音轨进行渲染,提升用户对音频的创作,提升用户体验。
该种可能的实现方式中,用户可以通过传感器提供的朝向或位置对选定的发声对象进行实时或者后期的动态渲染。这样控制可以赋予发声对象具体的空间方位以及运动,实现用户与音频进行交互创作,提供用户新的体验。
该种可能的实现方式中,用户可以通过传感器提供的朝向(即第一姿态角)对选定的发声对象进行实时或者后期的动态渲染。此时传感器就类似一个激光笔,激光的指向就是声源位置。这样控制可以赋予发声对象具体的空间方位以及运动,实现用户与音频进行交互创作,提供用户新的体验。
该种可能的实现方式中,通过传感器的实际位置信息作为声源位置来控制该发声对象对象并进行实时或者后期的动态渲染,发声对象的行动轨迹就可以简单地完全由用户控制,大大增加编辑的灵活性。
该种可能的实现方式中,用户可以通过界面拖拽的方法来来控制选定的发声对象并进行实时或者后期的动态渲染,赋予其具体的空间方位以及运动,可以实现用户与音频进行交互创作,提供用户新的体验。此外,还可以在用户没有任何传感器的情况下对发声对象进行声像的编辑。
该种可能的实现方式中,渲染设备可以根据音乐的音乐特征来对提取出来的特定发声对象进行方位和动态的设置,使得对该发声对象对应的音轨在3D渲染中更加自然,艺术性也有 更好地体现。
该种可能的实现方式中,在耳机或外放环境下渲染3D沉浸感,做到了真正的音随画动,使得用户获取最优的音效体验。
该种可能的实现方式中,渲染设备确定发声对象后可以自动追踪视频中的发声对象,并对整段视频里发声对象对应的音轨进行渲染,也可以应用在专业的混音后期制作中,提升混音师的工作效率。
该种可能的实现方式中,渲染设备可以根据用户的第四操作确定图像内的发声对象,并对图像中的发声对象进行跟踪,对该发声对象对应的音轨进行渲染。这样控制可以赋予发声对象具体的空间方位以及运动,实现用户与音频进行交互创作,提供用户新的体验。
可选地,在第二方面的一种可能的实现方式中,上述步骤中的音乐特征包括:音乐结构、音乐情感和歌唱模式中的至少一种。
可选地,在第二方面的一种可能的实现方式中,上述步骤还包括:从原始音轨中分离出第二单对象音轨;基于参考信息确定第二发声对象的第二声源位置,基于第二声源位置对第二单对象音轨进行空间渲染,以获取渲染后的第二单对象音轨。
该种可能的实现方式中,可以从原始音轨中分离出至少两个单对象音轨,并进行相应的空间渲染,增强用户对音频中特定发声对象的编辑能力,可应用于音乐,影视作品的对象制作。增加用户对音乐的操控性与可玩性。
本申请第三方面提供一种渲染设备,该渲染设备可以应用于音乐、影视作品制作等场景,该渲染设备包括:
获取单元,用于基于多媒体文件获取第一单对象音轨,第一单对象音轨与第一发声对象对应;
确定单元,用于基于参考信息确定第一发声对象的第一声源位置,参考信息包括参考位置信息和/或多媒体文件的媒体信息,参考位置信息用于指示第一声源位置;
渲染单元,用于基于第一声源位置对第一单对象音轨进行空间渲染,以获取渲染后的第一单对象音轨。
可选地,在第三方面的一种可能的实现方式中,上述的媒体信息包括:多媒体文件中需要显示的文字、多媒体文件中需要显示的图像、多媒体文件中需要播放的音乐的音乐特征以及第一发声对象对应的声源类型中的至少一种。
可选地,在第三方面的一种可能的实现方式中,上述的参考位置信息包括传感器的第一位置信息或用户选择的第二位置信息。
可选地,在第三方面的一种可能的实现方式中,上述的确定单元,还用于确定播放设备的类型,播放设备用于播放目标音轨,目标音轨根据渲染后的第一单对象音轨获取;渲染单元,具体用于基于第一声源位置以及播放设备的类型对第一单对象音轨进行空间渲染。
可选地,在第三方面的一种可能的实现方式中,上述的参考信息包括媒体信息,当媒体信息包括图像且图像包括第一发声对象时,确定单元,具体用于确定图像内第一发声对象的第三位置信息,第三位置信息包括第一发声对象在图像内的二维坐标以及深度;确定单元,具体用于基于第三位置信息获取第一声源位置。
可选地,在第三方面的一种可能的实现方式中,上述的参考信息包括媒体信息,当媒体 信息包括多媒体文件中需要播放的音乐的音乐特征时,确定单元,具体用于基于关联关系与音乐特征确定第一声源位置,关联关系用于表示音乐特征与第一声源位置的关联。
可选地,在第三方面的一种可能的实现方式中,上述的媒体信息包括媒体信息,当媒体信息包括多媒体文件中需要显示的文字且文字包含有与位置相关的位置文字时,确定单元,具体用于识别位置文字;确定单元,具体用于基于位置文字确定第一声源位置。
可选地,在第三方面的一种可能的实现方式中,上述的参考信息包括参考位置信息,当参考位置信息包括第一位置信息时,获取单元,还用于获取第一位置信息,第一位置信息包括传感器的第一姿态角以及传感器与播放设备之间的距离;确定单元,具体用于将第一位置信息转化为第一声源位置。
可选地,在第三方面的一种可能的实现方式中,上述的参考信息包括参考位置信息,当参考位置信息包括第一位置信息时,获取单元,还用于获取第一位置信息,第一位置信息包括传感器的第二姿态角以及传感器的加速度;确定单元,具体用于将第一位置信息转化为第一声源位置。
可选地,在第三方面的一种可能的实现方式中,上述的参考信息包括参考位置信息,当参考位置信息包括第二位置信息时,渲染设备还包括:提供单元,用于提供球视图供用户选择,球视图的圆心为用户所在的位置,球视图的半径为用户的位置与播放设备的距离;获取单元,还用于获取用户在球视图中选择的第二位置信息;确定单元,具体用于将第二位置信息转化为第一声源位置。
可选地,在第三方面的一种可能的实现方式中,上述的获取单元,具体用于从多媒体文件中的原始音轨中分离出第一单对象音轨,原始音轨至少由第一单对象音轨以及第二单对象音轨合成获取,第二单对象音轨与第二发声对象对应。
可选地,在第三方面的一种可能的实现方式中,上述的获取单元,具体用于通过训练好的分离网络从原始音轨中分离出第一单对象音轨。
可选地,在第三方面的一种可能的实现方式中,上述的训练好的分离网络是通过以训练数据作为分离网络的输入,以损失函数的值小于第一阈值为目标对分离网络进行训练获取,训练数据包括训练音轨,训练音轨至少由初始第三单对象音轨以及初始第四单对象音轨合成获取,初始第三单对象音轨与第三发声对象对应,初始第四单对象音轨与第四发声对象对应,第三发声对象与第一发声对象的属于相同类型,第二发声对象与第四发声对象的属于相同类型,分离网络的输出包括分离获取的第三单对象音轨;损失函数用于指示分离获取的第三单对象音轨与初始第三单对象音轨之间的差异。
可选地,在第三方面的一种可能的实现方式中,若上述的播放设备为耳机,获取单元,具体用于通过如下公式获取渲染后的第一单对象音轨;
Figure PCTCN2022087353-appb-000017
其中,
Figure PCTCN2022087353-appb-000018
为渲染后的第一单对象音轨,S为多媒体文件的发声对象且包括第一发声对象,i指示左声道或右声道,a s(t)为t时刻下第一发声对象的调节系数,h i,s(t)为t时刻下第一发声对象对应的左声道或右声道的头相关传输函数HRTF滤 波器系数,HRTF滤波器系数与第一声源位置相关,o s(t)为t时刻下的第一单对象音轨,τ为积分项。
可选地,在第三方面的一种可能的实现方式中,若上述的播放设备为N个外放设备,获取单元,具体用于通过如下公式获取渲染后的第一单对象音轨;
Figure PCTCN2022087353-appb-000019
其中,
Figure PCTCN2022087353-appb-000020
其中,
Figure PCTCN2022087353-appb-000021
其中,
Figure PCTCN2022087353-appb-000022
为渲染后的第一单对象音轨,i指示多声道中的第i个声道,S为多媒体文件的发声对象且包括第一发声对象,a s(t)为t时刻下第一发声对象的调节系数,g s(t)代表t时刻下第一发声对象的平移系数,o s(t)为t时刻下的第一单对象音轨,λ i为校准器校准第i个外放设备所获取的方位角,Φ i为校准器校准第i个外放设备所获取的倾斜角,r i为第i个外放设备与校准器的距离,N为正整数,i为正整数且i≤N,第一声源位置在N个外放设备构成的四面体内。
可选地,在第三方面的一种可能的实现方式中,上述的获取单元,还用于基于渲染后的第一单对象音轨以及多媒体文件中的原始音轨,获取目标音轨;渲染设备还包括:发送单元,用于向播放设备发送目标音轨,播放设备用于播放目标音轨。
可选地,在第三方面的一种可能的实现方式中,若上述的播放设备为耳机,获取单元,具体用于通过如下公式获取目标音轨:
Figure PCTCN2022087353-appb-000023
其中,i指示左声道或右声道,
Figure PCTCN2022087353-appb-000024
为t时刻下的目标音轨,X i(t)为t时刻下的原始音轨,
Figure PCTCN2022087353-appb-000025
为t时刻下未被渲染的第一单对象音轨,
Figure PCTCN2022087353-appb-000026
为渲染后的第一单对象音轨,a s(t)为t时刻下第一发声对象的调节系数,h i,s(t)为t时刻下第一发声对象对应的左声道或右声道的头相关传输函数HRTF滤波器系数,HRTF滤波器系数与第一声源位置相关,o s(t)为t时刻下的第一单对象音轨,τ为积分项,S 1为原始音轨中 需要被替换的发声对象,若第一发声对象是替换原始音轨中的发声对象,则S 1为空集;S 2为目标音轨相较于原始音轨增加的发声对象,若第一发声对象是复制原始音轨中的发声对象,则S 2为空集;S 1和/或S 2为多媒体文件的发声对象且包括第一发声对象。
可选地,在第三方面的一种可能的实现方式中,若上述的播放设备为N个外放设备;获取单元,具体用于通过如下公式获取目标音轨:
Figure PCTCN2022087353-appb-000027
其中,
Figure PCTCN2022087353-appb-000028
Figure PCTCN2022087353-appb-000029
其中,i指示多声道中的第i个声道,
Figure PCTCN2022087353-appb-000030
为t时刻下的目标音轨,X i(t)为t时刻下的原始音轨,
Figure PCTCN2022087353-appb-000031
为t时刻下未被渲染的第一单音轨,
Figure PCTCN2022087353-appb-000032
为渲染后的第一单对象音轨,a s(t)为t时刻下第一发声对象的调节系数,g s(t)代表t时刻下第一发声对象的平移系数,g i,s(t)代表g s(t)中的第i行,o s(t)为t时刻下的第一单对象音轨,S 1为原始音轨中需要被替换的发声对象,若第一发声对象是替换原始音轨中的发声对象,则S 1为空集;S 2为目标音轨相较于原始音轨增加的发声对象,若第一发声对象是复制原始音轨中的发声对象,则S 2为空集;S 1和/或S 2为多媒体文件的发声对象且包括第一发声对象,λ i为校准器校准第i个外放设备所获取的方位角,Φ i为校准器校准第i个外放设备所获取的倾斜角,r i为第i个外放设备与校准器的距离,N为正整数,i为正整数且i≤N,第一声源位置在N个外放设备构成的四面体内。
本申请第四方面提供一种渲染设备,该渲染设备可以应用于音乐、影视作品制作等场景,该渲染设备包括:
获取单元,用于获取多媒体文件;
获取单元,还用于基于多媒体文件获取第一单对象音轨,第一单对象音轨与第一发声对象对应;
显示单元,用于显示用户界面,用户界面包括渲染方式选项;
确定单元,用于响应用户在用户界面的第一操作,从渲染方式选项中确定自动渲染方式或互动渲染方式;
获取单元,还用于当确定单元确定的是自动渲染方式时,基于预设方式获取渲染后的第一单对象音轨;或
获取单元,还用于当确定单元确定的是互动渲染方式时,响应于用户的第二操作以获得参考位置信息;基于参考位置信息确定第一发声对象的第一声源位置;基于第一声源位置对第一单对象音轨进行渲染,以获取渲染后的第一单对象音轨。
可选地,在第四方面的一种可能的实现方式中,上述的预设方式包括:获取单元,还用于获取多媒体文件的媒体信息;确定单元,还用于基于媒体信息确定第一发声对象的第一声源位置;获取单元,还用于基于第一声源位置对第一单对象音轨进行渲染,以获取渲染后的第一单对象音轨。
可选地,在第四方面的一种可能的实现方式中,上述的媒体信息包括:多媒体文件中需要显示的文字、多媒体文件中需要显示的图像、多媒体文件中需要播放的音乐的音乐特征以及第一发声对象对应的声源类型中的至少一种。
可选地,在第四方面的一种可能的实现方式中,上述的参考位置信息包括传感器的第一位置信息或用户选择的第二位置信息。
可选地,在第四方面的一种可能的实现方式中,当上述的媒体信息包括图像且图像包括第一发声对象时,确定单元,具体用于呈现图像;确定单元,具体用于确定图像内第一发声对象的第三位置信息,第三位置信息包括第一发声对象在图像内的二维坐标以及深度;确定单元,具体用于基于第三位置信息获取第一声源位置。
可选地,在第四方面的一种可能的实现方式中,上述的确定单元,具体用于响应用户对图像的第三操作,确定第一发声对象的第三位置信息。
可选地,在第四方面的一种可能的实现方式中,当上述的媒体信息包括多媒体文件中需要播放的音乐的音乐特征时,确定单元,具体用于识别音乐特征;
确定单元,具体用于基于关联关系与音乐特征确定第一声源位置,关联关系用于表示音乐特征与第一声源位置的关联。
可选地,在第四方面的一种可能的实现方式中,当上述的媒体信息包括多媒体文件中需要显示的文字且文字包含有与位置相关的位置文字时,确定单元,具体用于识别位置文字;确定单元,具体用于基于位置文字确定第一声源位置。
可选地,在第四方面的一种可能的实现方式中,当上述的参考位置信息包括第一位置信息时,确定单元,具体用于响应用户对传感器的第二操作,获取第一位置信息,第一位置信息包括传感器的第一姿态角以及传感器与播放设备之间的距离;确定单元,具体用于将第一位置信息转化为第一声源位置。
可选地,在第四方面的一种可能的实现方式中,当上述的参考位置信息包括第一位置信息时,确定单元,具体用于响应用户对传感器的第二操作,获取第一位置信息,第一位置信息包括传感器的第二姿态角以及传感器的加速度;确定单元,具体用于将第一位置信息转化为第一声源位置。
可选地,在第四方面的一种可能的实现方式中,当上述的参考位置信息包括第二位置信息时,确定单元,具体用于呈现球视图,球视图的圆心为用户所在的位置,球视图的半径为用户的位置与播放设备的距离;确定单元,具体用于响应用户的第二操作,在球视图中确定第二位置信息;确定单元,具体用于将第二位置信息转化为第一声源位置。
可选地,在第四方面的一种可能的实现方式中,上述的获取单元,具体用于响应用户的第四操作,从存储的至少一个多媒体文件中确定多媒体文件。
可选地,在第四方面的一种可能的实现方式中,上述的用户界面还包括播放设备类型选项;确定单元,还用于响应用户的第五操作,从播放设备类型选项中确定播放设备的类型;获取单元,具体用于基于第一声源位置以及类型对第一单对象音轨进行渲染,以获取渲染后的第一单对象音轨。
可选地,在第四方面的一种可能的实现方式中,上述的获取单元,具体用于从多媒体文件中的原始音轨中分离出第一单对象音轨,原始音轨至少由第一单对象音轨以及第二单对象音轨合成获取,第二单对象音轨与第二发声对象对应。
可选地,在第四方面的一种可能的实现方式中,上述的音乐特征包括:音乐结构、音乐情感和歌唱模式中的至少一种。
可选地,在第四方面的一种可能的实现方式中,上述的获取单元还用于从多媒体文件中分离出第二单对象音轨;确定第二发声对象的第二声源位置,基于第二声源位置对第二单对象音轨进行空间渲染,以获取渲染后的第二单对象音轨。
本申请第五方面提供了一种渲染设备,该渲染设备执行前述第一方面或第一方面的任意可能的实现方式中的方法,或执行前述第二方面或第二方面的任意可能的实现方式中的方法。
本申请第六方面提供了一种渲染设备,包括:处理器,处理器与存储器耦合,存储器用于存储程序或指令,当程序或指令被处理器执行时,使得该渲染设备实现上述第一方面或第一方面的任意可能的实现方式中的方法,或者使得该渲染设备实现上述第二方面或第二方面的任意可能的实现方式中的方法。
本申请第七方面提供了一种计算机可读介质,其上存储有计算机程序或指令,当计算机程序或指令在计算机上运行时,使得计算机执行前述第一方面或第一方面的任意可能的实现方式中的方法,或者使得计算机执行前述第二方面或第二方面的任意可能的实现方式中的方法。
本申请第八方面提供了一种计算机程序产品,该计算机程序产品在计算机上执行时,使得计算机执行前述第一方面或第一方面的任意可能的实现方式、第二方面或第二方面的任意可能的实现方式中的方法。
其中,第三、第五、第六、第七、第八方面或者其中任一种可能实现方式所带来的技术效果可参见第一方面或第一方面不同可能实现方式所带来的技术效果,此处不再赘述。
其中,第四、第五、第六、第七、第八方面或者其中任一种可能实现方式所带来的技术效果可参见第二方面或第二方面不同可能实现方式所带来的技术效果,此处不再赘述。
从以上技术方案可以看出,本申请实施例具有以下优点:基于多媒体文件获取第一单对象音轨,第一单对象音轨与第一发声对象对应;基于参考信息确定第一发声对象的第一声源位置,基于第一声源位置对第一单对象音轨进行空间渲染,以获取渲染后的第一单对象音轨 音轨。可以提升多媒体文件中第一发声对象对应的第一单对象音轨的立体空间感,为用户提供身临其境的立体音效。
附图说明
为了更清楚地说明本发明实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动性的前提下,还可以根据这些附图获取其他的附图。
图1为本申请提供的一种***架构的结构示意图;
图2为本申请提供的一种卷积神经网络结构示意图;
图3为本申请提供的另一种卷积神经网络结构示意图;
图4为本申请提供的一种芯片硬件结构示意图;
图5为本申请提供的一种分离网络的训练方法的示意性流程图;
图6为本申请提供的一种分离网络的结构示意图;
图7为本申请提供的另一种分离网络的结构示意图;
图8为本申请提供的另一种***架构的结构示意图;
图9为本申请提供的一种应用场景的示意图;
图10为本申请提供的渲染方法的一个流程示意图;
图11为本申请提供的播放设备校准方法的一个流程示意图;
图12-图17为本申请提供的渲染设备显示界面的几种示意图;
图18为本申请提供的手机朝向示意图;
图19为本申请提供的渲染设备显示界面的另一种示意图;
图20为本申请提供的使用手机朝向确定声源位置的示意图;
图21-图47为本申请提供的渲染设备显示界面的另几种示意图;
图48为本申请提供的外放设备***在球坐标系下的一种结构示意图;
图49-图50为本申请提供的用户间共享渲染规则的几种示意图;
图51-图53为本申请提供的渲染设备显示界面的另几种示意图;
图54为本申请提供的猎音人游戏场景下用户交互的示意图;
图55-图57为本申请提供的多人交互场景下用户交互的几种示意图;
图58-图61为本申请提供的渲染设备的几种结构示意图;
图62为本申请提供的传感器设备的一种结构示意图。
具体实施方式
本申请实施例提供了一种渲染方法,可以提升多媒体文件中第一发声对象对应的第一单对象音轨的立体空间感,为用户提供身临其境的立体音效。
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获取的所有其他实施例,都属于本发明 保护的范围。
为了便于理解,下面先对本申请实施例主要涉及的相关术语和概念进行介绍。
1、神经网络。
神经网络可以是由神经单元组成的,神经单元可以是指以X s和截距1为输入的运算单元,该运算单元的输出可以为:
$h_{W,b}(x)=f(W^{T}x)=f\left(\sum_{s=1}^{n}W_{s}x_{s}+b\right)$
其中,s=1、2、……n,n为大于1的自然数,W s为X s的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入。激活函数可以是sigmoid函数。神经网络是将许多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
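示例性的,下面用一段Python代码对上述神经单元的运算过程作一个简要示意,其中权重、偏置的取值均为随意设定的示例,激活函数以sigmoid为例,并非对本申请的限定:

    import numpy as np

    def sigmoid(z):
        # 激活函数f,将神经单元中的输入信号转换为输出信号
        return 1.0 / (1.0 + np.exp(-z))

    def neuron_output(x, w, b):
        # x为输入X_s组成的向量,w为对应的权重W_s,b为神经单元的偏置
        return sigmoid(np.dot(w, x) + b)

    x = np.array([0.5, -1.2, 0.3])   # 示例输入
    w = np.array([0.8, 0.1, -0.4])   # 示例权重
    b = 1.0                          # 示例偏置
    print(neuron_output(x, w, b))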
2、深度神经网络。
深度神经网络(deep neural network,DNN),也称多层神经网络,可以理解为具有很多层隐含层的神经网络,这里的“很多”并没有特别的度量标准。从DNN按不同层的位置划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐含层。层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。当然,深度神经网络也可能不包括隐藏层,具体此处不做限定。
深度神经网络中的每一层的工作可以用数学表达式
$\vec{y}=\alpha(W\cdot\vec{x}+\vec{b})$
来描述:从物理层面深度神经网络中的每一层的工作可以理解为通过五种对输入空间(输入向量的集合)的操作,完成输入空间到输出空间的变换(即矩阵的行空间到列空间),这五种操作包括:1、升维/降维;2、放大/缩小;3、旋转;4、平移;5、“弯曲”。其中1、2、3的操作由
$W\cdot\vec{x}$
完成,4的操作由
$+\vec{b}$
完成,5的操作则由α()来实现。这里之所以用“空间”二字来表述是因为被分类的对象并不是单个事物,而是一类事物,空间是指这类事物所有个体的集合。其中,W是权重向量,该向量中的每一个值表示该层神经网络中的一个神经元的权重值。该向量W决定着上文所述的输入空间到输出空间的空间变换,即每一层的权重W控制着如何变换空间。训练深度神经网络的目的,也就是最终获取训练好的神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。因此,神经网络的训练过程本质上就是学习控制空间变换的方式,更具体的就是学习权重矩阵。
3、卷积神经网络。
卷积神经网络(convolutional neuron network,CNN)是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器。该特征抽取器可以看作是滤波器,卷积过程可以看作是使同一个可训练的滤波器与一个输入的图像或者卷积特征平面(feature map)做卷积。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元 层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。这其中隐含的原理是:图像的某一部分的统计信息与其他部分是一样的。即意味着在某一部分学习的图像信息也能用在另一部分上。所以对于图像上的所有位置,都能使用同样的学习获取的图像信息。在同一卷积层中,可以使用多个卷积核来提取不同的图像信息,一般地,卷积核数量越多,卷积操作反映的图像信息越丰富。
卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习获取合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。本申请实施例中的分离网络、识别网络、检测网络、深度估计网络等网络都可以是CNN。
4、循环神经网络(RNN)。
在传统的神经网络中模型中,层与层之间是全连接的,每层之间的节点是无连接的。但是这种普通的神经网络对于很多问题是无法解决的。比如,预测句子的下一个单词是什么,因为一个句子中前后单词并不是独立的,一般需要用到前面的单词。循环神经网络(RNN)指的是一个序列当前的输出与之前的输出也有关。具体的表现形式为网络会对前面的信息进行记忆,保存在网络的内部状态中,并应用于当前输出的计算中。
5、损失函数。
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到神经网络能够预测出真正想要的目标值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
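示例性的,以均方误差作为损失函数的一种常见取法,下面的Python片段示意了"比较预测值与目标值之间差异"的过程,其中的数值均为示例:

    import numpy as np

    def mse_loss(pred, target):
        # 均方误差:输出值(loss)越高,表示预测值与目标值的差异越大
        return np.mean((pred - target) ** 2)

    pred = np.array([0.9, 0.2, 0.4])    # 网络当前的预测值(示例)
    target = np.array([1.0, 0.0, 0.5])  # 真正想要的目标值(示例)
    print(mse_loss(pred, target))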
6、头相关传输函数。
头相关传输函数(head related transfer function,HRTF):声源发出的声波经头部、耳廓、躯干等散射后到达双耳,其中的物理过程可视为一个线性时不变的声滤波***,其特性可由HRTF描述,也就是说HRTF描述了声波从声源到双耳的传输过程。更形象的解释为:若声源发出的音频信号为X,该音频信号为X传输到预定位置后对应的音频信号为Y,则X*Z=Y(X卷积Z等于Y),其中,Z即为HRTF。
7、音轨。
音轨是记录音频数据的轨道,每条音轨具有一个或多个属性参数,所述属性参数包括音频格式、码率、配音语言、音效、通道数、音量等等。当音频数据为多音轨时,不同的两个音轨至少具有一个不同的属性参数,或者不同的两个音轨中至少一个属性参数具备不同的值。音轨可以是单音轨或多音轨(或称为混合音轨)。其中,单音轨可以与一个或多个发声对象对 应,多音轨包括至少两个单音轨。一般情况下,一个单对象音轨对应一个发声对象。
8、短时傅里叶变换。
短时傅里叶变换(short-time fourier transform,或short-term fourier transform,STFT)的核心思想:“加窗”,即把整个时域过程分解成无数个等长的小过程,每个小过程近似平稳,再对每个小过程进行快速傅里叶变换(fast fourier transform,FFT)。
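示例性的,下面的Python片段按"分帧—加窗—快速傅里叶变换"的思路给出一个简化的STFT实现,帧长、帧移等参数均为示例取值:

    import numpy as np

    def stft(x, frame_len=1024, hop=256):
        # 把整段时域信号分解成若干等长的小过程,逐帧加窗后做FFT
        window = np.hanning(frame_len)
        frames = []
        for start in range(0, len(x) - frame_len + 1, hop):
            frame = x[start:start + frame_len] * window
            frames.append(np.fft.rfft(frame))
        # 返回时频谱,形状为(帧数, 频点数)
        return np.array(frames)

    fs = 16000
    t = np.arange(fs) / fs
    x = np.sin(2 * np.pi * 440 * t)   # 1秒440Hz正弦作为示例信号
    spec = stft(x)
    print(spec.shape)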
下面介绍本申请实施例提供的系统架构。
参见附图1,本发明实施例提供了一种系统架构100。如所述系统架构100所示,数据采集设备160用于采集训练数据,本申请实施例中训练数据包括:多媒体文件,该多媒体文件包括原始音轨,该原始音轨与至少一个发声对象对应。并将训练数据存入数据库130,训练设备120基于数据库130中维护的训练数据训练获取目标模型/规则101。下面将以实施例一更详细地描述训练设备120如何基于训练数据获取目标模型/规则101,该目标模型/规则101能够用于实现本申请实施例提供的渲染方法,其中,该目标模型/规则101有多种情况。目标模型/规则101的一种情况(当该目标模型/规则101为第一模型时),将多媒体文件输入该目标模型/规则101,即可获取与第一发声对象对应的第一单对象音轨。目标模型/规则101的另一种情况(当该目标模型/规则101为第二模型时),将多媒体文件通过相关预处理后输入该目标模型/规则101,即可获取与第一发声对象对应的第一单对象音轨。本申请实施例中的目标模型/规则101具体可以包括分离网络,进一步的还可以包括识别网络、检测网络、深度估计网络等,具体此处不做限定。在本申请提供的实施例中,该分离网络是通过对训练数据进行训练获取的。需要说明的是,在实际的应用中,所述数据库130中维护的训练数据不一定都来自于数据采集设备160的采集,也有可能是从其他设备接收获取的。另外需要说明的是,训练设备120也不一定完全基于数据库130维护的训练数据进行目标模型/规则101的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。
根据训练设备120训练获取的目标模型/规则101可以应用于不同的***或设备中,如应用于图1所示的执行设备110,所述执行设备110可以是终端,如手机终端,平板电脑,笔记本电脑,增强现实(augmented reality,AR)/虚拟现实(virtual reality,VR),车载终端等,还可以是服务器或者云端等。在附图1中,执行设备110配置有I/O接口112,用于与外部设备进行数据交互,用户可以通过客户设备140向I/O接口112输入数据,所述输入数据在本申请实施例中可以包括:多媒体文件,可以是用户输入的,也可以是用户通过音频设备上传的,当然还可以来自数据库,具体此处不做限定。
预处理模块113用于根据I/O接口112接收到的多媒体文件进行预处理,在本申请实施例中,预处理模块113可以用于对多媒体文件中的音轨进行短时傅里叶变换处理,获取时频谱。
在执行设备110对输入数据进行预处理,或者在执行设备110的计算模块111执行计算等相关的处理过程中,执行设备110可以调用数据存储***150中的数据、代码等以用于相应的处理,也可以将相应处理获取的数据、指令等存入数据存储***150中。
最后,I/O接口112将处理结果,如上述获取的与第一发声对象对应的第一单对象音轨返回给客户设备140,从而提供给用户。
值得说明的是,训练设备120可以针对不同的目标或称不同的任务,基于不同的训练数据生成相应的目标模型/规则101,该相应的目标模型/规则101即可以用于实现上述目标或完成上述任务,从而为用户提供所需的结果。
在附图1中所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口112提供的界面进行操作。另一种情况下,客户设备140可以自动地向I/O接口112发送输入数据,如果要求客户设备140自动发送输入数据需要获取用户的授权,则用户可以在客户设备140中设置相应权限。用户可以在客户设备140查看执行设备110输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备140也可以作为数据采集端,采集如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果作为新的样本数据,并存入数据库130。当然,也可以不经过客户设备140进行采集,而是由I/O接口112直接将如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果,作为新的样本数据存入数据库130。
值得注意的是,附图1仅是本发明实施例提供的一种***架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在附图1中,数据存储***150相对执行设备110是外部存储器,在其它情况下,也可以将数据存储***150置于执行设备110中。
如图1所示,根据训练设备120训练获取目标模型/规则101,该目标模型/规则101在本申请实施例中可以是分离网络,具体的,在本申请实施例提供的网络中,分离网络可以是卷积神经网络或者循环神经网络。
由于CNN是一种非常常见的神经网络,下面结合图2重点对CNN的结构进行详细的介绍。如前文的基础概念介绍所述,卷积神经网络是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元可以对输入其中的图像作出响应。
如图2所示,卷积神经网络(CNN)100可以包括输入层110,卷积层/池化层120,其中池化层为可选的,以及神经网络层130。
卷积层/池化层120:
卷积层:
如图2所示卷积层/池化层120可以包括如示例121-126层,在一种实现中,121层为卷积层,122层为池化层,123层为卷积层,124层为池化层,125为卷积层,126为池化层;在另一种实现方式中,121、122为卷积层,123为池化层,124、125为卷积层,126为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。
以卷积层121为例,卷积层121可以包括很多个卷积算子,卷积算子也称为核,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义,在对图像进行卷积操作的过程中,权重矩阵通常在输入图像上沿着水平方向一个像素接着一个像素(或两个像素接着两个像素……这取决于步长stride的取值)的进行处理,从而完成从图像中提取特定特征的工作。该权重矩 阵的大小应该与图像的大小相关,需要注意的是,权重矩阵的纵深维度(depth dimension)和输入图像的纵深维度是相同的,在进行卷积运算的过程中,权重矩阵会延伸到输入图像的整个深度。因此,和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用维度相同的多个权重矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度。不同的权重矩阵可以用来提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色,又一个权重矩阵用来对图像中不需要的噪点进行模糊化……该多个权重矩阵维度相同,经过该多个维度相同的权重矩阵提取后的特征图维度也相同,再将提取到的多个维度相同的特征图合并形成卷积运算的输出。
这些权重矩阵中的权重值在实际应用中需要经过大量的训练获取,通过训练获取的权重值形成的各个权重矩阵可以从输入图像中提取信息,从而帮助卷积神经网络100进行正确的预测。
当卷积神经网络100有多个卷积层的时候,初始的卷积层(例如121)往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络100深度的加深,越往后的卷积层(例如126)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。
池化层:
由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,即如图2中120所示例的121-126各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在图像处理过程中,池化层的唯一目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子,以用于对输入图像进行采样获取较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外,就像卷积层中用权重矩阵的大小应该与图像大小相关一样,池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸,池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。
神经网络层130:
在经过卷积层/池化层120的处理后,卷积神经网络100还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层120只会提取特征,并减少输入图像带来的参数。然而为了生成最终的输出信息(所需要的类信息或别的相关信息),卷积神经网络100需要利用神经网络层130来生成一个或者一组所需要的类的数量的输出。因此,在神经网络层130中可以包括多层隐含层(如图2所示的131、132至13n)以及输出层140,该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练获取,例如该任务类型可以包括多音轨分离、图像识别,图像分类,图像超分辨率重建等等。
在神经网络层130中的多层隐含层之后,也就是整个卷积神经网络100的最后层为输出层140,该输出层140具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络100的前向传播(如图2由110至140的传播为前向传播)完成,反向传播(如图2由140至110的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差, 以减少卷积神经网络100的损失及卷积神经网络100通过输出层输出的结果和理想结果之间的误差。
需要说明的是,如图2所示的卷积神经网络100仅作为一种卷积神经网络的示例,在具体的应用中,卷积神经网络还可以以其他网络模型的形式存在,例如,如图3所示的多个卷积层/池化层并行,将分别提取的特征均输入给全神经网络层130进行处理。
下面介绍本申请实施例提供的一种芯片硬件结构。
图4为本发明实施例提供的一种芯片硬件结构,该芯片包括神经网络处理器40。该芯片可以被设置在如图1所示的执行设备110中,用以完成计算模块111的计算工作。该芯片也可以被设置在如图1所示的训练设备120中,用以完成训练设备120的训练工作并输出目标模型/规则101。如图2所示的卷积神经网络中各层的算法均可在如图4所示的芯片中得以实现。
神经网络处理器40可以是神经网络处理器(neural-network processing unit,NPU),张量处理器(tensor processing unit,TPU),或者图形处理器(graphics processing unit,GPU)等一切适合用于大规模异或运算处理的处理器。以NPU为例:神经网络处理器NPU40作为协处理器挂载到主中央处理器(central processing unit,CPU)(host CPU)上,由主CPU分配任务。NPU的核心部分为运算电路403,控制器404控制运算电路403提取存储器(权重存储器或输入存储器)中的数据并进行运算。
在一些实现中,运算电路403内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路403是二维脉动阵列。运算电路403还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路403是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器402中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器401中取矩阵A数据与矩阵B进行矩阵运算,获取的矩阵的部分结果或最终结果,保存在累加器408中。
向量计算单元407可以对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。例如,向量计算单元407可以用于神经网络中非卷积/非FC层的网络计算,如池化(Pooling),批归一化(Batch Normalization),局部响应归一化(Local Response Normalization)等。
在一些实现中,向量计算单元407将经处理的输出的向量存储到统一存储器406。例如,向量计算单元407可以将非线性函数应用到运算电路403的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元407生成归一化的值、合并值,或二者均有。在一些实现中,处理过的输出的向量能够用作运算电路403的激活输入,例如用于在神经网络中的后续层中的使用。
统一存储器406用于存放输入数据以及输出数据。
权重数据直接通过存储单元访问控制器405(direct memory access controller,DMAC)将外部存储器中的输入数据搬运到输入存储器401和/或统一存储器406、将外部存储器中的权重数据存入权重存储器402,以及将统一存储器406中的数据存入外部存储器。
总线接口单元(bus interface unit,BIU)410,用于通过总线实现主CPU、DMAC和取指存储器409之间进行交互。
与控制器404连接的取指存储器(instruction fetch buffer)409,用于存储控制器404使用的指令。
控制器404,用于调用取指存储器409中缓存的指令,实现控制该运算加速器的工作过程。
一般地,统一存储器406,输入存储器401,权重存储器402以及取指存储器409均为片上(On-Chip)存储器,外部存储器为该NPU外部的存储器,该外部存储器可以为双倍数据率同步动态随机存储器(double data rate synchronous dynamic random access memory,DDR SDRAM)、高带宽存储器(high bandwidth memory,HBM)或其他可读可写的存储器。
其中,图2或图3所示的卷积神经网络中各层的运算可以由运算电路403或向量计算单元407执行。
下面结合附图对本申请实施例的分离网络的训练方法和渲染方法进行详细的介绍。
首先,结合图5对本申请实施例的分离网络的训练方法进行详细介绍。图5所示的方法可以由分离网络的训练装置来执行,该分离网络的训练装置可以是云服务设备,也可以是终端设备,例如,电脑、服务器等运算能力足以用来执行分离网络的训练方法的装置,也可以是由云服务设备和终端设备构成的***。示例性地,训练方法可以由图1中的训练设备120、图4中的神经网络处理器40执行。
可选地,训练方法可以由CPU处理,也可以由CPU和GPU共同处理,也可以不用GPU,而使用其他适合用于神经网络计算的处理器,本申请不做限制。
可以理解的是,若本申请实施例中多媒体文件中原始音轨对应的发声对象的数量为多个时,可以使用分离网络对该原始音轨进行分离获取至少一个单对象音轨。当然,若多媒体文件中原始音轨只与一个发声对象对应,则该原始音轨即为单对象音轨,不需要使用分离网络进行分离。
训练方法可以包括步骤501与步骤502。下面对步骤501与步骤502进行详细说明。
步骤501、获取训练数据。
本申请实施例中的训练数据至少由初始第三单对象音轨以及初始第四单对象音轨合成获取,也可以理解为训练数据包括至少两个发声对象对应的单对象音轨合成的多音轨。初始第三单对象音轨与第三发声对象对应,初始第四单对象音轨与第四发声对象对应,另外,训练数据还可以包括与原始音轨匹配的图像,其中,训练数据也可以是多媒体文件,该多媒体文件包括上述的多音轨,多媒体文件除了包括音轨,还可以包括视频轨或文字轨(或称为弹幕轨)等,具体此处不做限定。
本申请实施例中的音轨(原始音轨、第一单对象音轨等)可以包括人声轨、乐器轨(例如:鼓声轨、钢琴轨、小号轨等)、飞机声等由发声对象(或称为发声物)产生的音轨,对于音轨对应的具体发声对象此处不做限定。
本申请实施例中获取训练数据可以是通过直接录制发声对象发声的方式获取,也可以是通过用户输入音频信息、视频信息的方式获取,还可以是通过接收采集设备发送的方式获取,在实际应用中,还有其他方式获取训练数据,对于训练数据的获取方式具体此处不做限定。
步骤502、以训练数据作为分离网络的输入,以损失函数的值小于第一阈值为目标对分 离网络进行训练,获取训练好的分离网络。
本申请实施例中的分离网络可以称为分离神经网络,也可以称为分离模型,还可以称为分离神经网络模型等,具体此处不做限定。
其中,损失函数用于指示分离获取的第三单对象音轨与初始第三单对象音轨之间的差异。
在该情况下,以减小损失函数的值为目标对分离网络进行训练,也就是不断缩小分离网络输出第三单对象音轨与初始第三单对象音轨之间的差异。该训练过程可以理解为分离任务。损失函数可以理解为分离任务对应的损失函数。其中,分离网络的输出(至少一个单对象音轨)是输入(音轨)中的至少一个发声对象对应的单对象音轨。第三发声对象与第一发声对象的属于相同类型,第二发声对象与第四发声对象的属于相同类型。例如:第一发声对象与第三发声对象都是人声,但第一发声对象可以是A用户,第二发声对象可以是B用户。换句话说,第三单对象音轨与第一单对象音轨为不同的人发出的声音对应的音轨。本申请实施例中的第三发声对象与第一发声对象可以是相同类型中的两个发声对象,也可以是相同类型中的一个发声对象,具体此处不做限定。
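示例性的,下面用PyTorch给出一个极简的训练过程示意:以合成的训练音轨为输入,以分离出的第三单对象音轨与初始第三单对象音轨之间的差异作为损失函数,迭代更新分离网络参数,直至损失小于第一阈值。其中网络结构、阈值取值以及随机数据均为示例性假设,并非本申请分离网络的限定实现:

    import torch
    import torch.nn as nn

    # 一个极简的一维卷积"分离网络",仅用于说明训练流程
    separator = nn.Sequential(
        nn.Conv1d(1, 16, kernel_size=9, padding=4),
        nn.ReLU(),
        nn.Conv1d(16, 1, kernel_size=9, padding=4),
    )
    optimizer = torch.optim.Adam(separator.parameters(), lr=1e-3)
    loss_fn = nn.L1Loss()                 # 用L1距离刻画两条音轨之间的差异
    first_threshold = 0.05                # 第一阈值(示例取值)

    track3 = torch.randn(8, 1, 16000)     # 初始第三单对象音轨(示例数据)
    track4 = torch.randn(8, 1, 16000)     # 初始第四单对象音轨(示例数据)
    mixture = track3 + track4             # 训练音轨由二者合成获取

    for step in range(1000):
        optimizer.zero_grad()
        separated3 = separator(mixture)   # 分离获取的第三单对象音轨
        loss = loss_fn(separated3, track3)
        loss.backward()
        optimizer.step()
        if loss.item() < first_threshold: # 损失小于第一阈值则停止训练
            break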
可选地,分离网络输入的训练数据包括至少两个发声对象对应的原始音轨,分离网络可以输出至少两个发声对象中某一个发声对象对应的单对象音轨,也可以输出至少两个发声对象分别对应的单对象音轨。
示例性的,多媒体文件包括人声对应的音轨、钢琴对应的音轨以及车声对应的音轨。该多媒体文件通过分离网络进行分离后,可以获取一个单对象音轨(例如人声对应的单对象音轨)、两个单对象音轨(例如人声对应的单对象音轨与车声对应的单对象音轨)或三个单对象音轨。
一种可能实现的方式中,分离网络如图6所示,分离网络包括一维卷积以及残差结构。其中,加入残差结构可以提高梯度传递效率。当然,分离网络还可以包括激活。池化等。对于分离网络的具体结构此处不做限定。图6所示的分离网络是以信号源(即多媒体文件中音轨对应的信号)为输入,通过多次卷积与反卷积进行变换,输出对象信号(一个发声对象对应的单音轨)。另外,还可以通过加入循环神经网络模块提高时序相关性,通过连接不同输出层提高高低维特征的联系。
另一种可能实现的方式中,分离网络如图7所示,在将信号源输入分离网络之前,可以先对信号源进行预处理,例如,对信号源进行STFT映射处理,获取时频谱。将时频谱中的幅度谱经过二维卷积与反卷积进行变换获取掩蔽谱(筛选出来的谱),将掩蔽谱与幅度谱结合获取目标幅度谱。再将目标幅度谱与相位谱相乘获取目标时频谱。对目标时频谱进行逆短时傅里叶变换(iSTFT)映射获取对象信号(一个发声对象对应的单音轨)。其中,也可以通过连接不同输出层提高高低维特征的联系、加入残差结构提高梯度传递效率以及加入循环神经网络模块提高时序相关性。
其中,图6的输入也可以理解为是一维时域信号,图7的输入是二维时频谱信号。
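示例性的,下面的Python片段按图7的思路示意"时频谱—掩蔽谱—目标时频谱—逆变换"的数据流向,其中掩蔽谱用随意构造的数组代替网络输出,且省略了重叠相加等细节,仅用于说明处理流程:

    import numpy as np

    def istft_frame(spec_frame, frame_len=1024):
        # 对单帧目标时频谱做逆FFT,得到时域小段
        return np.fft.irfft(spec_frame, n=frame_len)

    frame_len, hop = 1024, 256
    window = np.hanning(frame_len)
    x = np.random.randn(16000)                    # 示例混合信号

    # STFT映射:得到时频谱,并拆成幅度谱与相位谱
    frames = [np.fft.rfft(x[s:s + frame_len] * window)
              for s in range(0, len(x) - frame_len + 1, hop)]
    spec = np.array(frames)
    mag, phase = np.abs(spec), np.angle(spec)

    # 掩蔽谱:实际中由二维卷积网络从幅度谱预测,这里用示例掩码代替
    mask = np.clip(np.random.rand(*mag.shape), 0, 1)
    target_mag = mask * mag                       # 目标幅度谱
    target_spec = target_mag * np.exp(1j * phase) # 与相位谱结合得到目标时频谱

    # 逐帧iSTFT还原对象信号(示意)
    obj_frames = [istft_frame(f, frame_len) for f in target_spec]
    print(len(obj_frames), obj_frames[0].shape)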
上述两种分离模型只是举例,在实际应用中,还有其他可能的结构。分离模型的输入可以是时域信号,输出可以是时域信号,分离模型的输入可以是时频域信号,输出是时频域信号等,对于分离模型的结构、输入或输出具体此处不做限定。
可选地,在输入分离网络之前,还可以先经过识别网络对多媒体文件中的多音轨进行识 别,识别出该多音轨包括音轨的数量以及对象类别(例如:人声、鼓声等),可以减少分离网络的训练时长。当然,分离网络也可以包括识别多音轨的识别子网络,具体此处不做限定。其中,该识别网络的输入可以是时域信号,输出是类别概率。相当于将时域信号输入识别网络,获取对象是某种类别的概率,选择概率超过阈值的类别为分类的类别。这里的对象也可以理解为是发声对象。
示例性的,上述识别网络中的输入是A车辆与B车辆对应的音频合成的多媒体文件,将该多媒体文件输入识别网络,该识别网络可以输出类别为车。当然,在训练数据足够全面的情况下,识别网络也可以识别出来具体车的种类,相当于更进一步的细粒度识别。识别网络根据实际需要设置,具体此处不做限定。
需要说明的是,训练过程也可以不采用前述训练方法而采用其他训练方法,此处不做限定。
请参阅图8,本申请提供的另一种系统架构。该系统架构包括输入模块、功能模块、数据库模块以及输出模块。下面对各个模块进行详细描述:
1、输入模块。
输入模块包括数据库选项子模块、传感器信息获取子模块、用户界面输入子模块以及文件输入子模块。其中,上述四种子模块也可以理解为输入的四种方式。
数据库选项子模块用于根据用户选择的数据库中存储的渲染方式进行空间渲染。
传感器信息获取子模块用于根据传感器(可以是渲染设备中的传感器,也可以是另外的传感器设备,具体此处不做限定)进行特定发声对象的空间位置的指定,该种方式,可以由用户选择特定发声对象的位置。
用户界面输入子模块用于响应用户对于用户界面的操作确定特定发声对象的空间位置。可选地,用户可以通过点击、拖拽等形式对特定发声对象的空间位置进行控制。
文件输入子模块用于根据图像信息或文本信息(例如:歌词、字幕等)进行特定发声对象的追踪,进而根据追踪到特定发声对象的位置,确定特定发声对象的空间位置。
2、功能模块。
功能模块包括信号传输子模块、对象识别子模块、校准子模块、对象追踪子模块、方位计算子模块、对象分离子模块以及渲染子模块。
信号传输子模块用于接收与发送信息。具体可以是接收输入模块的输入信息,输出反馈信息给其他模块。例如,反馈信息包括特定发声对象位置变换信息、分离后的单对象音轨等信息。当然,信号传输子模块还可以用于将识别出的对象信息通过用户界面(user interface,UI)反馈给用户等,具体此处不做限定。
对象识别子模块用于识别信号传输子模块接收到输入模块发送的多音轨信息所有的对象信息,这里的对象是指发声对象(或称为发声物),例如人声、鼓声、飞机声等。可选地,该对象识别子模块可以是前述图5所示实施例中所描述的识别网络或分离网络中的识别子网络。
校准子模块用于对播放设备的初始状态校准,例如:播放设备为耳机时,校准子模块用于耳机校准,播放设备为外放设备时,校准子模块用于外放设备校准。对于耳机校准:可以默认传感器(图9会对传感器设备与播放设备之间的关系进行说明)的初始状态为正前方,后续通过该正前方进行校正。也可以获取用户拜访传感器的真实位置,保证声像的正前方是 耳机的正前方。对于外放设备校准:先获取每个外放设备的坐标位置(可以根据用户终端的传感器进行交互获取、后续图9会有对应描述)。校准子模块校准播放设备之后,将校准后的播放设备信息通过信号传输子模块传到数据库模块中。
对象追踪子模块用于跟踪特定发声对象的运动轨迹。该特定发声对象可以是多模态文件(例如:音频信息以及与该音频信息对应的视频信息、音频信息以及与该音频信息对应的文字信息等)中显示的文字或图像中的发声物。可选地,该对象追踪子模块还可以用于在音频侧进行运动轨迹的渲染。另外,该对象追踪子模块还可以包括目标识别网络和深度估计网络,该目标识别网络用于识别需要追踪的特定发声对象,该深度估计网络用于获取图像中特定发声对象的相对坐标(后续实施例中会详细描述),使得对象追踪子模块根据相对坐标对特定发声对象对应的音频进行方位与运动轨迹的渲染。
方位计算子模块用于将输入模块获取的信息(例如:传感器信息、UI界面的输入信息、文件信息等)转化为方位信息(也可以称为声源位置)。针对不同的信息会有相应的转化方法,后续实施例中会详细描述转化的具体过程。
对象分离子模块用于将多媒体文件(或称为多媒体信息)或多音轨信息分离出至少一个单对象音轨。例如:从歌曲中提取单独的人声轨(即只有人声的音频文件)。其中,该对象分离子模块可以是前述图5所示实施例中的分离网络。进一步的,该对象分离子模块的结构可以如图6或图7所示,具体此处不做限定。
渲染子模块用于获取方位计算子模块获取的声源位置,并对声源位置进行空间渲染。进一步的,可以根据输入模块中UI的输入信息选择的播放设备来确定相应的渲染方法。针对不同的播放设备,渲染方式存在不同,后续实施例中会详细描述渲染的过程。
3、数据库模块。
数据库模块包括数据库选择子模块、渲染规则编辑子模块以及渲染规则共享子模块。
数据库选择子模块用于存储渲染规则。该渲染规则可以是***初始化时自带的默认双声道/多声道音轨转化成三维(three dimensional,3D)空间感的渲染规则,也可以是用户保存的渲染规则。可选地,不同对象可以对应相同的渲染规则,或者不同对象可以对应不同的渲染规则。
渲染规则编辑子模块用于对保存的渲染规则进行重新编辑。可选地,该保存的渲染规则可以是数据库选择子模块中存储的渲染规则,也可以是新输入的渲染规则,具体此处不做限定。
渲染规则共享子模块用于将渲染规则上传至云端,和/或用于从云端的渲染规则数据库中下载特定的渲染规则。例如,渲染规则共享模块可以将用户自定义的渲染规则上传到云端,分享给其他用户。用户可以从云端存储的渲染规则数据库中选择与待播放多音轨信息匹配的其他用户分享的渲染规则,下载到终端侧的数据库,作为进行音频3D渲染规则的数据文件。
4、输出模块。
输出模块用于将渲染后的单对象音轨或目标音轨(根据原始音轨以及渲染后的单对象音轨获取)通过播放设备进行播放。
首先,先对本申请实施例提供的渲染方法所适用的应用场景进行描述:
请参阅图9,该应用场景包括控制设备901,传感器设备902以及播放设备903。
本申请实施例中的播放设备903可以是外放设备,也可以是耳机(例如入耳式耳机、头戴式耳机等),还可以是大屏(例如投影屏)等,具体此处不做限定。
其中,控制设备901与传感器设备902之间,以及传感器设备902与播放设备903之间可以通过有线、无线保真(wireless fidelity,WIFI)、移动数据网或其他连接方式连接,具体此处不做限定。
本申请实施例中的控制设备901是一种用于服务用户的终端设备,终端设备可以包括头戴显示设备(head mount display,HMD)、该头戴显示设备可以是虚拟现实(virtual reality,VR)盒子与终端的组合,VR一体机,个人计算机(personal computer,PC)VR,增强现实(augmented reality,AR)设备,混合现实(mixed reality,MR)设备等,该终端设备还可以包括蜂窝电话(cellular phone)、智能电话(smart phone)、个人数字助理(personal digital assistant,PDA)、平板型电脑、膝上型电脑(laptop computer)、个人电脑(personal computer,PC)、车载终端等,具体此处不做限定。
本申请实施例中的传感器设备902是一种用于感知朝向和/或位置的设备,可以是激光笔、手机、智能手表、智能手环、具有惯性测量单元(inertial measurement unit,IMU)的设备、具有即时定位与地图构建(simultaneous localization and mapping,SLAM)传感器的设备等,具体此处不做限定。
本申请实施例中的播放设备903是一种用于播放音频或视频的设备,可以是外放设备(例如:音响、具有播放音频或视频功能的终端设备),也可以是内放设备(例如:入耳式耳机、头戴式耳机、AR设备、VR设备等)等,具体此处不做限定。
可以理解的是,图9所示应用场景中各设备的数量可以是一个或多个,例如外放设备可以包括多个,对于各设备的数量具体此处不做限定。
本申请实施例中的控制设备、传感器设备、播放设备可以是三个设备,也可以是二个设备,还可以是一个设备,具体此处不做限定。
一种可能实现的方式中,图9所示应用场景中的控制设备与传感器设备为同一个设备。例如:控制设备与传感器设备为同一个手机,播放设备为耳机。又例如:控制设备与传感器设备为同一个手机,播放设备为外放设备(也可以称为外放设备系统,该外放设备系统包括一个或多个外放设备)。
另一种可能实现的方式中,图9所示应用场景中的控制设备与播放设备为同一个设备。例如:控制设备与播放设备为同一个电脑。又例如:控制设备与播放设备为同一个大屏。
另一种可能实现的方式中,图9所示应用场景中的控制设备、传感器设备、播放设备为同一设备。例如:控制设备、传感器设备、播放设备为同一个平板电脑。
下面结合上述应用场景以及附图对本申请实施例的渲染方法进行详细的介绍。
请参阅图10,本申请实施例提供的渲染方法一个实施例,该方法可以由渲染设备执行,也可以由渲染设备的部件(例如处理器、芯片、或芯片***等)执行,该实施例包括步骤1001至步骤1004。
本申请实施例中,该渲染设备可以具有如图9中控制设备的功能、传感器设备的功能和/或播放设备的功能,具体此处不做限定。下面以渲染设备是控制设备(例如笔记本),传感器设备是具有IMU的设备(例如手机),播放设备是外放设备(例如音响)为例对渲染方法进行 描述。
本申请实施例中所描述的传感器可以是指渲染设备中的传感器,也可以是指除了渲染设备以外的设备(如前述的传感器设备)中的传感器,具体此处不做限定。
步骤1001、校准播放设备。本步骤是可选地。
可选地,在播放设备播放渲染后的音轨之前,可以先对播放设备进行校准,校准的目的是为了提升渲染音轨的空间效果的真实性。
本申请实施例中校准播放设备的方式可以有多种,下面仅以播放设备是外放设备为例对一种播放设备的校准过程进行示例性说明。参考图11,本实施例提供的一种校准播放设备的方法,该方法包括步骤1至步骤5。
可选地,在校准前,用户手持的手机与外放设备建立连接,连接方式与如图9所示实施例中对于传感器设备与播放设备之间的连接方式类似,此处不再赘述。
步骤1、确定播放设备类型。
本申请实施例中,渲染设备可以通过用户的操作确定播放设备类型,也可以自适应检测播放设备类型,也可以通过默认设置确定播放设备类型,还可以是其他方式确定播放设备类型,具体此处不做限定。
示例性的,如果渲染设备通过用户的操作确定播放设备类型,渲染设备可以显示如图12所示的界面,该界面包括选择播放设备类型图标。另外,该界面还可以包括选择输入文件图标、选择渲染方式(即参考信息选项)图标、校准图标、猎音人图标、对象栏、音量、时长进度以及球视图(或称为三维视图)。如图13所示,用户可以点击“选择播放设备类型图标”101。如图14所示,渲染设备响应于点击操作,显示下拉菜单,该下拉菜单可以包括“外放设备选项”与“耳机选项”。进一步的,用户可以点击“外放设备选项”102,进而确定播放设备的类型为外放设备。如图15所示,渲染设备显示的界面中,可以由“外放设备”替换“选择播放设备类型”,以提示用户现在的播放设备类型为外放设备。也可以理解为,渲染设备显示如图12所示的界面,渲染设备接收用户的第五操作(即如图13与图14所示的点击操作),渲染设备响应该第五操作,从播放设备类型选项中选择播放设别类型为外放设备。
此外,由于该方法是为了校准播放设备,如图16所示,用户还可以点击“校准图标”103,如图17所示,渲染设备响应于点击操作,显示下拉菜单,该下拉菜单可以包括“默认选项”与“手动校准选项”。进一步的,用户可以点击“手动校准选项”104,进而确定校准的方式为手动校准,该手动校准可以理解为用户使用手机(即传感器设备)对播放设备进行校准。
图14中仅以“选择播放设备类型图标”的下拉菜单包括“外放设备选项”与“耳机选项”为例,实际应用中,该下拉菜单也可以包括耳机的具体类型选项,例如头戴式耳机、入耳式耳机、有线耳机、蓝牙耳机等选项,具体此处不做限定。
图17中仅以“校准图标”的下拉菜单包括“默认选项”与“手动校准选项”为例,实际应用中,该下拉菜单也可以包括其他类型的选项,具体此处不做限定。
步骤2、确定测试音频。
本申请实施例中的测试音频可以是默认设置的测试信号(例如粉红噪声),也可以是通过上述图5所示实施例中的分离网络从歌曲(即多媒体文件为歌曲)中分离出来的人声对应的 单对象音轨,也可以是歌曲中其他单对象音轨对应的音频,还可以是只包括单对象音轨的音频等,具体此处不做限定。
示例性的,用户可以在渲染设备显示的界面中点击“选择输入文件图标”选择测试音频。
步骤3、获取手机的姿态角以及传感器与外放设备之间的距离。
确定测试音源后,外放设备依次对测试音频进行播放,用户手持传感器设备(例如手机)指向正在播放测试音频的外放设备。待手机摆位稳定后,记录当前手机的朝向以及接收到的该测试音频的信号能量,并根据下述公式一计算手机与该外放设备的距离。在外放设备为多个时,操作类似,此处不再赘述。其中,手机摆位稳定可以理解为在一段时间(例如200毫秒)内,手机朝向的方差小于阈值(例如5度)。
可选地,若播放设备是两个外放设备,第一个外放设备先播放测试音频,用户手持手机指向该第一个外放设备。待第一个外放设备校准完毕后,用户手持手机再指向第二个外放设备进行校准。
本申请实施例中手机的朝向可以是指手机的姿态角,该姿态角可以包括方位角与倾斜角(或称为倾侧角),或者该姿态角包括方位角、倾斜角以及俯仰角。其中,方位角代表绕z轴的角度,倾斜角代表绕y轴的角度,俯仰角代表绕x轴的角度。手机朝向与x轴、y轴、z轴的关系可以如图18所示。
示例性的,延续上述举例,播放设备是两个外放设备,第一个外放设备先播放测试音频,用户手持手机指向该第一个外放设备,记录当前手机的朝向以及接收到该测试音频的信号能量。接着,第二个外放设备播放测试音频,用户手持手机指向该第二个扬声器,记录当前手机的朝向以及接收到该测试音频的信号能量。
进一步的,在校准外放设备过程中,渲染设备可以显示如图19所示的界面,其中该界面的右侧为球视图,该球视图中可以显示已校准的外放设备以及正在校准的外放设备。另外,还可以显示未校准的外放设备(图中未示出),具体此处不做限定。该球视图的圆心为用户所在的位置(也可以理解为是用户手持手机的位置,由于用户手持手机,所以手机位置近似用户所在位置),半径可以是用户所在位置(或手机位置)与外放设备的距离,也可以是默认值(例如1米)等,具体此处不做限定。
为了便于理解,如图20所示,用户手持手机朝向外放设备的一种示例效果图。
本申请实施例中外放设备的数量为N个,且N为正整数,第i个外放设备是指N个外放设备中的某一个外放设备,其中,i为正整数且i≤N。本申请实施例中的公式都是以第i个外放设备为例进行计算,其他外放设备的计算与第i个外放设备的计算类似。
校准第i个外放设备采用的公式一可以如下所述:
公式一:
(公式一以附图形式给出:根据外放设备播放测试信号的能量X(t)与手机接收到测试信号的能量x(t)的比值,并结合归一化距离r s,求取手机与第i个外放设备之间的距离r i。)
其中,x(t)为t时刻下手机接收到测试信号的能量,X(t)为t时刻下外放设备播放测试信号的能量,t为正数,r i为手机与第i个外放设备之间的距离(由于用户手持手机,所以也可以理解为是用户与第i个外放设备之间的距离),r s为归一化距离,该归一化距离可以 理解为是一个系数,用于转化x(t)与X(t)的比值到距离,该系数可以根据实际外放设备的情况进行设置,r s的具体取值此处不做限定。
另外,当外放设备为多个时,依次播放测试信号并朝向外放设备,通过公式一得到距离。
可以理解的是,上述公式一只是一种举例,实际应用中,公式一还可以有其他形式,例如去掉等,具体此处不做限定。
步骤4、基于姿态角与距离确定外放设备的位置信息。
在步骤3中,手机已记录手机朝向每个外放设备的姿态角以及通过公式一计算出手机分别与每个外放设备之间的距离。当然,手机也可以将测得的姿态角以及接收到的信号能量发送给渲染设备,由渲染设备通过公式一计算出手机分别与每个外放设备之间的距离,具体此处不做限定。
渲染设备获取手机的姿态角以及手机与外放设备之间的距离之后,可以通过公式二将手机的姿态角以及手机与外放设备之间的距离转化为外放设备在球坐标系中的位置信息,该位置信息包括方位角、倾斜角以及距离(即传感器设备与播放设备之间的距离)。当外放设备***中外放设备的数量为多个时,确定其他外放设备的位置信息类似,此处不再赘述。
上述的公式二可以如下所述:
公式二:
(公式二以附图形式给出:由t时刻下手机的方位角Ω(t)[0](经%360处理)、俯仰角Ω(t)[1](结合sign)以及公式一所求出的距离r i,求取第i个外放设备在球坐标系中的方位角λ(t)、倾斜角Φ(t)与距离d(t)。)
其中,λ(t)为t时刻下第i个外放设备在球坐标系中的方位角,Φ(t)为t时刻下第i个外放设备在球坐标系中的倾斜角,d(t)为手机与第i个外放设备之间的距离;Ω(t)[0]为t时刻下手机的方位角(即手机绕z轴的旋转角度),Ω(t)[1]为t时刻下手机的俯仰角(即手机绕x轴的旋转角度);r i为公式一所求出的距离,sign代表正负值,若Ω(t)[1]为正,则sign为正;若Ω(t)[1]为负,则sign为负;%360用于调整角度范围至0度-360度,例如:若Ω(t)[0]的角度为-80度,则Ω(t)[0]%360代表-80+360=280度。
可以理解的是,上述公式二只是一种举例,实际应用中,公式二还可以有其他形式,具体此处不做限定。
示例性的,待用户校准完外放设备之后,渲染设备可以显示如图21所示的界面,该界面显示“已校准图标”,右侧球视图中可以显示已校准播放设备的位置。
通过上述校准播放设备,可以解决校准不规则的外放设备的难题,使用户在后续操作中获取各外放设备的空间定位,从而精确渲染出单对象音轨所需的位置,提升渲染音轨的空间效果的真实性。
步骤1002、基于多媒体文件获取第一单对象音轨。
本申请实施例中渲染设备获取多媒体文件可以是通过直接录制第一发声对象发声的方式获取,也可以是通过其他设备发送的方式获取,例如:通过接收采集设备(例如:摄像机、录音机、手机等)发送的方式获取等,在实际应用中,还有其他方式获取多媒体文件,对于多媒体文件的具体获取方式此处不做限定。
本申请实施例中的多媒体文件具体可以是音频信息,例如:立体声音频信息、多声道音频信息等。多媒体文件具体也可以是多模态信息,例如该多模态信息是视频信息、与音频信息对应的图像信息、文字信息等。也可以理解为多媒体文件除了包括音轨,还可以包括视频轨或文字轨(或称为弹幕轨)等,具体此处不做限定。
另外,多媒体文件可以包括第一单对象音轨,或者包括原始音轨,该原始音轨至少由两个单对象音轨合成,具体此处不做限定。其中,原始音轨可以是单音轨,也可以是多音轨,具体此处不做限定。原始音轨可以包括人声轨、乐器轨(例如:鼓声轨、钢琴轨、小号轨等)、飞机声等由发声对象(或称为发声物)产生的音轨,对于原始音轨对应的发声对象的具体类型此处不做限定。
根据多媒体文件中原始音轨的多种情况,本步骤的处理方式可能不同,下面分别描述:
第一种,多媒体文件中的音轨为单对象音轨。
该种情况下,渲染设备可以直接从多媒体文件中获取第一单对象音轨。
第二种,多媒体文件中的音轨为多对象音轨。
该种情况也可以理解为多媒体文件中的原始音轨对应多个发声对象。可选地,该原始音轨除了与第一发声对象对应以外,还与第二发声对象对应。即该原始音轨至少由第一单对象音轨与第二单对象音轨合成获取,第一单对象音轨与第一发声对象对应,第二单对象音轨与第二发声对象对应。
该种情况下,渲染设备可以从原始音轨中分离出第一单对象音轨,也可以从原始音轨中分离出第一单对象音轨与第二单对象音轨,具体此处不做限定。
可选地,渲染设备可以通过前述图5所示实施例中的分离网络从原始音轨中分离出第一单对象音轨。另外,渲染设备还可以通过分离网络从原始音轨中分离出第一单对象音轨以及第二单对象音轨,具体此处不做限定,输出的不同取决于训练分离网络的方式不同,具体可以参考图5所示实施例中的描述,此处不再赘述。
可选地,渲染设备确定多媒体文件之后,渲染设备可通过识别网络或分离网络识别出多媒体文件中原始音轨的发声对象,例如该原始音轨中包含的发声对象包括第一发声对象以及第二发声对象。渲染设备可以随机选取其中一个发声对象为第一发声对象,也可以通过用户的选择确定第一发声对象。进一步的,渲染设备确定第一发声对象之后,可以通过分离网络获取第一单对象音轨。当然,渲染设备在确定多媒体文件之后,可以先通过识别网络获取发声对象,再通过分离网络获取发声对象的单对象音轨。也可以直接通过识别网络和/或分离网络获取多媒体文件包括的发声对象以及发声对象对应的单对象音轨,具体此处不做限定。
示例性的,延续上述举例,在对播放设备校准完毕之后,渲染设备可以显示如图21所示的界面或如图22所示的界面。用户可以通过点击“选择输入文件图标”105选取多媒体文件,这里的多媒体文件以“Dream it possible.wav”为例。也可以理解为,渲染设备接收用户的第四操作,渲染设备响应该第四操作,从存储区存储的至少一个多媒体文件中确定中选择“Dream it possible.wav”(即目标文件)为多媒体文件。其中,该存储区可以是渲染设备中的存储区,也可以是外接设备(例如U盘等)中的存储区,具体此处不做限定。在用户选取多媒体文件之后,渲染设备可以显示如图23所示的界面,该界面中,可以由“Dream it possible.wav”替换“选择输入文件”,以提示用户现在的多媒体文件是:Dream it  possible.wav。另外,渲染设备可以利用图4所示实施例中的识别网络和/或分离网络,识别“Dream it possible.wav”中的发声对象以及分离出每个发声对象对应的单对象音轨。例如,渲染设备识别出“Dream it possible.wav”包括的发声对象为人、钢琴、小提琴、吉他。如图23所示,渲染设备显示的界面还可以包括对象栏,对象栏中可以显示“人声图标”、“钢琴图标”、“小提琴图标”、“吉他图标”等图标,供用户选择待渲染的发声对象。可选地,该对象栏中还可以显示“合图标”,用户可以通过点击“合图标”停止发声对象的选择。
进一步的,如图24所示,用户可以通过点击“人声图标”106确定待渲染的音轨为人声对应的单对象音轨。也可以理解为,渲染设备在对“Dream it possible.wav”进行识别,获取渲染设备显示如图24所示的界面,渲染设备接收用户的点击操作,渲染设备响应该点击操作,从界面中选择第一图标(即“人声图标”106),进而使渲染设备确定第一单对象音轨为人声。
可以理解的是,图22至图24所示的播放设备类型仅以外放设备为例,当然,用户可以选择播放设备的类型为耳机,接下来仅以校准中用户选择的播放设备类型是外放设备为例进行示意性说明。
另外,渲染设备还可以复制原始音轨中的某一个或某几个单对象音轨,例如,如图25所示,用户还可以在对象栏中复制“人声图标”获取“人声2图标”,人声2对应的单对象音轨与人声对应的单对象音轨相同。其中,复制的方式可以是用户双击“人声图标”,还可以是双击球视图上的人声,具体此处不做限定。用户复制获取“人声2图标”,可以默认用户失去对人声的控制权,开始控制人声2。可选地,在用户复制获取人声2后,人声的第一声源位置还可以在球视图中显示,当然,用户除了可以复制发声对象,也可以删除发声对象。
步骤1003、基于参考信息确定第一发声对象的第一声源位置。
本申请实施例中,在多媒体文件的原始音轨包括多个发声对象时,可以基于参考信息确定一个发声对象的声源位置,也可以确定多个发声对象对应的多个声源位置,具体此处不做限定。
示例性的,延续上述举例,渲染设备确定第一发声对象为人声,渲染设备可以显示如图26所示的界面,用户可以点击“选择渲染方式图标”107选择参考信息,该参考信息用于确定第一发声对象的第一声源位置。如图27所示,渲染设备响应于用户的第一操作(即前述的点击操作),可以显示下拉菜单,该下拉菜单可以包括“自动渲染选项”与“互动渲染选项”。其中,“互动渲染选项”与参考位置信息对应,“自动渲染选项”与媒体信息对应。
也可以理解为渲染方式包括自动渲染方式与互动渲染方式,其中,自动渲染方式是指,渲染设备根据多媒体文件中的媒体信息自动获取渲染后的第一单对象音轨。互动渲染方式是指,通过用户与渲染设备的交互,获取渲染后的第一单对象音轨。换句话说,当确定的是自动渲染方式时,渲染设备可以基于预设方式获取渲染后的第一单对象音轨;或当确定的是互动渲染方式时,响应于用户的第二操作以获得参考位置信息;基于参考位置信息确定第一发声对象的第一声源位置;基于第一声源位置对第一单对象音轨进行渲染,以获取渲染后的第一单对象音轨。其中,预设方式包括:获取多媒体文件的媒体信息;基于媒体信息确定第一发声对象的第一声源位置;基于第一声源位置对第一单对象音轨进行渲染,以获取渲染后的第一单对象音轨。
另外,本申请实施例中的声源位置(第一声源位置以及第二声源位置)可以是某个时刻的固定位置,也可以是某个时间段内的多个位置(例如运动轨迹),具体此处不做限定。
本申请实施例中的参考信息有多种情况,下面分别描述:
第一种,参考信息包括参考位置信息。
本申请实施例中的参考位置信息用于指示第一发声对象的声源位置,参考位置信息可以是传感器设备的第一位置信息,也可以是用户选择的第二位置信息等,具体此处不做限定。
本申请实施例中的参考位置信息有多种情况,下面分别描述:
1、参考位置信息是传感器设备(以下简称传感器)的第一位置信息。
示例性的,延续上述举例,如图27所示,进一步的,用户可以点击“互动渲染选项”108,确定渲染方式为互动渲染。渲染设备可以响应于点击操作,显示下拉菜单,该下拉菜单可以包括“朝向控制选项”、“位置控制选项”以及“界面控制选项”。
本申请实施例中的第一位置信息有多种情况,下面分别描述:
1.1、第一位置信息包括传感器的第一姿态角。
与之前通过传感器的朝向校准外放设备类似,用户可以通过第二操作(例如上下左右的平移)调整手持传感器设备(例如手机)的朝向,进而确定第一单对象音轨的第一声源位置。也可以理解为,渲染设备可以接收手机的第一姿态角,并利用下述公式三获取第一单对象音轨的第一声源位置,该第一声源位置包括方位角、倾斜角以及外放设备与手机之间的距离。
进一步的,用户还可以再通过调整手持手机的朝向确定第二单对象音轨的第二声源位置。也可以理解为,渲染设备可以接收手机的第一姿态角(包括方位角以及倾斜角),并利用下述公式三获取第二单对象音轨的第二声源位置,该第二声源位置包括方位角、倾斜角以及外放设备与手机之间的距离。
可选地,若手机与渲染设备未建立连接,渲染设备可以向用户发送提醒信息,该提醒信息用于提醒用户连接手机与渲染设备。当然,手机与渲染设备也可以是同一个手机,该种情况下不用发送提醒信息。
上述的公式三可以如下所述:
公式三:
(公式三以附图形式给出:由t时刻下手机的方位角Ω(t)[0](经%360处理)、倾斜角Ω(t)[1](结合sign)以及手机与外放设备之间的距离d(t)(可取校准时由公式一求出的距离或默认值),求取第一声源位置的方位角λ(t)、倾斜角Φ(t)与距离d(t)。)
其中,λ(t)为t时刻下第i个外放设备在球坐标系中的方位角,Φ(t)为t时刻下第i个外放设备在球坐标系中的倾斜角,d(t)为t时刻下手机与第i个外放设备之间的距离;Ω(t)[0]为t时刻下手机的方位角(即手机绕z轴的旋转角度),Ω(t)[1]为t时刻下手机的倾斜角(即手机绕y轴的旋转角度);d(t)为t时刻下手机与第i个外放设备之间的距离,可以用校准时由公式一所求出的距离,也可以是默认数值(例如1米),d(t)可以根据需要调整;sign代表正负值,若Ω(t)[1]为正,则sign为正;若Ω(t)[1]为负,则sign为负;%360用于调整角度范围至0度-360度,例如:若Ω(t)[0]的角度为-80度,则Ω (t)[0]%360代表-80+360=280度。
可以理解的是,上述公式三只是一种举例,实际应用中,公式三还可以有其他形式,具体此处不做限定。
示例性的,延续上述举例,如图28所示,渲染设备可以显示下拉菜单,该下拉菜单可以包括“朝向控制选项”、“位置控制选项”以及“界面控制选项”。用户可以点击“朝向控制选项”109,进而确定渲染方式为互动渲染中的朝向控制。另外在用户选择朝向控制之后,渲染设备可以显示如图29所示的界面,该界面中,可以由“朝向控制渲染”替换“选择渲染方式”,以提示用户现在的渲染方式为朝向控制。此时,用户可以调整手机的朝向,在用户调整手机的朝向时,如图30所示,渲染设备显示界面中的球视图可以显示虚线,该虚线用于表示现在手机的朝向,使得用户可以直观看出手机在球视图中的朝向,进而方便用户确定第一声源位置。待手机的朝向稳定(可参考前述关于手机摆位稳定的解释,此处不再赘述)后,确定当前手机的第一姿态角。进而通过上述公式三获取第一声源位置。另外,若用户相较于校准时的位置没有发生变化,可以用校准时所获取的手机与外放设备之间的距离作为上述公式三中的d(t)。进而用户基于第一姿态角确定第一发声对象的第一声源位置,或者理解为第一单对象音轨的第一声源位置。进一步的,渲染设备可以显示如图31所示的界面,该界面的球视图中包括与第一声源位置对应的传感器的第一位置信息(即手机的第一位置信息)。
另外,上述举例描述了确定第一声源位置的一种举例,进一步的,用户还可以确定第二单对象音轨的第二声源位置。示例性的,如图32所示,用户可以通过点击“小提琴图标”110确定第二发声对象为小提琴。渲染设备监测手机的姿态角,并通过公式三确定第二声源位置。如图32所示,渲染设备显示界面中的球视图可以显示目前确定的第一发声对象(人)的第一声源位置以及第二发声对象(小提琴)的第二声源位置。
该种方式中,用户可以通过传感器提供的朝向(即第一姿态角)对选定的发声对象进行实时或者后期的动态渲染。此时传感器就类似一个激光笔,激光的指向就是声源位置。这样控制可以赋予发声对象具体的空间方位以及运动,实现用户与音频进行交互创作,提供用户新的体验。
1.2、第一位置信息包括传感器的第二姿态角以及加速度。
用户可以通过第二操作控制传感器设备(例如手机)的位置,从而确定第一声源位置。也可以理解为,渲染设备可以接收手机的第二姿态角(包括方位角、倾斜角以及俯仰角)以及加速度,并利用下述公式四以及公式五获取第一声源位置,该第一声源位置包括方位角、倾斜角以及外放设备与手机之间的距离。即先通过公式四将手机的第二姿态角以及加速度转化为该手机在空间直角坐标系下的坐标,再通过公式五将手机在空间直角坐标系下的坐标转化为手机在球坐标系下的坐标,即第一声源位置。
上述的公式四以及公式五可以如下所述:
公式四:
(公式四以附图形式给出:根据t时刻下手机的姿态角Ω(t)与加速度a(t)(扣除重力加速度g的影响),求取手机在空间直角坐标系中的坐标x(t)、y(t)、z(t)。)
公式五:
$d(t)=\sqrt{x(t)^{2}+y(t)^{2}+z(t)^{2}},\quad \lambda(t)=\arctan\frac{y(t)}{x(t)},\quad \Phi(t)=\arcsin\frac{z(t)}{d(t)}$
其中,x(t)、y(t)、z(t)为t时刻下手机在空间直角坐标系中的位置信息,g为重力加速度,a(t)为t时刻下手机的加速度,Ω(t)[0]为t时刻下手机的方位角(即手机绕z轴的旋转角度),Ω(t)[1]为t时刻下手机的俯仰角(即手机绕x轴的旋转角度),Ω(t)[2]为t时刻下手机的倾斜角(即手机绕y轴的旋转角度),λ(t)为t时刻下第i个外放设备的方位角,Φ(t)为t时刻下第i个外放设备的倾斜角,d(t)为t时刻下第i个外放设备与手机之间的距离,
可以理解的是,上述公式四以及公式五只是一种举例,实际应用中,公式四以及公式五还可以有其他形式,具体此处不做限定。
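示例性的,空间直角坐标到球坐标(方位角、倾斜角、距离)的转换可以用如下Python片段示意,角度以度为单位,负的方位角按%360调整到0度-360度范围,坐标取值为示例:

    import math

    def cart_to_sphere(x, y, z):
        d = math.sqrt(x * x + y * y + z * z)             # 距离
        azimuth = math.degrees(math.atan2(y, x)) % 360   # 方位角
        elevation = math.degrees(math.asin(z / d))       # 倾斜角
        return azimuth, elevation, d

    print(cart_to_sphere(1.0, 1.0, 0.5))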
示例性的,延续上述举例,若渲染设备显示如图27所示的界面,在用户确定渲染方式为互动渲染之后。渲染设备可以显示下拉菜单,该下拉菜单可以包括“朝向控制选项”、“位置控制选项”以及“界面控制选项”。用户可以点击“朝向控制选项”如图33所示,用户可以点击“位置控制选项”111,进而确定渲染方式为互动渲染中的位置控制。另外在用户选择位置控制之后,渲染设备显示界面中,可以由“位置控制渲染”替换“选择渲染方式”,以提示用户现在的渲染方式为位置控制。此时,用户可以调整手机的位置,待手机的位置稳定(可参考前述关于手机摆位稳定的解释,此处不再赘述)后,确定当前手机的第二姿态角以及加速度。进而通过上述公式四以及公式五获取第一声源位置。进而用户基于第二姿态角以及加速度确定第一发声对象的第一声源位置,或者理解为第一单对象音轨的第一声源位置。进一步的,在用户调整手机的过程中,或者手机位置稳定之后,渲染设备可以显示如图34所示的界面,该界面的球视图中包括与第一声源位置对应的传感器的第一位置信息(即手机的第一位置信息)。使得用户可以直观看出手机在球视图中的位置,进而方便用户确定第一声源位置。如果渲染设备的界面是在用户调整手机位置的过程中在球视图中显示该第一位置信息,可以实时根据手机的位置变化进行变化。
另外,上述举例描述了确定第一声源位置的一种举例,进一步的,用户还可以确定第二单对象音轨的第二声源位置,确定第二声源位置的方式与确定第一声源位置的方式类似,此处不再赘述。
可以理解的是,上述第一位置信息的两种方式只是一种举例,实际应用中,第一位置信息还可以有其他情况,具体此处不做限定。
该种方式中,通过将音频中发声对象对应的单对象音轨分离出来,并通过传感器的实际 位置信息作为声源位置来控制该发声对象对象并进行实时或者后期的动态渲染,发声对象的行动轨迹就可以简单地完全由我们来控制,大大增加编辑的灵活性。
2、参考位置信息是用户选择的第二位置信息。
渲染设备可以提供球视图供用户选择第二位置信息,该球视图的球心为用户所在的位置,球视图的半径为用户的位置与外放设备之间的距离。渲染设备获取用户在该球视图中选择的第二位置信息,将第二位置信息转化为第一声源位置。也可以理解为是渲染设备获取用户在球视图中选择某一点的第二位置信息,在将该点的第二位置信息转化为第一声源位置。该第二位置信息包括用户在球视图中的切面所选择点的二维坐标以及深度(即切面与球心的距离)。
示例性的,延续上述举例,若渲染设备显示如图27所示的界面,在用户确定渲染方式为互动渲染之后,渲染设备可以显示下拉菜单,该下拉菜单可以包括“朝向控制选项”、“位置控制选项”以及“界面控制选项”。如图35所示,用户可以点击“界面控制选项”112,进而确定渲染方式为互动渲染中的界面控制。另外在用户选择界面控制之后,渲染设备可以显示如图36所示的界面,该界面中,可以由“界面控制渲染”替换“选择渲染方式”,以提示用户现在的渲染方式为界面控制。
本申请实施例中的第二位置信息有多种情况,下面分别描述:
2.1、第二位置信息根据用户在垂直切面上选择获取。
渲染设备获取用户在垂直切面上选择点的二维坐标以及该点所在垂直切面与圆心的距离(以下简称深度),并利用下述公式六将该二维坐标以及深度转化为第一声源位置,该第一声源位置包括方位角、倾斜角以及外放设备与手机之间的距离。
示例性的,延续上述举例,进一步的,如果默认跳转垂直切面,如图37所示,渲染设备的界面右侧可以显示球视图、垂直切面、深度控制条。该深度控制条用于调节该垂直切面与球心的距离。用户可以在水平面上点击某一点(x,y)(如114所示),相应的右上角的球视图中会显示该点在球坐标系中的位置。另外,若默认跳转水平切面,用户可以点击球视图中的经线(如图37中的113所示),此时界面会显示如图37所示界面中的垂直切面。当然,用户也可以通过滑动操作调整垂直切面与球心的距离(如图37所示的115所示)。该第二位置信息包括该点的二维坐标(x,y)以及深度r。并利用公式六获取第一声源位置。
上述的公式六可以如下所述:
公式六:
$d=\sqrt{x^{2}+y^{2}+r^{2}},\quad \Phi=\arcsin\frac{y}{d},\quad \lambda=\left(\arctan\frac{x}{r}\right)\%360$
其中,x为用户在垂直切面选择点的横坐标,y为用户在垂直切面选择点的纵坐标、r为深度,λ为第i个外放设备的方位角,Φ为第i个外放设备的倾斜角,d为第i个外放设备 与手机之间的距离(也可以理解为是第i个外放设备与用户之间的距离);%360用于调整角度范围至0度-360度,例如:若
$\arctan\frac{x}{r}$
的角度为-60度,则
$\lambda$
代表-60+360=300度。
可以理解的是,上述公式六只是一种举例,实际应用中,公式六还可以有其他形式,具体此处不做限定。
2.2、第二位置信息根据用户在水平切面上选择获取。
渲染设备获取用户在水平切面上选择点的二维坐标以及该点所在水平切面与圆心的距离(以下简称深度),并利用下述公式七将该二维坐标以及深度转化为第一声源位置,该第一声源位置包括方位角、倾斜角以及外放设备与手机之间的距离。
示例性的,延续上述举例,进一步的,如果默认跳转水平切面,如图38所示,渲染设备的界面右侧可以显示球视图、水平切面、深度控制条。该深度控制条用于调节该水平切面与球心的距离。用户可以在水平面上点击某一点(x,y)(如117所示),相应的右上角的球视图中会显示该点在球坐标系中的位置。另外,若默认跳转垂直切面,用户可以点击球视图中的纬线(如图38中的116所示),此时界面会显示如图38所示界面中的水平切面。当然,用户也可以通过滑动操作调整水平切面与球心的距离(如图38所示的118所示)。该第二位置信息包括该点的二维坐标(x,y)以及深度r。并利用公式七获取第一声源位置。
上述的公式七可以如下所述:
公式七:
$d=\sqrt{x^{2}+y^{2}+r^{2}},\quad \Phi=\arcsin\frac{r}{d},\quad \lambda=\left(\arctan\frac{y}{x}\right)\%360$
其中,x为用户在垂直切面选择点的横坐标,y为用户在垂直切面选择点的纵坐标、r为深度,λ为第i个外放设备的方位角,Φ为第i个外放设备的倾斜角,d为第i个外放设备与手机之间的距离(也可以理解为是第i个外放设备与用户之间的距离);%360用于调整角度范围至0度-360度,例如:若
$\arctan\frac{y}{x}$
的角度为-50度,则
$\lambda$
代表-50+360=310度。
可以理解的是,上述公式七只是一种举例,实际应用中,公式七还可以有其他形式,具 体此处不做限定。
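示例性的,以"用户在切面上选择点(x,y)、切面与球心距离为r"为例,一种可行的换算方式是把(r,x,y)当作空间直角坐标再转换为球坐标。下面的Python片段即按此假设给出示意,并非公式六、公式七的逐字复现,坐标取值为示例:

    import math

    def vertical_slice_to_source(x, y, r):
        # 垂直切面:x为水平偏移,y为竖直偏移,r为切面到球心的深度
        d = math.sqrt(x * x + y * y + r * r)
        azimuth = math.degrees(math.atan2(x, r)) % 360
        elevation = math.degrees(math.asin(y / d))
        return azimuth, elevation, d

    def horizontal_slice_to_source(x, y, r):
        # 水平切面:x、y为水平面内的坐标,r为切面到球心的高度
        d = math.sqrt(x * x + y * y + r * r)
        azimuth = math.degrees(math.atan2(y, x)) % 360
        elevation = math.degrees(math.asin(r / d))
        return azimuth, elevation, d

    print(vertical_slice_to_source(0.3, 0.2, 1.0))
    print(horizontal_slice_to_source(0.3, 0.2, 1.0))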
可以理解的是,上述参考位置信息的两种方式只是一种举例,实际应用中,参考位置信息还可以有其他情况,具体此处不做限定。
该种方式中,用户可以通过球视图选择第二位置信息(例如点击、拖拽、滑动等第二操作)来控制选定的发声对象并进行实时或者后期的动态渲染,赋予其具体的空间方位以及运动,可以实现用户与音频进行交互创作,提供用户新的体验。此外,还可以在用户没有任何传感器的情况下对发声对象进行声像的编辑。
第二种,参考信息包括多媒体文件的媒体信息。
本申请实施例中的媒体信息包括:多媒体文件中需要显示的文字、多媒体文件中需要显示的图像、多媒体文件中音乐的音乐特征、第一发声对象对应的声源类型等中的至少一种,具体此处不做限定。
另外,基于多媒体文件中音乐的音乐特征或第一发声对象对应的声源类型确定第一发声对象的第一声源位置可以理解为是自动3D重混。基于多媒体文件中需要显示的位置文字或多媒体文件中需要显示的图像确定第一发声对象的第一声源位置可以理解为是多模态重混。下面分别描述:
示例性的,延续上述举例,渲染设备确定第一发声对象为人声,渲染设备可以显示如图26所示的界面,用户可以点击“选择渲染方式图标”107选择渲染方式,该渲染方式用于确定第一发声对象的第一声源位置。如图39所示,渲染设备响应于点击操作,可以显示下拉菜单,该下拉菜单可以包括“自动渲染选项”与“互动渲染选项”。其中,“互动渲染选项”与参考位置信息对应,“自动渲染选项”与媒体信息对应。进一步的,如图39所示,用户可以点击“自动渲染选项”119,确定渲染方式为自动渲染。
1、自动3D重混。
示例性的,如图39所示,用户可以点击“自动渲染选项”119,渲染设备可以响应于点击操作,显示如图40所示的下拉菜单,该下拉菜单可以包括“自动3D重混选项”以及“多模态重混选项”。进一步的,用户可以点击“自动3D重混选项”120,另外在用户选择自动3D重混之后,渲染设备可以显示如图41所示的界面,该界面中,可以由“自动3D重混”替换“选择渲染方式”,以提示用户现在的渲染方式为自动3D重混。
下面对自动3D重混的多个情况进行描述:
1.1、媒体信息包括多媒体文件中音乐的音乐特征。
本申请实施例中的音乐特征可以是指音乐结构、音乐情感、歌唱模式等中的至少一种。其中,音乐结构可以包括前奏、前奏人声、主歌、过渡段、或副歌等等;音乐情感包括欢快、悲伤、或惊恐等等;歌唱模式包括独唱、合唱、或伴唱等等。
渲染设备确定多媒体文件之后,可以分析多媒体文件的音轨(也可以理解为音频、歌曲等)中的音乐特征。当然,也可以通过人工方式或神经网络的方式识别音乐特征,具体此处不做限定。识别音乐特征之后,可以根据预设的关联关系确定音乐特征对应的第一声源位置,该关联关系为音乐特征与第一声源位置的关系。
示例性的,延续上述举例,渲染设备确定第一声源位置是环绕,渲染设备可以显示如图41所示的界面,该界面的球视图中会显示第一声源位置的运动轨迹。
如上所述,音乐结构一般可以包括前奏、前奏人声、主歌、过渡段、或副歌中的至少一个。下面以分析歌曲结构为例进行示意性说明。
可选地,对歌曲进行人声和乐器声分离,可以是人工分离也可以通过神经网络分离,具体此处不做限定。分离出人声之后,可以通过判断人声的静音段落和音高的方差来进行歌曲分割,具体步骤包括:若人声静音大于一定阈值(例如2秒),则认为段落结束,借此对歌曲的大段落进行划分。若第一个大段落中没有人声,则确定该大段落为乐器前奏;若第一个大段落中由人声,则确定第一大段落为人声前奏。并确定中间静音的大段落为过渡段。进一步的通过下述公式八计算每一个包括人声的大段落(后面称为人声大段落)的中心频率,并计算人声大段落中所有时刻中心频率的方差,依据方差对人声大段落进行排序,方差在前50%的人声大段落标记为副歌,后50%的人声大段落标记为主歌。换句话说,通过频率的波动大小确定歌曲的音乐特征,进而在后续渲染中,对于不同的大段落,可以通过预设的关联关系确定声源位置或者声源位置的运动轨迹,进而对歌曲的不同大段落进行渲染。
示例性的,若音乐特征是前奏,确定第一声源位置是绕用户上方一圈(或理解为环绕),先将多声道下混到单声道(如平均),然后在整个前奏阶段设置整个人声绕头一圈,每个时刻的速度根据人声能量(RMS或方差表征)大小来确定,能量越高,转速越快。若音乐特征为惊恐,确定第一声源位置是忽右忽左。若音乐特征是合唱,可以将左右声道人声扩展拉宽,增加延时。判断每个时间段的乐器数量,若存在乐器solo的情况,则在solo的时间段让乐器根据能量进行绕圈。
上述的公式八可以如下所述:
公式八:
$f_{c}=\frac{\sum_{n=0}^{N-1}f(n)\,x(n)}{\sum_{n=0}^{N-1}x(n)}$
其中,f c是人声大段落每1秒的中心频率,N是大段落的数量,N为正整数,0<n<N-1。f(n)是大段落对应的时域波形经过傅里叶变换获取的频域,x(n)是某个频率对应的能量。
可以理解的是,上述公式八只是一种举例,实际应用中,公式八还可以有其他形式,具体此处不做限定。
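示例性的,下面的Python片段先按公式八的思路逐秒计算人声大段落的中心频率,再按中心频率的方差对大段落排序,并将方差在前50%的段落标记为副歌。其中的段落波形为随机生成的示例数据,采样率、帧长等参数为示例取值:

    import numpy as np

    def centroid_per_second(segment, fs=16000, frame=16000):
        # 每1秒计算一次中心频率:sum(f(n)*x(n)) / sum(x(n))
        centroids = []
        for s in range(0, len(segment) - frame + 1, frame):
            spec = np.abs(np.fft.rfft(segment[s:s + frame]))    # x(n):能量
            freqs = np.fft.rfftfreq(frame, d=1.0 / fs)          # f(n):频率
            centroids.append(np.sum(freqs * spec) / (np.sum(spec) + 1e-8))
        return np.array(centroids)

    segments = [np.random.randn(16000 * n) for n in (4, 6, 5)]  # 三个人声大段落示例
    variances = [np.var(centroid_per_second(seg)) for seg in segments]

    order = np.argsort(variances)[::-1]           # 按方差从大到小排序
    labels = ["主歌"] * len(segments)
    for idx in order[:len(segments) // 2]:        # 方差在前50%的标记为副歌
        labels[idx] = "副歌"
    print(labels)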
该种方式中,根据音乐的音乐特征来对提取出来的特定发声对象进行方位和动态的设置,使得我们的3D渲染更加自然,艺术性也有更好地体现。
1.2、媒体信息包括第一发声对象对应的声源类型。
本申请实施例中的声源类型可以是人、乐器,也可以是鼓声、琴声等,在实际应用中,可以根据需要进行划分,具体此处不做限定。当然,渲染设备可以通过人工方式或神经网络的方式识别声源类型,具体此处不做限定。
识别声源类型之后,可以根据预设的关联关系确定声源类型对应的第一声源位置,该关联关系为声源类型与第一声源位置的关系(与前述音乐特征类似,此处不再赘述)。
可以理解的是,上述自动3D重混的两种方式只是一种举例,实际应用中,自动3D重混还可以有其他情况,具体此处不做限定。
2、多模态重混。
示例性的,如图42所示,用户可以通过点击“选择输入文件图标”121选取多媒体文件,这里的多媒体文件以“car.mkv”为例。也可以理解为,渲染设备接收用户的第四操作,渲染设备响应该第四操作,从存储区中选择“car.mkv”(即目标文件)为多媒体文件。其中,该存储区可以是渲染设备中的存储区,也可以是外接设备(例如U盘等)中的存储区,具体此处不做限定。在用户选取多媒体文件之后,渲染设备可以显示如图43所示的界面,该界面中,可以由“car.mkv”替换“选择输入文件”,以提示用户现在的多媒体文件是:car.mkv。另外,渲染设备可以利用图4所示实施例中的识别网络和/或分离网络,识别“car.mkv”中的发声对象以及分离出每个发声对象对应的单对象音轨。例如,渲染设备识别出“car.mkv”包括的发声对象为人、车、风声。如图43以及图44所示,渲染设备显示的界面还可以包括对象栏,对象栏中可以显示“人声图标”、“车图标”、“风声图标”等图标,供用户选择待渲染的发声对象。
下面对多模态重混的多个情况进行描述:
2.1、媒体信息包括多媒体文件中需要显示的图像。
可选地,渲染设备获取多媒体文件(含有图像或视频的音轨)之后,可以将视频拆分成帧图像(数量可以是一个或多个),基于帧图像获取第一发声对象的第三位置信息,并基于该第三位置信息获取第一声源位置,该第三位置信息包括第一发声对象在图像内的二维坐标以及深度。
可选地,基于该第三位置信息获取第一声源位置的具体步骤可以包括:将帧图像输入至检测网路,获取该帧图像中第一发声对象对应的追踪框信息(x 0,y 0,w 0,h 0),当然,也可以将帧图像以及第一发声对象作为检测网络的输入,检测网络输出该第一发声对象的追踪框信息。该追踪框信息包括该追踪框某一边角点的二维坐标(x 0,y 0)以及追踪框的高h 0、宽w 0。渲染设备利用公式九计算追踪框信息(x 0,y 0,w 0,h 0)获取追踪框中心点的坐标(x c,y c),然后将该追踪框中心点的坐标(x c,y c)输入至深度估计网络,获取该追踪框内每个点的相对深度
$\hat{z}(x,y)$,
再利用公式十计算追踪框内每个点的相对深度
$\hat{z}(x,y)$,
获取该追踪框内所有点的平均深度z c。基于图像的尺寸(高h 1、宽w 1)以及公式十一将前面获取到的(x c,y c,z c)归一化到[-1,1]下的(x norm,y norm,z norm),再基于播放设备信息以及公式十二获取第一声源位置。
上述的公式九至公式十二可以如下所述:
公式九:$x_{c}=x_{0}+\frac{w_{0}}{2},\quad y_{c}=y_{0}+\frac{h_{0}}{2}$
公式十:$z_{c}=\frac{1}{w_{0}h_{0}}\sum_{(x,y)\in\text{追踪框}}\hat{z}(x,y)$
公式十一:$x_{norm}=\frac{2x_{c}}{w_{1}}-1,\quad y_{norm}=\frac{2y_{c}}{h_{1}}-1,\quad z_{norm}=2z_{c}-1$
公式十二:$\lambda_{i}=x_{norm}\cdot\theta_{x\_max},\quad \Phi_{i}=y_{norm}\cdot\theta_{y\_max},\quad r_{i}=z_{norm}\cdot d_{y\_max}$
其中,(x 0,y 0)是追踪框某一边角点(例如左下角的边角点)的二维坐标,h 0为追踪框的高,w 0为追踪框的宽;h 1为图像的高,w 1为图像的宽w 1
$\hat{z}(x,y)$
为追踪框内每个点的相对深度,z c为追踪框内所有点的平均深度;θ x_max为播放设备(如果播放设备是N个外放设备,N个外放设备的播放设备信息是一样的)的最大水平角,θ y_max为播放设备的最大垂直角,d y_max为播放设备的最大深度;λ i为第i个外放设备的方位角,Φ i为第i个外放设备的倾斜角,r i为第i个外放设备与用户之间的距离。
可以理解的是,上述公式九至公式十二只是一种举例,实际应用中,公式九至公式十二还可以有其他形式,具体此处不做限定。
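示例性的,下面的Python片段按"追踪框中心点—归一化—映射到播放设备的最大水平角/垂直角/深度"的流程给出示意,其中相对深度用随机数组代替深度估计网络的输出,归一化与深度映射方式为示例性假设:

    import numpy as np

    def box_to_source(x0, y0, w0, h0, depth_map, w1, h1,
                      theta_x_max=120.0, theta_y_max=60.0, d_y_max=10.0):
        # 追踪框中心点坐标
        xc, yc = x0 + w0 / 2.0, y0 + h0 / 2.0
        # 追踪框内所有点相对深度的平均值
        zc = float(np.mean(depth_map[int(y0):int(y0 + h0), int(x0):int(x0 + w0)]))
        # 归一化到[-1, 1](假设相对深度取值在0~1之间)
        x_norm = 2.0 * xc / w1 - 1.0
        y_norm = 2.0 * yc / h1 - 1.0
        z_norm = 2.0 * zc - 1.0
        # 映射为第一声源位置:方位角、倾斜角、距离
        azimuth = x_norm * theta_x_max
        elevation = y_norm * theta_y_max
        distance = z_norm * d_y_max
        return azimuth, elevation, distance

    depth = np.random.rand(720, 1280)   # 深度估计网络输出的相对深度(示例)
    print(box_to_source(500, 300, 200, 120, depth, 1280, 720))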
示例性的,如图43所示,用户可以点击“多模态重混选项”122,渲染设备可以响应于点击操作,显示如图43的界面,该界面中的右侧包括“car.mkv”某一帧(例如第一帧)图像以及播放设备信息,该播放设备信息包括最大水平角、最大垂直角以及最大深度。若该播放设备是耳机,则用户可以输入播放设备信息。若该播放设备是外放设备,则用户可以输入播放设备信息或直接采用校准阶段获取的校准信息为播放设备信息,具体此处不做限定。另外,在用户选择多模态重混选项之后,渲染设备可以显示如图44所示的界面,该界面中,可以由“多模态重混”替换“选择渲染方式”,以提示用户现在的渲染方式为多模态重混。
在媒体信息包括多媒体文件中需要显示的图像的情况下,图像中第一发声对象的确定方 式有多种下面分别描述:
(1)第一发声对象由用户在对象栏中的点击确定。
可选地,渲染设备可以基于用户在对象栏中的点击确定第一发声对象。
示例性的,如图44所示,用户可以通过点击“车图标”123确定待渲染的发声对象是车。渲染设备在对右侧“car.mkv”中图像内显示该车的追踪框,进而获取第三位置信息,并通过公式九至公式十二将第三位置信息转化为第一声源位置。另外,该界面还包括追踪框左下角的边角点坐标(x 0,y 0)以及中心点的坐标(x c,y c)。示例性的,外放设备信息中的最大水平角为120度,最大垂直角为60度,最大深度为10(单位可以是米、分米等,具体此处不做限定)。
(2)第一发声对象由用户在图像上的点击确定。
可选地,渲染设备可以将用户在图像中第三操作(例如点击)所确定的发声对象作为第一发声对象。
示例性的,如图45所示,用户可以通过点击图像中的发声对象(如124所示)的方式确定第一发声对象。
(3)第一发声对象根据默认设置确定。
可选地,渲染设备可以通过该图像对应的音轨,识别出发声对象,可以对图像中的默认发声对象或所有发声对象进行跟踪,并确定第三位置信息。该第三位置信息包括图像中发声对象的二维坐标以及发声对象在图像中的深度。
示例性的,渲染设备可以在对象栏中会默认选择“合”,即图像中的所有发声对象均进行跟踪,并分别确定第一发声对象的第三位置信息。
可以理解的是,上述几种确定图像中的第一发声对象的方式只是举例,实际应用中,确定图像中的第一发声对象还可以有其他方式,具体此处不做限定。
该种方式中,结合音频视频图像的多模态,提取发声对象的坐标以及单对象音轨后,在耳机或外放环境下渲染3D沉浸感,做到了真正的音随画动,使得用户获取最优的音效体验。此外,选择发声对象后在整段视频里进行跟踪渲染对象音频的技术也可应用在专业的混音后期制作中,提升混音师的工作效率。通过将视频中音频的单对象音轨分离出来,并通过对视频图像中的发声对象的分析和跟踪,获取发声对象的运动信息,来对选定的发声对象并进行实时或者后期的动态渲染。实现视频画面与音频声源方向的匹配,提升用户体验。
2.2、媒体信息包括多媒体文件中需要显示的位置文字。
该种情况下,渲染设备可以基于多媒体文件中需要显示的位置文字确定第一声源位置,该位置文字用于指示第一声源位置。
可选地,该位置文字可以理解为是具有位置、方位等含义的文字,例如:风往北吹、天堂、地狱、前、后、左、右等,具体此处不做限定。当然,该位置文字具体可以是歌词、字幕、广告语等,具体此处不做限定。
可选地,可以基于强化学习或神经网络识别显示的位置文字的语义,进而根据该语义确定第一声源位置。
该种方式中,通过识别与位置相关的位置文字,在耳机或外放环境下渲染3D沉浸感,做到了与位置文字对应的空间感,使得用户获取最优的音效体验。
可以理解的是,上述媒体信息的两种方式只是一种举例,实际应用中,媒体信息还可以有其他情况,具体此处不做限定。
另外,在步骤1003中,对如何基于参考信息确定第一声源位置进行了多种情况的描述,在实际应用中,也可以通过组合的方式确定第一声源位置。例如:通过传感器朝向确定第一声源位置之后,再通过音乐特征确定该第一声源位置的运动轨迹。示例性的,如图46所示,渲染设备基于传感器的第一姿态角已确定人声的声源位置为图46界面中的右侧所示。进一步的,用户可以通过点击“人声图标”右侧菜单中的“转圈选项”125,进而确定人声的运动轨迹。也可以理解为先用传感器朝向确定某一时刻的第一声源位置,在使用音乐特征或预设规则确定该第一声源位置的运动轨迹为转圈。相应的,如图46所示,渲染设备的界面可以显示该发生对象的运动轨迹。
可选地,在上述确定第一声源位置的过程中,用户可以通过控制手机的音量键或在球视图上的点击、拖拽、滑动等来控制第一声源位置中的距离。
步骤1004、基于第一声源位置对第一单对象音轨进行空间渲染。
渲染设备确定第一声源位置之后,可以对第一单对象音轨进行空间渲染,获取渲染后的第一单对象音轨。
可选地,渲染设备基于第一声源位置对第一单对象音轨进行空间渲染,获取渲染后的第一单对象音轨。当然,渲染设备也可以基于第一声源位置对第一单对象音轨进行空间渲染,并基于第二声源位置对第二单对象音轨进行渲染,获取渲染后的第一单对象音轨以及第二单对象音轨。
可选地,在对原始音轨中多个发声对象对应的多个单对象音轨进行空间渲染的情况下,本申请实施例中确定声源位置的方法可以应用步骤1003中的多种方式的组合,具体此处不做限定。
示例性的,如图32所示,第一发声对象为人,第二发声对象为小提琴,第一发声对象对应的第一单对象音轨的第一声源位置可以采用互动渲染中的某一个方式,第二发声对象对应的第二单对象音轨的第二声源位置可以采用自动渲染中的某一个方式。对于第一声源位置以及第二声源位置的具体确定方式可以是前述步骤1003中的任意两种,当然,第一声源位置与第二声源位置的具体确定方式也可以采用相同的方式,具体此处不做限定。
另外,上述含有球视图的附图中,该球视图中还可以包括音量条,用户可以通过对该音量条进行手指滑动、鼠标拖拽、鼠标滑轮等操作控制第一单对象音轨的音量,提升渲染音轨的实时性。示例性的,如图47所示,用户可以调节音量条126从而对吉他对应的单对象音轨的音量进行调整。
根据播放设备类型的不同,本步骤的渲染方式可能有所不同,也可以理解为渲染设备基于第一声源位置以及播放设备的类型对原始音轨或第一单对象音轨进行空间渲染所采用的方法,根据播放设备类型的不同而不同,下面分别描述:
第一种,播放设备类型为耳机。
该种情况下,渲染设备确定第一声源位置之后,可以基于公式十三以及HRTF滤波器系数表对音轨进行渲染。该音轨可以是第一单对象音轨,也可以是二单对象音轨,还可以是第一单对象音轨以及第二单对象音轨,具体此处不做限定。其中,HRTF滤波器系数表用于表示声 源位置与系数之间的关联关系,也可以理解为一个声源位置对应一个HRTF滤波器系数。
上述的公式十三可以如下所述:
公式十三:
$\tilde{o}_{i}(t)=\sum_{s\in S}a_{s}(t)\int h_{i,s}(\tau)\,o_{s}(t-\tau)\,d\tau$
其中,$\tilde{o}_{i}(t)$
为渲染后的第一单对象音轨,S为多媒体文件的发声对象且包括第一发声对象,i指示左声道或右声道,a s(t)为t时刻下第一发声对象的调节系数,h i,s(t)为t时刻下第一发声对象对应的左声道或右声道的头相关传输函数HRTF滤波器系数,其中,t时刻下第一发声对象对应的左声道的HRTF滤波器系数与t时刻下第一发声对象对应的右声道的HRTF滤波器系数一般情况下不相同,HRTF滤波器系数与第一声源位置相关,o s(t)为t时刻下的第一单对象音轨,τ为积分项。
可以理解的是,上述公式十三只是一种举例,实际应用中,公式十三还可以有其他形式,具体此处不做限定。
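示例性的,下面的Python片段按公式十三的思路对一条单对象音轨做双耳渲染:先根据第一声源位置查得左右声道的HRTF滤波器系数,再与单对象音轨做卷积并乘以调节系数。其中hrtf_table为假设的系数表,滤波器取值仅用于说明流程,并非真实HRTF数据:

    import numpy as np

    # 假设的HRTF滤波器系数表:键为(方位角, 倾斜角),值为(左声道系数, 右声道系数)
    hrtf_table = {
        (30, 0): (np.array([0.9, 0.3, 0.1]), np.array([0.5, 0.2, 0.05])),
    }

    def render_binaural(o_s, a_s, azimuth, elevation):
        h_left, h_right = hrtf_table[(azimuth, elevation)]
        left = a_s * np.convolve(o_s, h_left, mode="full")[:len(o_s)]
        right = a_s * np.convolve(o_s, h_right, mode="full")[:len(o_s)]
        return left, right

    o_s = np.random.randn(16000)   # 第一单对象音轨(示例)
    left, right = render_binaural(o_s, a_s=1.0, azimuth=30, elevation=0)
    print(left.shape, right.shape)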
第二种,播放设备类型为外放设备。
该种情况下,渲染设备确定第一声源位置之后,可以基于公式十四对音轨进行渲染。该音轨可以是第一单对象音轨,也可以是二单对象音轨,还可以是第一单对象音轨以及第二单对象音轨,具体此处不做限定。
上述的公式十四可以如下:
公式十四:
$\tilde{o}_{i}(t)=\sum_{s\in S}a_{s}(t)\,g_{i,s}(t)\,o_{s}(t)$
其中,平移系数向量$g_{s}(t)=[g_{1,s}(t),\dots,g_{N,s}(t)]^{T}$根据各外放设备的方位角$\lambda_{i}$、倾斜角$\Phi_{i}$以及距离$r_{i}$求解获取;外放设备的数量可以是N个,$\tilde{o}_{i}(t)$
为渲染后的第一单对象音轨,i指示多声道中的第i个声道,S为多媒体文件的发声对象且包括第一发声对象,a s(t)为t时刻下第一发声对象的调节系数,g s(t)代表t时刻下第一发声对象的平移系数,o s(t)为t时刻下的第一单对象音轨,λ i为校准器(例如前述的传感器设备)校准第i个外放设备所获取的方位角,Φ i为校准器校准第i个外放设备所获取的倾斜角,r i为第i个外放设备与 校准器的距离,N为正整数,i为正整数且i≤N,第一声源位置在N个外放设备构成的四面体内。
另外,对于原始音轨的空间渲染,可以是对原始音轨中的某一个发声对象对应的单对象音轨进行渲染并替换例如上述公式中的S 1。也可以是对原始音轨中的某一个发声对象对应的单对象音轨进行复制增加后的渲染,例如上述公式中的S 2。当然,也可以是上述S 1与S 2的结合。
可以理解的是,上述公式十四只是一种举例,实际应用中,公式十四还可以有其他形式,具体此处不做限定。
为了方便理解N,请参阅图48,该图为外放设备在球坐标系中的架构示意图,其中,当发声对象的声源位置在4个外放设备构成的四面体内,则N=4,当发声对象的声源位置在3个外放设备构成的区域的面上,则N=3,如果在两个外放设备连线上则N=2,当直接指向一个外放设备的时候N=1。图48中的点由于在外放设备1与外放设备2之间的连线上,因此图48所示的N为2。
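示例性的,对于由外放设备方位求取平移系数,一种常见的替代做法是向量基振幅平移(VBAP):选取包含声源方向的外放设备组合,解一个线性方程组得到各声道增益。下面的Python片段按该思路给出示意,并非本申请公式十四的逐字复现,外放设备方位与声源方向均为示例取值:

    import numpy as np

    def unit_vec(azimuth_deg, elevation_deg):
        a, e = np.radians(azimuth_deg), np.radians(elevation_deg)
        return np.array([np.cos(e) * np.cos(a), np.cos(e) * np.sin(a), np.sin(e)])

    def vbap_gains(source_dir, speaker_dirs):
        # speaker_dirs为3个外放设备的单位方向向量;解 L^T g = p 得到增益g
        L = np.stack(speaker_dirs, axis=0)          # 3x3
        g = np.linalg.solve(L.T, source_dir)
        g = np.clip(g, 0, None)
        return g / (np.linalg.norm(g) + 1e-8)       # 归一化,保持能量大致不变

    # 校准获取的3个外放设备方向(方位角、倾斜角为示例取值)
    speakers = [unit_vec(30, 0), unit_vec(-30, 0), unit_vec(0, 45)]
    source = unit_vec(10, 20)                       # 第一声源位置对应的方向
    gains = vbap_gains(source, speakers)

    o_s = np.random.randn(16000)                    # 第一单对象音轨(示例)
    channels = [1.0 * g * o_s for g in gains]       # 调节系数a_s(t)取1.0,得到各声道信号
    print(gains)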
步骤1005、基于渲染后的第一单对象音轨获取目标音轨。
根据播放设备类型的不同,本步骤获取的目标音轨可能有所不同,也可以理解为渲染设备获取目标音轨所采用的方法,根据播放设备类型的不同而不同,下面分别描述:
第一种,播放设备类型为耳机。
该种情况下,渲染设备获取渲染后的第一单对象音轨和/或第二单对象音轨之后,可以基于公式十五以及渲染后的音轨获取目标音轨。该音轨可以是第一单对象音轨,也可以是二单对象音轨,还可以是第一单对象音轨以及第二单对象音轨,具体此处不做限定。
上述的公式十五可以如下:
公式十五:
$\tilde{X}_{i}(t)=X_{i}(t)-\sum_{s\in S_{1}}o_{i,s}(t)+\sum_{s\in S_{1}\cup S_{2}}a_{s}(t)\int h_{i,s}(\tau)\,o_{s}(t-\tau)\,d\tau$
其中,i指示左声道或右声道,$\tilde{X}_{i}(t)$为t时刻下的目标音轨,$X_{i}(t)$为t时刻下的原始音轨,$o_{i,s}(t)$为t时刻下未被渲染的第一单对象音轨,$a_{s}(t)\int h_{i,s}(\tau)o_{s}(t-\tau)d\tau$
为渲染后的第一单对象音轨,a s(t)为t时刻下第一发声对象的调节系数,h i,s(t)为t时刻下第一发声对象对应的左声道或右声道的头相关传输函数HRTF滤波器系数,HRTF滤波器系数与第一声源位置相关,o s(t)为t时刻下的第一单对象音轨,τ为积分项,S 1为原始音轨中需要被替换的发声对象,若第一发声对象是替换原始音轨中的发声对象,则S 1为空集;S 2为目标音轨相较于原始音轨增加的发声对象,若第一发声对象是复制原始音轨中的发声对象,则S 2为空集;S 1和/或S 2为多媒体文件的发声对象且包括第一发声对象。若S 2为空集合,可以理解为对音轨的空间渲染是替换发声对象,对该发声对象对应的单对象音轨进行空间渲染后,并用渲染后的单对象音轨替换多媒体文件中的初始单对象音轨。换句话说,目标音轨相比多媒体文件并没有多发声对象对应的单对象音轨,而是用渲染后的单对象音轨替换掉多媒体文件中的初始单对象音轨。
可以理解的是,上述公式十五只是一种举例,实际应用中,公式十五还可以有其他形式,具体此处不做限定。
第二种,播放设备类型为外放设备。
该种情况下,渲染设备获取渲染后的第一单对象音轨和/或第二单对象音轨之后,可以基于公式十六以及渲染后的音轨获取目标音轨。该音轨可以是第一单对象音轨,也可以是二单对象音轨,还可以是第一单对象音轨以及第二单对象音轨,具体此处不做限定。
上述的公式十六可以如下:
公式十六:
$\tilde{X}_{i}(t)=X_{i}(t)-\sum_{s\in S_{1}}o_{i,s}(t)+\sum_{s\in S_{1}\cup S_{2}}a_{s}(t)\,g_{i,s}(t)\,o_{s}(t)$
其中,平移系数向量$g_{s}(t)$根据各外放设备的方位角、倾斜角以及距离求解获取;外放设备的数量可以是N个,i指示多声道中的第i个声道,$\tilde{X}_{i}(t)$为t时刻下的目标音轨,$X_{i}(t)$为t时刻下的原始音轨,$o_{i,s}(t)$为t时刻下未被渲染的第一单对象音轨,$a_{s}(t)g_{i,s}(t)o_{s}(t)$
为渲染后的第一单对象音轨,a s(t)为t时刻下第一发声对象的调节系数,g s(t)代表t时刻下第一发声对象的平移系数,g i,s(t)代表g s(t)中的第i行,o s(t)为t时刻下的第一单对象音轨,S 1为原始音轨中需要被替换的发声对象,若第一发声对象是替换原始音轨中的发声对象,则S 1为空集;S 2为目标音轨相较于原始音轨增加的发声对象,若第一发声对象是复制原始音轨中的发声对象,则S 2为空集;S 1和/或S 2为多媒体文件的发声对象且包括第一发声对象,λ i为校准器校准第i个外放设备所获取的方位角,Φ i为校准器校准第i个外放设备所获取倾斜角,r i为第i个外放设备与校准器的距离,N为正整数,i为正整数且i≤N,第一声源位置在N个外放设备构成的四面体内。
可以理解的是,上述公式十六只是一种举例,实际应用中,公式十六还可以有其他形式,具体此处不做限定。
当然,也可以根据多媒体文件以及目标音轨生成新的多媒体文件,具体此处不做限定。
另外,用户在对第一单对象音轨进行渲染之后,可以将自己在对渲染过程中对于声源位 置的设置方式上传至前述图8对应的数据库模块中,方便其他用户采用该设置方式来渲染其他音轨。当然,用户也可以从数据库模块中下载设置方式并进行修改,方便用户对音轨的空间渲染。该种方式中,增加了针对渲染规则的修改以及不同用户间共享,一方面在多模态模式下可以避免针对同一文件的重复进行对象识别跟踪,降低端侧的开销;另一方面,可以将用户在互动模式下的自由创作分享给其他用户,进一步增强应用的交互性。
示例性的,如图49所示,用户可以选择将本地数据库中存储的渲染规则文件同步到用户的其他设备。如图50所示,用户可以选择将本地数据库中存储的渲染规则文件上传到云端,用于分享给其他用户,其他用户可以选择从云端数据库下载相应的渲染规则文件到端侧。
数据库中存储的元数据文件,主要用于自动模式下对系统分离出的发声对象或用户指定的对象进行渲染,或者混合模式下对用户指定的需要按照存储的渲染规则进行自动渲染的发声对象进行渲染。数据库中存储的元数据文件,可以是系统预制的,如表1。
表1
(表1以附图形式给出,列出了数据库中存储的元数据文件示例,各序号对应的含义见下文说明。)
其中,表1中的序号1、2;也可以是用户在使用本发明互动模式时创作产生的,如表1中的序号3-6;还可以是多模态模式下***自动识别视频画面中指定发声对象的运动轨迹后存储下来的,如表1中的序号7。元数据文件可以是与多媒体文件中的音频内容或多模态文件内容强相关的:例如表1中序号3为音频文件A1对应的元数据文件,序号4为音频文件A2对应的元数据文件;也可以是与多媒体文件解耦的:用户在使用互动模式下对音频文件A的对象X进行互动操作,将对象X的运动轨迹保存为对应的元数据文件(例如表1中的序号5自由盘旋上升状态),在下一次使用自动渲染的时候,用户可以从数据库模块中选择自由盘旋上升状态的元数据文件对音频文件B的对象Y进行渲染。
一种可能实现的方式中,本申请实施例提供的渲染方法包括步骤1001至步骤1005。另 一种可能实现的方式中,本申请实施例提供的渲染方法包括步骤1002至步骤1005。另一种可能实现的方式中,本申请实施例提供的渲染方法包括步骤1001至步骤1004。另一种可能实现的方式中,本申请实施例提供的渲染方法包括步骤1002至步骤1004。另外,本申请实施例中图10所示的各个步骤不限定时序关系。例如:上述方法中的步骤1001也可以在步骤1002之后,即在获取音轨之后,在对播放设备进行校准。
本申请实施例中,用户可以通过手机传感器对音频特定发声对象的声像,数量,音量进行控制、通过手机界面拖拽特定发声对象的声像,数量,音量进行控制、通过自动化的规则对音乐中的特定发声对象进行空间渲染提升空间性、通过多模态识别自动渲染声源位置、以及通过针对单发声对象渲染的方法,提供了一种与传统音乐电影交互模式完全不同的音效体验。为音乐欣赏提供了一种全新的交互方式。其中自动化3D重制作提升了双声道音乐的空间感,让听音提升了一个新的层次。另外分离的加入已经我们设计的交互方法,增强用户对音频的编辑能力,可应用于音乐,影视作品的发声对象制作,简单地进行特定发声对象运动信息的编辑。也增加用户对音乐的操控性与可玩性,让用户体验自己制作音频的乐趣,控制特定发声对象的能力。
本申请除了上述的训练方法以及渲染方法,还提供两种应用上述渲染方法的具体应用场景,下面分别描述:
第一种,“猎音人”游戏场景。
该场景也可以理解为是用户通过指向声源位置,并判断该用户的指向是否与实际声源位置符合,对用户的操作进行打分,提升用户的娱乐体验。
示例性的,延续上述举例,在用户确定多媒体文件并对单对象音轨进行渲染之后,如图51所示,用户可以点击“猎音人图标”126,进而进入到猎音人游戏场景,渲染设备可以显示如图51所示的界面,用户可以点击界面下方的播放按钮确定游戏开始,播放设备会按一定顺序以及任意位置播放至少一个单对象音轨。在播放设备播放钢琴的单对象音轨时,用户凭听觉判断声源位置,并持手机指向用户判断的声源位置,若用户手机指向的位置与该钢琴的单对象音轨的实际声源位置一致(或者误差在一定范围内),渲染设备可以显示如图51右侧界面的提示“击中第一种乐器,用时5.45秒,击败全宇宙99.33%的人”。另外,在用户指向某个发声对象的位置正确之后,对象栏中的相应发声对象可以从红色变为绿色。当然,若用户在一定时间段内都未正确指向声源位置,则可以显示失败。在第一个单对象音轨播放完毕的预设时间段(如图54中的时间间隔T)之后,播放下一个单对象音轨继续游戏,如图52以及图53所示。另外,在用户指向某个发声对象的位置错误之后,对象栏中的相应发声对象可以保持红色。以此类推(如图54所示),在用户按下界面下方的暂停或单对象音轨播放完毕之后,确认游戏结束。进一步的,在游戏结束后,若用户的几次指向都正确,渲染设备可以显示如图53所示的界面。
该场景下,通过用户与播放***的实时信息交互,实时在播放***中渲染对象音频的方位,设计成游戏可以使用户获取极致的“听音辨位”体验。可以应用于家庭娱乐、AR、VR游戏等。相较于现有技术中关于“听音辩位”的技术都只是针对一整首歌曲,本申请提供一种针对一首歌曲分离人声乐器后再进行的游戏。
第二种,多人互动场景。
该场景可以理解为是多个用户通过分别控制特定发声对象的声源位置,进而实现多人对音轨进行渲染,增加多个用户之间的娱乐与沟通。例如:该互动场景具体可以是线上多人组乐队或线上主播控制交响乐等。
示例性的,多媒体文件为多种乐器合奏的音乐,用户A可以选择多人交互模式,邀请用户B共同完成创作,每个用户可以选择不同的乐器作为交互的发声对象进行控制,并根据发声对象所对应的用户给出的渲染轨迹分别进行渲染后完成重混,然后将重混后的音频文件发送给参与的各个用户。不同用户选择的交互方式可以不同,具体此处不做限定。例如,如图55所示,用户A选择交互模式通过改变其使用的手机的朝向对对象A对的位置进行交互控制,用户B选择交互模式通过改变其使用的手机的朝向对对象B对的位置进行交互控制。如图56所示,***(渲染设备或云端服务器)可以将重新混音后的音频文件发送给参与多人交互应用的各个用户,该音频文件中对象A的位置和对象B的位置分别于用户A和用户B的操控对应。
上述中用户A与用户B的具体交互过程的一种示例:用户A选择输入多媒体文件,***识别输入文件中的对象信息,并通过UI界面反馈给用户A。用户A进行模式选择,若用户A选择了多人互动模式,用户A向***发送多人交互请求,并将指定邀请对象的信息发送给***。***响应于该请求,向用户A选择的用户B发送交互请求。用户B若接受该请求,发送接受指令给***以加入到用户A创建的多人交互应用。用户A和用户B分别选定操作的发声对象,并针对选定的发声对象采用上述渲染模式进行操控,并将对应的渲染规则文件。***通过分离网络进行单对象音轨的分离,并根据与发声对象对应的用户提供的渲染轨迹对分离后的单对象音轨分别进行渲染,再将渲染后的单对象音轨重新混合获取目标音轨,然后将目标音轨发送给参与的各个用户。
另外,多人互动模式可以是上面的例子中描述的实时在线的多人互动,也可以是离线情况下的多人互动。例如,用户A选择的多媒体文件为对唱音乐,包含演唱者A和演唱者B。图57所示,用户A可以选择交互模式对演唱者A的渲染效果进行操控,并将重新渲染后的目标音轨分享给用户B;用户B可以将收到的用户A分享的目标音轨作为输入文件,对演唱者B的渲染效果进行操控。其中,不同用户选择的交互方式可以相同或不同,具体此处不做限定。
可以理解的是,上述几种应用场景只是一种举例,实际应用中,还可以有其他应用场景,具体此处不做限定。
该种场景下,支持多人参与的实时以及非实时的互动渲染控制,用户可以邀请其他用户共同完成对多媒体文件的不同发声对象进行重新渲染创作,增强交互体验以及应用的趣味性。利用以上方式实现的多人协同进行不同对象声像控制,进而实现多人对多媒体文件的渲染。
上面对本申请实施例中的渲染方法进行了描述,下面对本申请实施例中的渲染设备进行描述,请参阅图58,本申请实施例中渲染设备的一个实施例包括:
获取单元5801,用于基于多媒体文件获取第一单对象音轨,第一单对象音轨与第一发声对象对应;
确定单元5802,用于基于参考信息确定第一发声对象的第一声源位置,参考信息包括参考位置信息和/或多媒体文件的媒体信息,参考位置信息用于指示第一声源位置;
渲染单元5803,用于基于第一声源位置对第一单对象音轨进行空间渲染,以获取渲染后 的第一单对象音轨。
本实施例中,渲染设备中各单元所执行的操作与前述图5至图11所示实施例中描述的类似,此处不再赘述。
本实施例中,获取单元5801基于多媒体文件获取第一单对象音轨,第一单对象音轨与第一发声对象对应;确定单元5802基于参考信息确定第一发声对象的第一声源位置,渲染单元5803基于第一声源位置对第一单对象音轨进行空间渲染,以获取渲染后的第一单对象音轨音轨。可以提升多媒体文件中第一发声对象对应的第一单对象音轨的立体空间感,为用户提供身临其境的立体音效。
请参阅图59,本申请实施例中渲染设备的另一个实施例包括:
获取单元5901,用于基于多媒体文件获取第一单对象音轨,第一单对象音轨与第一发声对象对应;
确定单元5902,用于基于参考信息确定第一发声对象的第一声源位置,参考信息包括参考位置信息和/或多媒体文件的媒体信息,参考位置信息用于指示第一声源位置;
渲染单元5903,用于基于第一声源位置对第一单对象音轨进行空间渲染,以获取渲染后的第一单对象音轨。
本实施例中的渲染设备还包括:
提供单元5904,用于提供球视图供用户选择,球视图的圆心为用户所在的位置,球视图的半径为用户的位置与播放设备的距离;
发送单元5905,用于向播放设备发送目标音轨,播放设备用于播放目标音轨。
本实施例中,渲染设备中各单元所执行的操作与前述图5至图11所示实施例中描述的类似,此处不再赘述。
本实施例中,获取单元5901基于多媒体文件获取第一单对象音轨,第一单对象音轨与第一发声对象对应;确定单元5902基于参考信息确定第一发声对象的第一声源位置,渲染单元5903基于第一声源位置对第一单对象音轨进行空间渲染,以获取渲染后的第一单对象音轨音轨。可以提升多媒体文件中第一发声对象对应的第一单对象音轨的立体空间感,为用户提供身临其境的立体音效。此外,提供了一种与传统音乐电影交互模式完全不同的音效体验。为音乐欣赏提供了一种全新的交互方式。其中自动化3D重制作提升了双声道音乐的空间感,让听音提升了一个新的层次。另外分离的加入已经我们设计的交互方法,增强用户对音频的编辑能力,可应用于音乐,影视作品的发声对象制作,简单地进行特定发声对象运动信息的编辑。也增加用户对音乐的操控性与可玩性,让用户体验自己制作音频的乐趣,控制特定发声对象的能力。
请参阅图60,本申请实施例中渲染设备的一个实施例包括:
获取单元6001,用于获取多媒体文件;
获取单元6001,还用于基于多媒体文件获取第一单对象音轨,第一单对象音轨与第一发声对象对应;
显示单元6002,用于显示用户界面,用户界面包括渲染方式选项;
确定单元6003,用于响应用户在用户界面的第一操作,从渲染方式选项中确定自动渲染方式或互动渲染方式;
获取单元6001,还用于当确定单元确定的是自动渲染方式时,基于预设方式获取渲染后的第一单对象音轨;或
获取单元6001,还用于当确定单元确定的是互动渲染方式时,响应于用户的第二操作以获得参考位置信息;基于参考位置信息确定第一发声对象的第一声源位置;基于第一声源位置对第一单对象音轨进行渲染,以获取渲染后的第一单对象音轨。
本实施例中,渲染设备中各单元所执行的操作与前述图5至图11所示实施例中描述的类似,此处不再赘述。
本实施例中,确定单元6003根据用户的第一操作,从渲染方式选项中确定自动渲染方式或互动渲染方式,一方面,获取单元6001可以基于用户的第一操作自动获取渲染后的第一单对象音轨。另一方面,可以通过渲染设备与用户之间的交互,实现多媒体文件中第一发声对象对应的音轨的空间渲染,为用户提供身临其境的立体音效。
参阅图61,本申请提供的另一种渲染设备的结构示意图。该渲染设备可以包括处理器6101、存储器6102和通信接口6103。该处理器6101、存储器6102和通信接口6103通过线路互联。其中,存储器6102中存储有程序指令和数据。
存储器6102中存储了前述图5至图11所示对应的实施方式中,由渲染设备执行的步骤对应的程序指令以及数据。
处理器6101,用于执行前述图5至图11所示实施例中任一实施例所示的由渲染设备执行的步骤。
通信接口6103可以用于进行数据的接收和发送,用于执行前述图5至图11所示实施例中任一实施例中与获取、发送、接收相关的步骤。
一种实现方式中,渲染设备可以包括相对于图61更多或更少的部件,本申请对此仅仅是示例性说明,并不作限定。
本申请实施例还提供了一种传感器设备,如图62所示,为了便于说明,仅示出了与本申请实施例相关的部分,具体技术细节未揭示的,请参照本申请实施例方法部分。该传感器设备可以为包括手机、平板电脑等任意终端设备,以传感器是手机为例:
图62示出的是与本申请实施例提供的传感器设备-手机的部分结构的框图。参考图62,手机包括:射频(radio frequency,RF)电路6210、存储器6220、输入单元6230、显示单元6240、传感器6250、音频电路6260、无线保真(wireless fidelity,WiFi)模块6270、处理器6280、以及电源6290等部件。本领域技术人员可以理解,图62中示出的手机结构并不构成对手机的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
下面结合图62对手机的各个构成部件进行具体的介绍:
RF电路6210可用于收发信息或通话过程中,信号的接收和发送,特别地,将基站的下行信息接收后,给处理器6280处理;另外,将设计上行的数据发送给基站。通常,RF电路6210包括但不限于天线、至少一个放大器、收发信机、耦合器、低噪声放大器(low noise amplifier,LNA)、双工器等。此外,RF电路6210还可以通过无线通信与网络和其他设备通信。上述无线通信可以使用任一通信标准或协议,包括但不限于全球移动通讯***(global system of mobile communication,GSM)、通用分组无线服务(general packet radio  service,GPRS)、码分多址(code division multiple access,CDMA)、宽带码分多址(wideband code division multiple access,WCDMA)、长期演进(long term evolution,LTE)、电子邮件、短消息服务(short messaging service,SMS)等。
存储器6220可用于存储软件程序以及模块,处理器6280通过运行存储在存储器6220的软件程序以及模块,从而执行手机的各种功能应用以及数据处理。存储器6220可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作***、至少一个功能所需的应用程序(比如声音播放功能、图像播放功能等)等;存储数据区可存储根据手机的使用所创建的数据(比如音频数据、电话本等)等。此外,存储器6220可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他易失性固态存储器件。
输入单元6230可用于接收输入的数字或字符信息,以及产生与手机的用户设置以及功能控制有关的键信号输入。具体地,输入单元6230可包括触控面板6231以及其他输入设备6232。触控面板6231,也称为触摸屏,可收集用户在其上或附近的触摸操作(比如用户使用手指、触笔等任何适合的物体或附件在触控面板6231上或在触控面板6231附近的操作),并根据预先设定的程式驱动相应的连接装置。可选的,触控面板6231可包括触摸检测装置和触摸控制器两个部分。其中,触摸检测装置检测用户的触摸方位,并检测触摸操作带来的信号,将信号传送给触摸控制器;触摸控制器从触摸检测装置上接收触摸信息,并将它转换成触点坐标,再送给处理器6280,并能接收处理器6280发来的命令并加以执行。此外,可以采用电阻式、电容式、红外线以及表面声波等多种类型实现触控面板6231。除了触控面板6231,输入单元6230还可以包括其他输入设备6232。具体地,其他输入设备6232可以包括但不限于物理键盘、功能键(比如音量控制按键、开关按键等)、轨迹球、鼠标、操作杆等中的一种或多种。
显示单元6240可用于显示由用户输入的信息或提供给用户的信息以及手机的各种菜单。显示单元6240可包括显示面板6241,可选的,可以采用液晶显示器(liquid crystal display,LCD)、有机发光二极管(organic light-emitting diode,OLED)等形式来配置显示面板6241。进一步的,触控面板6231可覆盖显示面板6241,当触控面板6231检测到在其上或附近的触摸操作后,传送给处理器6280以确定触摸事件的类型,随后处理器6280根据触摸事件的类型在显示面板6241上提供相应的视觉输出。虽然在图62中,触控面板6231与显示面板6241是作为两个独立的部件来实现手机的输入和输入功能,但是在某些实施例中,可以将触控面板6231与显示面板6241集成而实现手机的输入和输出功能。
手机还可包括至少一种传感器6250,比如光传感器、运动传感器以及其他传感器。具体地,光传感器可包括环境光传感器及接近传感器,其中,环境光传感器可根据环境光线的明暗来调节显示面板6241的亮度,接近传感器可在手机移动到耳边时,关闭显示面板6241和/或背光。作为运动传感器的一种,加速计传感器可检测各个方向上(一般为三轴)加速度的大小,静止时可检测出重力的大小及方向,可用于识别手机姿态的应用(比如横竖屏切换、相关游戏、磁力计姿态校准)、振动识别相关功能(比如计步器、敲击)等;至于手机还可配置的陀螺仪、气压计、湿度计、温度计、红外线、IMU、SLAM传感器等其他传感器,在此不再赘述。
音频电路6260、扬声器6262,传声器6262可提供用户与手机之间的音频接口。音频电路6260可将接收到的音频数据转换后的电信号,传输到扬声器6262,由扬声器6262转换为声音信号输出;另一方面,传声器6262将收集的声音信号转换为电信号,由音频电路6260接收后转换为音频数据,再将音频数据输出处理器6280处理后,经RF电路6210以发送给比如另一手机,或者将音频数据输出至存储器6220以便进一步处理。
WiFi属于短距离无线传输技术,手机通过WiFi模块6270可以帮助用户收发电子邮件、浏览网页和访问流式媒体等,它为用户提供了无线的宽带互联网访问。虽然图62示出了WiFi模块6270,但是可以理解的是,其并不属于手机的必须构成。
处理器6280是手机的控制中心,利用各种接口和线路连接整个手机的各个部分,通过运行或执行存储在存储器6220内的软件程序和/或模块,以及调用存储在存储器6220内的数据,执行手机的各种功能和处理数据,从而对手机进行整体监控。可选的,处理器6280可包括一个或多个处理单元;优选的,处理器6280可集成应用处理器和调制解调处理器,其中,应用处理器主要处理操作***、用户界面和应用程序等,调制解调处理器主要处理无线通信。可以理解的是,上述调制解调处理器也可以不集成到处理器6280中。
手机还包括给各个部件供电的电源6290(比如电池),优选的,电源可以通过电源管理***与处理器6280逻辑相连,从而通过电源管理***实现管理充电、放电、以及功耗管理等功能。
尽管未示出,手机还可以包括摄像头、蓝牙模块等,在此不再赘述。
在本申请实施例中,该手机所包括的处理器6280可以执行前述图5至图11所示实施例中的功能,此处不再赘述。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的***,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的***,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个***,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个 人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的术语在适当情况下可以互换,这仅仅是描述本申请的实施例中对相同属性的对象在描述时所采用的区分方式。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,以便包含一系列单元的过程、方法、***、产品或设备不必限于那些单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它单元。

Claims (38)

  1. 一种渲染方法,其特征在于,包括:
    基于多媒体文件获取第一单对象音轨,所述第一单对象音轨与第一发声对象对应;
    基于参考信息确定所述第一发声对象的第一声源位置,所述参考信息包括参考位置信息和/或所述多媒体文件的媒体信息,所述参考位置信息用于指示所述第一声源位置;
    基于所述第一声源位置对所述第一单对象音轨进行空间渲染,以获取渲染后的第一单对象音轨。
  2. 根据权利要求1所述的方法,其特征在于,所述媒体信息包括:所述多媒体文件中需要显示的文字、所述多媒体文件中需要显示的图像、所述多媒体文件中需要播放的音乐的音乐特征以及所述第一发声对象对应的声源类型中的至少一种。
  3. 根据权利要求1或2所述的方法,其特征在于,所述参考位置信息包括传感器的第一位置信息或用户选择的第二位置信息。
  4. 根据权利要求1至3中任一项所述的方法,其特征在于,所述方法还包括:
    确定播放设备的类型,所述播放设备用于播放目标音轨,所述目标音轨根据所述渲染后的第一单对象音轨获取;
    所述基于所述第一声源位置对所述第一单对象音轨进行空间渲染,包括:
    基于所述第一声源位置以及所述播放设备的类型对所述第一单对象音轨进行空间渲染。
  5. 根据权利要求2所述的方法,其特征在于,所述参考信息包括所述媒体信息,当所述媒体信息包括所述图像且所述图像包括所述第一发声对象时,所述基于参考信息确定所述第一发声对象的第一声源位置,包括:
    确定所述图像内所述第一发声对象的第三位置信息,所述第三位置信息包括所述第一发声对象在所述图像内的二维坐标以及深度;
    基于所述第三位置信息获取所述第一声源位置。
  6. 根据权利要求2或5所述的方法,其特征在于,所述参考信息包括所述媒体信息,当所述媒体信息包括所述多媒体文件中需要播放的音乐的音乐特征时,所述基于参考信息确定所述第一发声对象的第一声源位置,包括:
    基于关联关系与所述音乐特征确定所述第一声源位置,所述关联关系用于表示所述音乐特征与所述第一声源位置的关联。
  7. 根据权利要求2所述的方法,其特征在于,所述参考信息包括所述媒体信息,当所述媒体信息包括所述多媒体文件中需要显示的文字且所述文字包含有与位置相关的位置文字时,所述基于参考信息确定所述第一发声对象的第一声源位置,包括:
    识别所述位置文字;
    基于所述位置文字确定所述第一声源位置。
  8. 根据权利要求3所述的方法,其特征在于,所述参考信息包括参考位置信息,当所述参考位置信息包括所述第一位置信息时,所述基于参考信息确定所述第一发声对象的第一声源位置前,所述方法还包括:
    获取所述第一位置信息,所述第一位置信息包括所述传感器的第一姿态角以及所述传感器与播放设备之间的距离;
    所述基于参考信息确定所述第一发声对象的第一声源位置,包括:
    将所述第一位置信息转化为所述第一声源位置。
  9. 根据权利要求3所述的方法,其特征在于,所述参考信息包括参考位置信息,当所述参考位置信息包括所述第一位置信息时,所述基于参考信息确定所述第一发声对象的第一声源位置前,所述方法还包括:
    获取所述第一位置信息,所述第一位置信息包括所述传感器的第二姿态角以及所述传感器的加速度;
    所述基于参考信息确定所述第一发声对象的第一声源位置,包括:
    将所述第一位置信息转化为所述第一声源位置。
  10. 根据权利要求3所述的方法,其特征在于,所述参考信息包括参考位置信息,当所述参考位置信息包括所述第二位置信息时,所述基于参考信息确定所述第一发声对象的第一声源位置前,所述方法还包括:
    提供球视图供用户选择,所述球视图的圆心为所述用户所在的位置,所述球视图的半径为所述用户的位置与播放设备的距离;
    获取用户在所述球视图中选择的所述第二位置信息;
    所述基于参考信息确定所述第一发声对象的第一声源位置,包括:
    将所述第二位置信息转化为所述第一声源位置。
  11. 根据权利要求1至10任一项所述的方法,其特征在于,所述基于多媒体文件获取第一单对象音轨,包括:
    从所述多媒体文件中的原始音轨中分离出所述第一单对象音轨,所述原始音轨至少由所述第一单对象音轨以及第二单对象音轨合成获取,所述第二单对象音轨与第二发声对象对应。
  12. 根据权利要求11所述的方法,其特征在于,所述从所述多媒体文件中的原始音轨中分离出所述第一单对象音轨,包括:
    通过训练好的分离网络从所述原始音轨中分离出所述第一单对象音轨。
  13. 根据权利要求12所述的方法,其特征在于,所述训练好的分离网络是通过以训练数据作为所述分离网络的输入,以损失函数的值小于第一阈值为目标对分离网络进行训练获取,所述训练数据包括训练音轨,所述训练音轨至少由初始第三单对象音轨以及初始第四单对象音轨合成获取,所述初始第三单对象音轨与第三发声对象对应,所述初始第四单对象音轨与第四发声对象对应,所述第三发声对象与所述第一发声对象的属于相同类型,所述第二发声对象与所述第四发声对象的属于相同类型,所述分离网络的输出包括分离获取的第三单对象音轨;
    所述损失函数用于指示所述分离获取的第三单对象音轨与所述初始第三单对象音轨之间的差异。
  14. 根据权利要求4所述的方法,其特征在于,所述基于所述第一声源位置以及所述播放设备的类型对所述第一单对象音轨进行空间渲染,包括:
    若所述播放设备为耳机,通过如下公式获取所述渲染后的第一单对象音轨;
    $\tilde{o}_{i}(t)=\sum_{s\in S}a_{s}(t)\int h_{i,s}(\tau)\,o_{s}(t-\tau)\,d\tau$
    其中,$\tilde{o}_{i}(t)$
    为所述渲染后的第一单对象音轨,S为所述多媒体文件的发声对象且包括所述第一发声对象,i指示左声道或右声道,a s(t)为t时刻下所述第一发声对象的调节系数,h i,s(t)为t时刻下所述第一发声对象对应的所述左声道或所述右声道的头相关传输函数HRTF滤波器系数,所述HRTF滤波器系数与所述第一声源位置相关,o s(t)为所述t时刻下的所述第一单对象音轨,τ为积分项。
  15. 根据权利要求4所述的方法,其特征在于,所述基于所述第一声源位置以及所述播放设备的类型对所述第一单对象音轨进行空间渲染,包括:
    若所述播放设备为N个外放设备,通过如下公式获取所述渲染后的第一单对象音轨;
    $\tilde{o}_{i}(t)=\sum_{s\in S}a_{s}(t)\,g_{i,s}(t)\,o_{s}(t)$
    其中,平移系数向量$g_{s}(t)=[g_{1,s}(t),\dots,g_{N,s}(t)]^{T}$根据各外放设备的方位角、倾斜角以及距离求解获取;$\tilde{o}_{i}(t)$
    为所述渲染后的第一单对象音轨,i指示多声道中的第i个声道,S为所述多媒体文件的发声对象且包括所述第一发声对象,a s(t)为t时刻下所述第一发声对象的调节系数,g s(t)代表所述t时刻下所述第一发声对象的平移系数,o s(t)为所述t时刻下的所述第一单对象音轨,λ i为校准器校准第i个外放设备所获取的方位角,Φ i为所述校准器校准所述第i个外放设备所获取的倾斜角,r i为所述第i个外放设备与所述校准器的距离,N为正整数,i为正整数且i≤N,所述第一声源位置在所述N个外放设备构成的四面体内。
  16. 根据权利要求4所述的方法,其特征在于,所述方法还包括:
    基于所述渲染后的第一单对象音轨、所述多媒体文件中的原始音轨以及所述播放设备的类型,获取目标音轨;
    向所述播放设备发送所述目标音轨,所述播放设备用于播放所述目标音轨。
  17. 根据权利要求16所述的方法,其特征在于,所述基于所述渲染后的第一单对象音轨、所述多媒体文件中的原始音轨以及所述播放设备的类型,获取目标音轨,包括:
    若所述播放设备的类型为耳机,通过如下公式获取所述目标音轨:
    $\tilde{X}_{i}(t)=X_{i}(t)-\sum_{s\in S_{1}}o_{i,s}(t)+\sum_{s\in S_{1}\cup S_{2}}a_{s}(t)\int h_{i,s}(\tau)\,o_{s}(t-\tau)\,d\tau$
    其中,i指示左声道或右声道,$\tilde{X}_{i}(t)$为t时刻下的所述目标音轨,$X_{i}(t)$为所述t时刻下的所述原始音轨,$o_{i,s}(t)$为所述t时刻下未被渲染的所述第一单对象音轨,$a_{s}(t)\int h_{i,s}(\tau)o_{s}(t-\tau)d\tau$
    为所述渲染后的第一单对象音轨,a s(t)为所述t时刻下所述第一发声对象的调节系数,h i,s(t)为所述t时刻下所述第一发声对象对应的所述左声道或所述右声道的头相关传输函数HRTF滤波器系数,所述HRTF滤波器系数与所述第一声源位置相关,o s(t)为所述t时刻下的所述第一单对象音轨,τ为积分项,S 1为所述原始音轨中需要被替换的发声对象,若所述第一发声对象是替换所述原始音轨中的发声对象,则S 1为空集;S 2为所述目标音轨相较于所述原始音轨增加的发声对象,若所述第一发声对象是复制的所述原始音轨中的发声对象,则S 2为空集;S 1和/或S 2为所述多媒体文件的发声对象且包括所述第一发声对象。
  18. 根据权利要求16所述的方法,其特征在于,所述基于所述渲染后的第一单对象音轨、所述多媒体文件中的原始音轨以及所述播放设备的类型,获取目标音轨,包括:
    若所述播放设备的类型为N个外放设备,通过如下公式获取所述目标音轨:
    $\tilde{X}_{i}(t)=X_{i}(t)-\sum_{s\in S_{1}}o_{i,s}(t)+\sum_{s\in S_{1}\cup S_{2}}a_{s}(t)\,g_{i,s}(t)\,o_{s}(t)$
    其中,平移系数向量$g_{s}(t)$根据各外放设备的方位角、倾斜角以及距离求解获取;i指示多声道中的第i个声道,$\tilde{X}_{i}(t)$为t时刻下的所述目标音轨,$X_{i}(t)$为所述t时刻下的所述原始音轨,$o_{i,s}(t)$为所述t时刻下未被渲染的所述第一单音轨,$a_{s}(t)g_{i,s}(t)o_{s}(t)$
    为所述渲染后的第一单对象音轨,a s(t)为所述t时刻下所述第一发声对象的调节系数,g s(t)代表所述t时刻下所述第一发声对象的平移系数,g i,s(t)代表g s(t)中的第i行,o s(t)为所述t时刻下的所述第一单对象音轨,S 1为所述原始音轨中需要被替换的发声对象,若所述第一发声对象是替换所述原始音轨中的发声对象,则S 1为空集;S 2为所述目标音轨相较于所述原始音轨增加的发声对象,若所述第一发声对象是复制所述原 始音轨中的发声对象,则S 2为空集;S 1和/或S 2为所述多媒体文件的发声对象且包括所述第一发声对象,λ i为校准器校准第i个外放设备所获取的方位角,Φ i为所述校准器校准所述第i个外放设备所获取倾斜角,r i为所述第i个外放设备与所述校准器的距离,N为正整数,i为正整数且i≤N,所述第一声源位置在所述N个外放设备构成的四面体内。
  19. 一种渲染设备,其特征在于,包括:
    获取单元,用于基于多媒体文件获取第一单对象音轨,所述第一单对象音轨与第一发声对象对应;
    确定单元,用于基于参考信息确定所述第一发声对象的第一声源位置,所述参考信息包括参考位置信息和/或所述多媒体文件的媒体信息,所述参考位置信息用于指示所述第一声源位置;
    渲染单元,用于基于所述第一声源位置对所述第一单对象音轨进行空间渲染,以获取渲染后的第一单对象音轨。
  20. 根据权利要求19所述的渲染设备,其特征在于,所述媒体信息包括:所述多媒体文件中需要显示的文字、所述多媒体文件中需要显示的图像、所述多媒体文件中需要播放的音乐的音乐特征以及所述第一发声对象对应的声源类型中的至少一种。
  21. 根据权利要求19或20所述的渲染设备,其特征在于,所述参考位置信息包括传感器的第一位置信息或用户选择的第二位置信息。
  22. 根据权利要求19至21中任一项所述的渲染设备,其特征在于,所述确定单元,还用于确定播放设备的类型,所述播放设备用于播放目标音轨,所述目标音轨根据所述渲染后的第一单对象音轨获取;
    所述渲染单元,具体用于基于所述第一声源位置以及所述播放设备的类型对所述第一单对象音轨进行空间渲染。
  23. 根据权利要求20所述的渲染设备,其特征在于,所述参考信息包括所述媒体信息,当所述媒体信息包括所述图像且所述图像包括所述第一发声对象时,所述确定单元,具体用于确定所述图像内所述第一发声对象的第三位置信息,所述第三位置信息包括所述第一发声对象在所述图像内的二维坐标以及深度;
    所述确定单元,具体用于基于所述第三位置信息获取所述第一声源位置。
  24. 根据权利要求20或23所述的渲染设备,其特征在于,所述参考信息包括所述媒体信息,当所述媒体信息包括所述多媒体文件中需要播放的音乐的音乐特征时,所述确定单元,具体用于基于关联关系与所述音乐特征确定所述第一声源位置,所述关联关系用于表示所述音乐特征与所述第一声源位置的关联。
  25. 根据权利要求20所述的渲染设备,其特征在于,所述参考信息包括所述媒体信息,当所述媒体信息包括所述多媒体文件中需要显示的文字且所述文字包含有与位置相关的位置文字时,所述确定单元,具体用于识别所述位置文字;
    所述确定单元,具体用于基于所述位置文字确定所述第一声源位置。
  26. 根据权利要求21所述的渲染设备,其特征在于,所述参考信息包括参考位置信息, 当所述参考位置信息包括所述第一位置信息时,所述获取单元,还用于获取所述第一位置信息,所述第一位置信息包括所述传感器的第一姿态角以及所述传感器与播放设备之间的距离;
    所述确定单元,具体用于将所述第一位置信息转化为所述第一声源位置。
  27. 根据权利要求21所述的渲染设备,其特征在于,所述参考信息包括参考位置信息,当所述参考位置信息包括所述第一位置信息时,所述获取单元,还用于获取所述第一位置信息,所述第一位置信息包括所述传感器的第二姿态角以及所述传感器的加速度;
    所述确定单元,具体用于将所述第一位置信息转化为所述第一声源位置。
  28. 根据权利要求21所述的渲染设备,其特征在于,所述参考信息包括参考位置信息,当所述参考位置信息包括所述第二位置信息时,所述渲染设备还包括:
    提供单元,用于提供球视图供用户选择,所述球视图的圆心为所述用户所在的位置,所述球视图的半径为所述用户的位置与播放设备的距离;
    所述获取单元,还用于获取用户在所述球视图中选择的所述第二位置信息;
    所述确定单元,具体用于将所述第二位置信息转化为所述第一声源位置。
  29. 根据权利要求19至28任一项所述的渲染设备,其特征在于,所述获取单元,具体用于从所述多媒体文件中的原始音轨中分离出所述第一单对象音轨,所述原始音轨至少由所述第一单对象音轨以及第二单对象音轨合成获取,所述第二单对象音轨与第二发声对象对应。
  30. 根据权利要求29所述的渲染设备,其特征在于,所述获取单元,具体用于通过训练好的分离网络从所述原始音轨中分离出所述第一单对象音轨。
  31. 根据权利要求30所述的渲染设备,其特征在于,所述训练好的分离网络是通过以训练数据作为所述分离网络的输入,以损失函数的值小于第一阈值为目标对分离网络进行训练获取,所述训练数据包括训练音轨,所述训练音轨至少由初始第三单对象音轨以及初始第四单对象音轨合成获取,所述初始第三单对象音轨与第三发声对象对应,所述初始第四单对象音轨与第四发声对象对应,所述第三发声对象与所述第一发声对象的属于相同类型,所述第二发声对象与所述第四发声对象的属于相同类型,所述分离网络的输出包括分离获取的第三单对象音轨;
    所述损失函数用于指示所述分离获取的第三单对象音轨与所述初始第三单对象音轨之间的差异。
  32. 根据权利要求22所述的渲染设备,其特征在于,若所述播放设备为耳机,所述获取单元,具体用于通过如下公式获取所述渲染后的第一单对象音轨;
    $\tilde{o}_{i}(t)=\sum_{s\in S}a_{s}(t)\int h_{i,s}(\tau)\,o_{s}(t-\tau)\,d\tau$
    其中,$\tilde{o}_{i}(t)$
    为所述渲染后的第一单对象音轨,S为所述多媒体文件的发声对象且包括所述第一发声对象,i指示左声道或右声道,a s(t)为t时刻下所述第一发声对象的调节系数,h i,s(t)为t时刻下的所述第一发声对象对应的所述左声道或所述右声道的头相关传输函数HRTF滤波器系数,所述HRTF滤波器系数与所述第一声源位置相关,o s(t)为所述t时刻下的所述第一单对象音轨,τ为积分项。
  33. The rendering device according to claim 22, wherein if the playback device is N loudspeakers, the obtaining unit is specifically configured to obtain the rendered first single-object audio track according to the following formula:
    \tilde{o}_i(t) = \sum_{s \in S} a_s(t) g_{i,s}(t) o_s(t)
    where the panning coefficients g_s(t) = [g_{1,s}(t), ..., g_{N,s}(t)]^T are determined by the calibrated loudspeaker positions (λ_i, Φ_i, r_i);
    where \tilde{o}_i(t) is the rendered first single-object audio track, i indicates the i-th channel of the multiple channels, S is the set of sounding objects of the multimedia file and includes the first sounding object, a_s(t) is the adjustment coefficient of the first sounding object at a moment t, g_s(t) represents the panning coefficient of the first sounding object at the moment t, o_s(t) is the first single-object audio track at the moment t, λ_i is the azimuth obtained by a calibrator when calibrating the i-th loudspeaker, Φ_i is the tilt angle obtained by the calibrator when calibrating the i-th loudspeaker, r_i is the distance between the i-th loudspeaker and the calibrator, N is a positive integer, i is a positive integer and i ≤ N, and the first sound source position is located within a tetrahedron formed by the N loudspeakers.
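    Claim 33 only states that the panning coefficients follow from the calibrated loudspeaker directions (λ_i, Φ_i, r_i) and that the source lies inside the tetrahedron formed by the loudspeakers. Vector-base amplitude panning (VBAP) is one common way to realize such coefficients; the sketch below uses that technique as an assumption, not as the claimed formula.

    import numpy as np

    def speaker_unit_vector(azimuth_deg, tilt_deg):
        az, el = np.radians(azimuth_deg), np.radians(tilt_deg)
        return np.array([np.cos(el) * np.cos(az),
                         np.cos(el) * np.sin(az),
                         np.sin(el)])

    def panning_gains(source_direction, speaker_directions):
        # source_direction:   (3,) unit vector toward the sound source
        # speaker_directions: (3, 3) matrix whose columns are unit vectors toward
        #                     three of the calibrated loudspeakers
        gains = np.linalg.solve(speaker_directions, source_direction)   # solve L g = p
        gains = np.clip(gains, 0.0, None)                # keep gains non-negative
        return gains / (np.linalg.norm(gains) + 1e-12)   # energy normalization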
  34. The rendering device according to claim 22, wherein the obtaining unit is further configured to obtain a target audio track based on the rendered first single-object audio track and the original audio track in the multimedia file; and
    the rendering device further comprises:
    a sending unit, configured to send the target audio track to the playback device, wherein the playback device is configured to play the target audio track.
  35. The rendering device according to claim 34, wherein if the playback device is a headphone, the obtaining unit is specifically configured to obtain the target audio track according to the following formula:
    \hat{X}_i(t) = X_i(t) - \sum_{s \in S_1} \hat{o}_{i,s}(t) + \sum_{s \in S_1 \cup S_2} \tilde{o}_{i,s}(t)
    where
    \tilde{o}_{i,s}(t) = a_s(t) \int h_{i,s}(\tau) o_s(t - \tau) \, d\tau
    where i indicates the left channel or the right channel, \hat{X}_i(t) is the target audio track at a moment t, X_i(t) is the original audio track at the moment t, \hat{o}_{i,s}(t) is the un-rendered first single-object audio track at the moment t, \tilde{o}_{i,s}(t) is the rendered first single-object audio track, a_s(t) is the adjustment coefficient of the first sounding object at the moment t, h_{i,s}(t) is the HRTF filter coefficient of the left channel or the right channel corresponding to the first sounding object at the moment t, the HRTF filter coefficient is related to the first sound source position, o_s(t) is the first single-object audio track at the moment t, \tau is the integration variable, S_1 is the sounding object that needs to be replaced in the original audio track, and if the first sounding object replaces a sounding object in the original audio track, S_1 is an empty set; S_2 is the sounding object added in the target audio track relative to the original audio track, and if the first sounding object is a duplicate of a sounding object in the original audio track, S_2 is an empty set; and S_1 and/or S_2 are sounding objects of the multimedia file and include the first sounding object.
  36. The rendering device according to claim 34, wherein if the playback device is N loudspeakers, the obtaining unit is specifically configured to obtain the target audio track according to the following formula:
    \hat{X}_i(t) = X_i(t) - \sum_{s \in S_1} \hat{o}_{i,s}(t) + \sum_{s \in S_1 \cup S_2} \tilde{o}_{i,s}(t)
    where
    \tilde{o}_{i,s}(t) = a_s(t) g_{i,s}(t) o_s(t)
    and the panning coefficients g_s(t) = [g_{1,s}(t), ..., g_{N,s}(t)]^T are determined by the calibrated loudspeaker positions (λ_i, Φ_i, r_i);
    where i indicates the i-th channel of the multiple channels, \hat{X}_i(t) is the target audio track at a moment t, X_i(t) is the original audio track at the moment t, \hat{o}_{i,s}(t) is the un-rendered first single-object audio track at the moment t, \tilde{o}_{i,s}(t) is the rendered first single-object audio track, a_s(t) is the adjustment coefficient of the first sounding object at the moment t, g_s(t) represents the panning coefficient of the first sounding object at the moment t, g_{i,s}(t) represents the i-th row of g_s(t), o_s(t) is the first single-object audio track at the moment t, S_1 is the sounding object that needs to be replaced in the original audio track, and if the first sounding object replaces a sounding object in the original audio track, S_1 is an empty set; S_2 is the sounding object added in the target audio track relative to the original audio track, and if the first sounding object is a duplicate of a sounding object in the original audio track, S_2 is an empty set; S_1 and/or S_2 are sounding objects of the multimedia file and include the first sounding object; λ_i is the azimuth obtained by a calibrator when calibrating the i-th loudspeaker, Φ_i is the tilt angle obtained by the calibrator when calibrating the i-th loudspeaker, r_i is the distance between the i-th loudspeaker and the calibrator, N is a positive integer, i is a positive integer and i ≤ N, and the first sound source position is located within a tetrahedron formed by the N loudspeakers.
  37. A rendering device, comprising a processor, wherein the processor is coupled to a memory, the memory is configured to store a program or instructions, and when the program or instructions are executed by the processor, the rendering device is enabled to perform the method according to any one of claims 1 to 18.
  38. A computer-readable storage medium, wherein the computer-readable storage medium stores instructions, and when the instructions are run on a computer, the computer is enabled to perform the method according to any one of claims 1 to 18.
PCT/CN2022/087353 2021-04-29 2022-04-18 Rendering method and related device WO2022228174A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2023565286A JP2024515736A (ja) 2021-04-29 2022-04-18 Rendering method and related device
EP22794645.6A EP4294026A1 (en) 2021-04-29 2022-04-18 Rendering method and related device
US18/498,002 US20240064486A1 (en) 2021-04-29 2023-10-30 Rendering method and related device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110477321.0 2021-04-29
CN202110477321.0A CN115278350A (zh) 2021-04-29 2021-04-29 Rendering method and related device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/498,002 Continuation US20240064486A1 (en) 2021-04-29 2023-10-30 Rendering method and related device

Publications (1)

Publication Number Publication Date
WO2022228174A1 true WO2022228174A1 (zh) 2022-11-03

Family

ID=83745121

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/087353 WO2022228174A1 (zh) 2021-04-29 2022-04-18 一种渲染方法及相关设备

Country Status (5)

Country Link
US (1) US20240064486A1 (zh)
EP (1) EP4294026A1 (zh)
JP (1) JP2024515736A (zh)
CN (1) CN115278350A (zh)
WO (1) WO2022228174A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106448687A (zh) * 2016-09-19 2017-02-22 中科超影(北京)传媒科技有限公司 音频制作及解码的方法和装置
CN109983786A (zh) * 2016-11-25 2019-07-05 索尼公司 再现装置、再现方法、信息处理装置、信息处理方法以及程序
CN110972053A (zh) * 2019-11-25 2020-04-07 腾讯音乐娱乐科技(深圳)有限公司 构造听音场景的方法和相关装置
CN111526242A (zh) * 2020-04-30 2020-08-11 维沃移动通信有限公司 音频处理方法、装置和电子设备
CN112037738A (zh) * 2020-08-31 2020-12-04 腾讯音乐娱乐科技(深圳)有限公司 一种音乐数据的处理方法、装置及计算机存储介质
CN112291615A (zh) * 2020-10-30 2021-01-29 维沃移动通信有限公司 音频输出方法、音频输出装置

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9980078B2 (en) * 2016-10-14 2018-05-22 Nokia Technologies Oy Audio object modification in free-viewpoint rendering
US10178490B1 (en) * 2017-06-30 2019-01-08 Apple Inc. Intelligent audio rendering for video recording
US10872115B2 (en) * 2018-03-19 2020-12-22 Motorola Mobility Llc Automatically associating an image with an audio track

Also Published As

Publication number Publication date
JP2024515736A (ja) 2024-04-10
US20240064486A1 (en) 2024-02-22
CN115278350A (zh) 2022-11-01
EP4294026A1 (en) 2023-12-20

Legal Events

Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22794645; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 2022794645; Country of ref document: EP)
ENP Entry into the national phase (Ref document number: 2022794645; Country of ref document: EP; Effective date: 20230914)
WWE Wipo information: entry into national phase (Ref document number: 2023565286; Country of ref document: JP)
NENP Non-entry into the national phase (Ref country code: DE)