WO2017188141A1 - Audio signal processing device, audio signal processing method, and audio signal processing program - Google Patents

Audio signal processing device, audio signal processing method, and audio signal processing program Download PDF

Info

Publication number
WO2017188141A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
channel
component
target channel
coherent
Prior art date
Application number
PCT/JP2017/016019
Other languages
French (fr)
Japanese (ja)
Inventor
安藤 彰男 (Akio Ando)
Original Assignee
University of Toyama (国立大学法人富山大学)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Toyama (国立大学法人富山大学)
Priority to JP2018514561A (patent JP6846822B2)
Publication of WO2017188141A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control

Definitions

  • One aspect of the present invention relates to an audio signal processing device, an audio signal processing method, and an audio signal processing program.
  • Methods for changing the number of channels of an audio signal are known. Specifically, there is a method called upmixing, which converts an M-channel audio signal into an N-channel audio signal (where N > M), and a method called downmixing, which converts an N-channel audio signal into an M-channel audio signal. For example, converting a 2-channel (left-channel and right-channel) audio signal into a 5.1-channel audio signal is an example of upmixing, while converting a 5.1-channel audio signal into a 2-channel audio signal is an example of downmixing.
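As a concrete illustration of downmixing (background context, not part of this document's invention), a common 5.1-to-stereo downmix in the spirit of ITU-R BS.775 combines the channels with fixed gains. The channel ordering, the 1/√2 gains, and the convention of dropping the LFE channel are our assumptions for this sketch:

```python
import math

def downmix_51_to_stereo(L, R, C, LFE, Ls, Rs):
    """Downmix six 5.1 channels (equal-length lists of samples) to stereo.

    The centre and surround channels are attenuated by 1/sqrt(2) (~0.707);
    the LFE channel is commonly dropped in this kind of downmix, as here.
    """
    g = 1.0 / math.sqrt(2.0)
    left = [l + g * c + g * ls for l, c, ls in zip(L, C, Ls)]
    right = [r + g * c + g * rs for r, c, rs in zip(R, C, Rs)]
    return left, right
```

For example, a signal present only in the centre channel appears equally in both stereo outputs at about 0.707 of its original level.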
  • Patent Document 1 describes a surround playback device that gives a stereo broadcast of a live sports television or radio program a powerful sense of presence while keeping the announcer's voice easy to listen to.
  • The apparatus has front left/right channel signal creation means, front center channel signal creation means, and rear left/right surround channel signal creation means.
  • The front left/right channel signal creation means selectively adds reverberant sound to the front left/right channel audio signals obtained by matrix processing of the 2-channel audio input, adjusts the front volume, and outputs the results as the front left/right channel audio signals.
  • The front center channel signal creation means adjusts the center volume of the audio signal obtained by extracting the in-phase component from the 2-channel audio input, without adding reverberant sound, and outputs it as the front center channel audio signal.
  • The rear left/right surround channel signal creation means adds reverberant sound to the audio signals obtained by matrix processing, adjusts the rear volume, and outputs the results as the rear left/right channel audio signals.
  • Non-Patent Document 1 describes a method of dividing a stereo signal into bands, dividing the stereo signal into a main signal and an ambience signal for each band, and reproducing the ambience signal from the rear channel of 5.1 channels.
  • Non-Patent Document 2 describes a method of dividing a stereo signal into bands and then dividing the stereo signal into a direct sound component and a reverberation sound component and reproducing the reverberation sound component from the side.
  • Non-Patent Documents 3 and 4 each disclose a method of generating an audio signal of three or more channels by dividing a multi-channel audio signal into a pair of two-channel audio signals.
  • The methods of Non-Patent Documents 1 and 2 do not add reverberant sound, but in principle they can be applied only to 2-channel audio signals (that is, stereo signals).
  • In the methods of Non-Patent Documents 3 and 4, a component having a high correlation between a pair of 2-channel audio signals is extracted as a coherent component, so only information on sound located near the midpoint of the two corresponding speakers is acquired. Therefore, in an audio system with three or more channels, only sound information near the midpoint of some pair of speakers can be extracted as a coherent component; information on sound located in the central part of the region surrounded by all the speakers cannot be extracted.
  • An audio signal processing device includes a receiving unit that receives audio signals of a plurality of channels, and a dividing unit that executes, for each channel, a division process that divides the audio signal into a coherent component and a field component.
  • When the one channel that is the target of the division process is called the target channel, the division process includes:
  • extracting, from among estimated signals calculated using at least the audio signals of channels other than the target channel, the estimated signal having the highest correlation with the audio signal of the target channel as the coherent component of the target channel; and extracting the difference between the audio signal of the target channel and the coherent component of the target channel as the field component of the target channel.
  • An audio signal processing method includes an accepting step in which an audio signal processing device receives audio signals of a plurality of channels, and a dividing step in which the audio signal processing device executes, for each channel, a division process that divides the audio signal into a coherent component and a field component.
  • When the one channel that is the target of the division process is called the target channel, the division process includes: extracting, from among estimated signals calculated using at least the audio signals of channels other than the target channel, the estimated signal having the highest correlation with the audio signal of the target channel as the coherent component of the target channel; and extracting the difference between the audio signal of the target channel and the coherent component of the target channel as the field component of the target channel.
  • The method further includes an output step in which the audio signal processing device outputs the coherent component and the field component of each channel extracted in the dividing step.
  • An audio signal processing program causes a computer to execute a reception step of receiving audio signals of a plurality of channels, a dividing step of executing, for each channel, a division process that divides the audio signal into a coherent component and a field component, and an output step of outputting the coherent component and the field component of each channel extracted in the dividing step.
  • When the one channel that is the target of the division process is called the target channel, the division process includes extracting, from among estimated signals calculated using at least the audio signals of channels other than the target channel, the estimated signal having the highest correlation with the audio signal of the target channel as the coherent component of the target channel, and extracting the difference between the audio signal of the target channel and the coherent component of the target channel as the field component of the target channel.
  • In these aspects, a signal that is estimated using the audio signals of channels other than the target channel, and that has the highest correlation with the actual audio signal of the target channel, is extracted as the coherent component of the target channel. Further, the difference between the actual audio signal of the target channel and its coherent component is extracted as the field component of the target channel.
  • These coherent and field components are obtained for each channel. By obtaining the coherent component and field component of each channel using only the original audio signal, without adding any sound, the atmosphere of the original sound can be maintained as much as possible.
  • Moreover, this method can be applied regardless of the number of channels of the original sound.
  • Accordingly, the atmosphere of the original sound can be maintained as much as possible when the number of channels of the audio signal is changed, regardless of the number of channels of the original sound.
  • FIG. 1 is a diagram showing an example of audio signal processing according to the embodiment. FIG. 2 is a diagram showing the hardware configuration of a computer that functions as the audio signal processing device according to the embodiment. FIG. 3 is a diagram showing the functional configuration of the audio signal processing device according to the embodiment. FIG. 4 is a diagram showing a block, which is a unit for processing an audio signal. FIG. 5 is a diagram showing processing in one channel. FIG. 6 is a flowchart showing the operation of the audio signal processing device.
  • the audio signal processing apparatus 10 is a computer that divides each audio signal of a plurality of channels into a coherent component and a field component.
  • the audio signal is a digital signal including sound in a frequency band (generally about 20 Hz to 20000 Hz) that can be heard by humans, and is converted into an analog signal as necessary. Examples of sound represented by the audio signal include, but are not limited to, voice, music, video sound, natural sound, or any combination thereof.
  • FIG. 1 shows an example of audio signal processing by the audio signal processing apparatus 10, and more specifically shows processing of two channels (L channel and R channel), that is, stereo audio signals.
  • the audio signal processing apparatus 10 divides each channel signal into a coherent component and a field component.
  • a coherent component of one channel is a component having a high correlation with an audio signal of another channel.
  • the field component of one channel is the difference between the audio signal of the channel (ie, the original signal) and the coherent component of the channel. More specifically, the field component is a component obtained by subtracting the coherent component from the audio signal.
  • The coherent component is a sound having a clear direction, whereas the field component is an ambient sound having a diffuse nature.
  • the sound corresponding to the field component is also referred to as “field sound”.
  • FIG. 1 shows the audio signal processing device 10 dividing the L-channel audio signal into an L-channel coherent component and field component, and the R-channel audio signal into an R-channel coherent component and field component.
  • The L-channel coherent component is a component having a high correlation with the R-channel audio signal,
  • and the R-channel coherent component is a component having a high correlation with the L-channel audio signal.
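The decomposition of FIG. 1 can be illustrated for the two-channel case as follows. This is our own minimal sketch (a single least-squares gain over the whole signal, with no subbands or blocks), not the patent's full method:

```python
def coherent_and_field(target, other):
    """Split `target` into a coherent component and a field component.

    The coherent component is the least-squares estimate of `target` from
    `other`, i.e. a*other where the single gain a minimises
    sum((target[n] - a*other[n])**2); the field component is the residual.
    """
    den = sum(o * o for o in other)
    a = sum(t * o for t, o in zip(target, other)) / den if den else 0.0
    coherent = [a * o for o in other]
    field = [t - c for t, c in zip(target, coherent)]
    return coherent, field
```

With `target = [2, 4, 6]` and `other = [1, 2, 3]` the gain is 2, so the whole target is coherent and the field component is zero; with orthogonal (uncorrelated) channels the gain is 0 and the whole target is field.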
  • FIG. 1 shows the processing of a two-channel audio signal, but the audio signal processing apparatus 10 may process an arbitrary number of audio signals.
  • the audio signal processing apparatus 10 may process audio signals of three or more channels.
  • the audio signal processing apparatus 10 may process 22.2 channel audio signals for 8K Super Hi-Vision.
  • multi-channel audio signals are recorded by a plurality of microphones arranged in a three-dimensional space.
  • The audio signals of a plurality of channels are recorded in such a manner that a plurality of target sounds (object sounds) are mixed with each other, or a target sound is mixed with field sound.
  • Because the distance from a sound source differs among the individual microphones, the time at which a specific sound arrives differs from microphone to microphone, and as a result the coherence of the recorded audio signals becomes low.
  • If the coherent component can be extracted from the audio signal of each channel, the clarity of the sound and the apparent source width (ASW) can be improved. Further, by extracting the field component and using it for upmixing, a good ambience effect (the feeling that sound surrounds the listener) can be produced.
  • The coherent component corresponds to a target sound emitted from a main sound source (for example, a singing voice, an instrument sound, or sound emitted from a loudspeaker), while the field component corresponds to sound whose direction is not clear (for example, echoes and applause).
  • The audio signal x_l(n) of the l-th channel is the sum of the coherent sound c_l(n) and the field sound v_l(n); that is, the audio signal x_l(n) is expressed by Equation (1):
    x_l(n) = c_l(n) + v_l(n)   (1)
  • The coherent component x̂_l(n) of the audio signal x_l(n) is expressed by Equation (2).
  • The field component x̄_l(n) of the audio signal x_l(n) is expressed by Equation (3):
    x̄_l(n) = x_l(n) − x̂_l(n)   (3)
  • the specific method for realizing the audio signal processing apparatus 10 is not limited.
  • the audio signal processing apparatus 10 may be realized by installing a predetermined program (for example, an audio signal processing program P1 described later) in a computer such as a personal computer, a server, or a portable terminal.
  • an audio device such as an amplifier may function as the audio signal processing device 10.
  • FIG. 2 shows a general hardware configuration of the computer 100 functioning as the audio signal processing apparatus 10.
  • The computer 100 includes a processor (for example, a CPU) 101 that executes an operating system, application programs, and the like; a main storage unit 102 composed of a ROM and a RAM; an auxiliary storage unit 103 composed of a hard disk, flash memory, and the like;
  • a communication control unit 104 composed of a network card or a wireless communication module; an input device 105 such as a keyboard and mouse; and an output device 106 such as a monitor.
  • Each functional element of the audio signal processing device 10 is realized by loading predetermined software (for example, the audio signal processing program P1 described later) onto the processor 101 or into the main storage unit 102 and executing it.
  • the processor 101 operates the communication control unit 104, the input device 105, or the output device 106 in accordance with the software, and reads and writes data in the main storage unit 102 or the auxiliary storage unit 103. Data or a database necessary for processing is stored in the main storage unit 102 or the auxiliary storage unit 103.
  • the audio signal processing apparatus 10 may be composed of one computer or a plurality of computers. When a plurality of computers are used, one audio signal processing apparatus 10 is logically constructed by connecting these computers via a communication network such as the Internet or an intranet.
  • FIG. 3 shows a functional configuration of the audio signal processing apparatus 10.
  • the audio signal processing apparatus 10 includes a receiving unit 11, a dividing unit 12, and an output unit 13 as functional components.
  • the reception unit 11 is a functional element that receives audio signals of a plurality of channels. “Accepting an audio signal” means that the audio signal processing apparatus 10 acquires an audio signal by an arbitrary method. In other words, “accepting an audio signal” means that the audio signal is input to the audio signal processing apparatus 10.
  • a specific method for receiving the audio signal of each channel is not limited.
  • The reception unit 11 may receive an audio signal by accessing a database or another device and reading out an audio signal data file. Alternatively, the reception unit 11 may receive an audio signal sent from another device via a communication network, or may acquire an audio signal input to the audio signal processing device 10. In any case, the reception unit 11 outputs the received audio signal of each channel to the dividing unit 12.
  • The dividing unit 12 is a functional element that divides the audio signal of each channel into a coherent component and a field component. The following description assumes that the dividing unit 12 processes the N-channel audio signals {x_l(n) | l = 1, …, N} expressed by Equation (4).
  • First, the dividing unit 12 divides the audio signal of each channel into signals of a plurality of time sections. Specifically, the dividing unit 12 divides the audio signal into short time sections (referred to as “frames”) using a window function (for example, a Kaiser-Bessel window). For example, if 1024 frequency points are used in the modified discrete cosine transform (MDCT) described later, the dividing unit 12 divides the audio signal into a plurality of frames using a Kaiser-Bessel window with a length of 2048 points. Usually, the number of samples in one frame is determined so as to obtain an appropriate frequency resolution, but that number of samples is not sufficient for estimating the coherent component.
  • Therefore, the dividing unit 12 treats a plurality of consecutive frames (for example, 24 frames) as the signal of one time section (referred to as a “block”).
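The segmentation into frames and blocks can be sketched as follows. This is a simplified illustration: the windowing/MDCT stage is omitted, trailing samples that do not fill a frame are dropped, and only the frame length (1024) and frames-per-block count (24) follow the text:

```python
def frames_to_blocks(signal, frame_len=1024, frames_per_block=24):
    """Cut `signal` (a list of samples) into frames of `frame_len` samples,
    then group consecutive frames into blocks of `frames_per_block` frames.
    Trailing samples that do not fill a whole frame are dropped."""
    n_frames = len(signal) // frame_len
    frames = [signal[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]
    return [frames[i:i + frames_per_block]
            for i in range(0, n_frames, frames_per_block)]
```

For a 30-frame signal this yields one full 24-frame block followed by a partial 6-frame block.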
  • FIG. 4 shows the concept of such block generation. More specifically, FIG. 4 shows a process of dividing each of two-channel (L channel and R channel) audio signals into a plurality of blocks.
  • the dividing unit 12 executes the following processing for each block of each channel.
  • a channel that is a target for dividing an audio signal into a coherent component and a field component (that is, a target of division processing) is referred to as a “target channel”.
  • processing in a certain target channel will be described.
  • the dividing unit 12 extracts a coherent component of the target channel, and then extracts a field component of the target channel.
  • FIG. 5 shows the concept of extraction of coherent components corresponding to the first half of the series of processes.
  • The dividing unit 12 divides the audio signal x_l(n) of the l-th channel, which is the target channel, into K frequency-band signals (referred to as “subband signals”).
  • The dividing unit 12 uses the least squares method for this extraction.
  • The dividing unit 12 extracts the coherent component x̂_l(n) of the target channel by adding the coherent components of all the subbands. Thereafter, the dividing unit 12 extracts the field component x̄_l(n) by subtracting the coherent component x̂_l(n) from the original audio signal x_l(n).
  • the dividing unit 12 executes the following processing for each block of the audio signal of the target channel.
  • First, the dividing unit 12 divides the audio signal x_l(n) of each channel into K subband signals x_l^(k)(n) using a filter bank. This division is expressed by Equation (5).
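A real implementation would use a proper analysis filter bank; as a toy illustration of the property that the K subband signals sum back to the original signal, here is a two-band split built from a first-order moving average and its complement (our own simplification, not the patent's filter bank):

```python
def two_band_split(x):
    """Toy two-band analysis: a first-order moving-average lowpass and its
    complement. By construction the two subband signals sum back to the
    input sample-for-sample."""
    low = [(x[n] + (x[n - 1] if n > 0 else 0.0)) / 2.0 for n in range(len(x))]
    high = [xv - lv for xv, lv in zip(x, low)]
    return low, high
```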
  • Since the audio signal processing device 10 uses time-domain subband signals, it can process a signal consisting of an arbitrary number of consecutive frames as one block signal, which extends the estimation section length. As a result, the audio signal of each channel can be processed without impairing the sound quality of the obtained coherent component.
  • Next, the dividing unit 12 estimates the subband signal x_l^(k)(n) from a linear combination of the subband signals {x_m^(k)(n) | m = 1, …, l−1, l+1, …, N} in the same band (same subband) of the N−1 channels other than the target channel. This linear combination, for a given block, is expressed by Equation (6).
  • The estimated signal can be regarded as a component having a high correlation with the signals in the same band of the other channels (the N−1 channels other than the target channel).
  • The estimation error e_l^(k)(n) between the subband signal of the target channel and the estimated signal is expressed by Equation (7).
  • The dividing unit 12 obtains the coefficients {a_m^(k) | m = 1, …, l−1, l+1, …, N} that minimize the estimation error by the least squares method.
  • The error function to be minimized is given by Equation (8).
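The least-squares step above can be sketched as follows. This is an illustrative reconstruction, not the patent's code: for one subband of one block, the coefficients a_m minimising the squared estimation error are found from the normal equations, solved here by plain Gaussian elimination (all function names are ours):

```python
def gauss_solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with
    partial pivoting (A is a list of rows, b a list of floats)."""
    n = len(b)
    M = [row[:] + [bv] for row, bv in zip(A, b)]  # augmented matrix
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))  # pivot row
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def ls_coefficients(target, others):
    """Coefficients {a_m} minimising sum_n (target[n] - sum_m a_m*others[m][n])**2,
    obtained from the normal equations R a = r, where R is the Gram matrix of
    the other channels' signals and r their correlations with the target."""
    R = [[sum(u * v for u, v in zip(om, ol)) for ol in others] for om in others]
    r = [sum(u * t for u, t in zip(om, target)) for om in others]
    return gauss_solve(R, r)
```

For a target that is exactly 2×(channel 1) + 3×(channel 2), the recovered coefficients are 2 and 3, the estimation error is zero, and the whole target is coherent.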
  • The coherent component x̂_l^(k)(n) of the target channel in the k-th subband is obtained by Equation (12).
  • This coherent component x̂_l^(k)(n) corresponds to the estimated signal having the highest correlation with the audio signal of the target channel among the estimated signals calculated using the audio signals of channels other than the target channel.
  • the dividing unit 12 obtains coherent components for all subbands. Then, the dividing unit 12 obtains the coherent component of the target channel by adding the coherent components of all the subbands. This process is expressed by equation (13).
  • the dividing unit 12 obtains the field component of the target channel by subtracting the coherent component from the original audio signal of the target channel. This processing is expressed by the above formula (3).
  • Alternatively, the dividing unit 12 may obtain a field component by subtracting the coherent component from the audio signal in each subband, and obtain the field component of the target channel by adding the field components of all subbands. Specifically, the field component x̄_l^(k)(n) of the target channel in the k-th subband is obtained by Equation (14), and the field component x̄_l(n) of the target channel is obtained by Equation (15).
  • the dividing unit 12 performs the above processing on each block of the audio signal of the target channel. Then, the dividing unit 12 extracts the coherent component of the target channel by connecting the coherent components of all blocks. Further, the dividing unit 12 generates the field component of the target channel by concatenating the field components of all blocks.
  • the dividing unit 12 generates a coherent component and a field component for all channels by setting each of a plurality of channels as a target channel and executing the above processing. Then, the division unit 12 outputs the coherent components and field components of all channels to the output unit 13.
  • Note that the dividing unit 12 divides the audio signal of each channel into a coherent component and a field component without adding another signal to the audio signal of any channel (that is, without adding another sound to the original sound).
  • the output unit 13 is a functional element that outputs the coherent component and field component of each channel generated by the dividing unit 12 as a processing result.
  • This processing result can be regarded as an upmix from N channels to 2N channels.
  • the output method of the processing result is not limited at all.
  • the output unit 13 may store the processing result in a storage device such as a memory or a database, or may transmit the processing result to another device via a communication network.
  • the output unit 13 may output the coherent component and field component of each channel to a corresponding speaker.
  • This makes it possible to use existing audio material for the production of content having a larger number of channels, or to reproduce it with an audio system having a larger number of channels.
  • The audio signal processing device 10 may also upmix an N-channel audio signal to more than 2N channels. Specifically, the audio signal processing device 10 generates signals with different inter-channel correlations by decorrelating the extracted field components using the technique described in the following reference, thereby obtaining more than N field components. For example, stereo audio material can be converted into 5.1-channel audio material and reproduced with higher presence using a 5.1-channel audio system. Alternatively, 5.1-channel audio material can be converted into 22.2-channel audio material and reproduced with higher presence using a 22.2-channel audio system. (Reference) J. Breebaart and C. Faller, “Spatial Audio Processing: MPEG Surround and Other Applications,” Wiley, 2007.
  • Alternatively, the audio signal processing device 10 may upmix the N-channel audio signal into J channels, where N < J < 2N. Specifically, the audio signal processing device 10 realizes the upmix from N channels to J channels by mixing the N field components.
  • the processing result by the audio signal processing apparatus 10 can be used not only for upmixing but also for downmixing.
  • the reception unit 11 receives audio signals of a plurality of channels (reception step).
  • the dividing unit 12 executes a dividing process for dividing each audio signal into a coherent component and a field component for each channel (dividing step).
  • the output unit 13 outputs the coherent component and field component of each channel (output step).
  • Next, a particularly important process of the dividing unit 12, the dividing step, will be described in detail.
  • FIG. 6 shows a process of generating a coherent component and a field component of one target channel.
  • the dividing unit 12 divides the audio signal of each channel into a plurality of blocks (step S11). Note that by storing the audio signal of each channel and each block divided in step S11, step S11 can be omitted when processing the second and subsequent target channels.
  • Next, the dividing unit 12 sets one of the plurality of blocks of the target channel as the processing target (step S12). Subsequently, the dividing unit 12 extracts, from among the estimated signals calculated using the audio signals of channels other than the target channel, the estimated signal having the highest correlation with the audio signal of the target channel as the coherent component of the target channel (step S13). Subsequently, the dividing unit 12 extracts the difference between the audio signal of the target channel and its coherent component as the field component of the target channel (step S14). By this processing, the dividing unit 12 obtains the coherent component and field component of one block of the target channel.
  • If an unprocessed block remains, the process proceeds to the next block (see step S15). That is, the dividing unit 12 sets the next block as the processing target (step S12) and generates the coherent component and field component of that block (steps S13 and S14).
  • the dividing unit 12 executes the processing of steps S12 to S14 for all blocks, and generates coherent components and field components of all blocks (YES in step S15). Then, the dividing unit 12 obtains the final coherent component of the target channel by concatenating the coherent components of all blocks, and obtains the final field component of the target channel by concatenating the field components of all blocks.
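Steps S12 to S15 can be sketched for the simplest two-channel, single-band case (an illustrative reduction, not the patent's implementation): each block's coherent component is the least-squares projection of the target block onto the other channel's block, the field component is the residual, and the per-block results are concatenated:

```python
def split_channel_blockwise(target, other, block_len=4):
    """For each block: project the target block onto the other channel's
    block (one least-squares gain), take the projection as the coherent
    component and the residual as the field component, then concatenate
    the per-block results over the whole signal."""
    coherent, field = [], []
    for s in range(0, len(target), block_len):
        t, o = target[s:s + block_len], other[s:s + block_len]
        den = sum(v * v for v in o)
        a = sum(u * v for u, v in zip(t, o)) / den if den else 0.0
        c = [a * v for v in o]
        coherent.extend(c)
        field.extend(u - v for u, v in zip(t, c))
    return coherent, field
```

By construction, the coherent and field components add back to the original signal in every block, and the block-wise gain adapts to each time section separately.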
  • FIG. 7 shows details of the processing in step S13 in FIG. 6, that is, details of processing for generating a coherent component of the target channel.
  • the process shown in FIG. 7 is executed for each block of the audio signal of the target channel.
  • First, the dividing unit 12 generates a plurality of subband signals by dividing the block signal of each channel (the target channel and all other channels) into a plurality of subbands (step S131). Next, the dividing unit 12 sets one of the plurality of subbands as the processing target (step S132). Subsequently, the dividing unit 12 extracts, from among the estimated signals calculated using the subband signals of channels other than the target channel, the estimated signal having the highest correlation with the subband signal of the target channel as the coherent component of the target channel in that subband (step S133). The dividing unit 12 executes the processing of steps S132 and S133 for all subbands (see step S134).
  • After obtaining the coherent components of all subbands, the dividing unit 12 adds them to generate the coherent component of the target channel (more specifically, the coherent component for one block) (step S135).
  • the audio signal processing program P1 includes a main module P10, a reception module P11, a division module P12, and an output module P13.
  • the main module P10 is a part that performs overall processing of audio signals.
  • the functions realized by executing the reception module P11, the division module P12, and the output module P13 are the same as the functions of the reception unit 11, the division unit 12, and the output unit 13, respectively.
  • the audio signal processing program P1 may be provided after being fixedly recorded on a tangible recording medium such as a CD-ROM, DVD-ROM, or semiconductor memory. Alternatively, the audio signal processing program P1 may be provided via a communication network as a data signal superimposed on a carrier wave.
  • An audio signal processing device according to one aspect includes a receiving unit that receives audio signals of a plurality of channels, and a dividing unit that executes, for each channel, a division process that divides the audio signal into a coherent component and a field component.
  • When the one channel that is the target of the division process is called the target channel, the division process includes extracting, from among the estimated signals calculated using at least the audio signals of channels other than the target channel, the estimated signal having the highest correlation with the audio signal of the target channel as the coherent component of the target channel, and extracting the difference between the audio signal of the target channel and the coherent component of the target channel as the field component of the target channel.
  • The device further includes an output unit that outputs the coherent component and the field component of each channel extracted by the dividing unit.
  • In these aspects, a signal that is estimated using the audio signals of channels other than the target channel and that has the highest correlation with the actual audio signal of the target channel is extracted as the coherent component of the target channel, and the difference between the actual audio signal of the target channel and its coherent component is extracted as the field component of the target channel.
  • These coherent and field components are obtained for every channel. Because they are determined using only the original audio signals, without adding any sound, the atmosphere of the original sound (for example, its original timbre) can be maintained as much as possible.
  • In addition, since coherent and field components are obtained for as many channels as the original signal has, the method can be applied regardless of the number of channels of the original sound. For example, one aspect of the present invention can be applied to audio signals having any number of channels, such as 2, 3, 5.1, or 22.2 channels.
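The division described above can be sketched in a few lines. This is a minimal illustration, not the patent's exact estimator: it assumes the estimated signal is a linear combination of the other channels, in which case the combination with the highest correlation to the target channel is its least-squares projection onto those channels.

```python
import numpy as np

def split_coherent_field(x):
    """x: (channels, samples) array. Returns (coherent, field) of the same shape."""
    n_ch, _ = x.shape
    coherent = np.zeros_like(x)
    for l in range(n_ch):
        others = np.delete(x, l, axis=0).T        # (samples, channels - 1)
        target = x[l]
        # Least-squares weights: among linear combinations of the other
        # channels, the projection onto their span correlates most highly
        # with the target channel's actual signal.
        w, *_ = np.linalg.lstsq(others, target, rcond=None)
        coherent[l] = others @ w
    field = x - coherent                           # field = original - coherent
    return coherent, field
```

By construction, the coherent and field components of each channel sum back to that channel's original signal, so no sound is added.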
  • FIG. 9 is a diagram illustrating an example of coherent-component extraction in a conventional method.
  • FIG. 10 is a diagram illustrating an example of coherent-component extraction in the above-described aspects.
  • FIGS. 9 and 10 both show an example in which audio signals are output from three speakers 90 arranged in a triangle; the example thus represents a three-channel audio system.
  • In the conventional method, a component having a high correlation between the audio signals of two channels is extracted as a coherent component 91 (the broken line 92 indicates a field component). Such a method can therefore acquire only information on sound located in the middle portion 93 between two of the speakers (channels) 90, and cannot extract information on sound located in the central portion 94 of the region surrounded by the three speakers (channels) 90.
  • In the above-described aspects, by contrast, the coherent component of one speaker (channel) 90 is estimated from the signals of the other speakers (channels) 90. Therefore, as shown in FIG. 10, information on sound located in the central portion 95 of the region surrounded by the three speakers (channels) 90 can be extracted.
  • This central portion 95 may correspond to the combination of the portions 93 and 94 in FIG. 9.
  • In another aspect, the division process may include: a step of dividing the audio signal of each channel into a plurality of frames using a window function; a step of generating a plurality of blocks for each channel by grouping at least two consecutive frames into one block over the whole set of frames; and a step of extracting the coherent component of the target channel in each of the blocks.
  • In this case, the number of samples available for estimating the coherent component increases, so the coherent component can be extracted with higher accuracy.
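A hedged sketch of that framing-and-blocking step. The window choice and 50% overlap are illustrative assumptions; the frame length of 2048 points and 24 frames per block follow the figures given later in the text.

```python
import numpy as np

def frames_to_blocks(signal, frame_len=2048, frames_per_block=24):
    """Window a 1-D signal into overlapping frames, then group frames into blocks."""
    hop = frame_len // 2                      # 50% overlap, as used with the MDCT
    window = np.kaiser(frame_len, 5.0)        # stand-in for the Kaiser-Bessel window
    n_frames = (len(signal) - frame_len) // hop + 1
    frames = np.stack([window * signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    n_blocks = n_frames // frames_per_block   # drop the incomplete trailing block
    return frames[: n_blocks * frames_per_block].reshape(
        n_blocks, frames_per_block, frame_len)
```

Each block then supplies frames_per_block times more samples to the coherent-component estimate than a single frame would.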
  • In another aspect, the division process may include: a step of generating a plurality of subband signals for each channel by dividing the audio signal of each channel into a plurality of subbands; a step of extracting the coherent component of the target channel in each of the subbands; and a step of obtaining the coherent component of the target channel by adding together the coherent components of the plurality of subbands.
  • In this case, the coherent component can be extracted with the accuracy required in each frequency band, so the coherent and field components can be extracted with high accuracy.
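The per-subband split and recombination can be illustrated as follows. The band edges here are made-up values for illustration, not the boundaries of Table 2, and recombination is shown as placing the per-subband coefficients side by side in the coefficient domain.

```python
import numpy as np

def split_subbands(coeffs, edges):
    """coeffs: (frames, bins) spectral coefficients; edges: ascending bin boundaries."""
    return [coeffs[:, lo:hi] for lo, hi in zip(edges[:-1], edges[1:])]

def merge_subbands(bands):
    """Reassemble per-subband results into a full-band coefficient array."""
    return np.concatenate(bands, axis=1)
```

Per-subband processing lets the estimator use different accuracy in each frequency band, while split followed by merge reproduces the original coefficients exactly.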
  • Seven stereo sound materials (that is, 2-channel audio signals), listed in Table 1, were prepared. All materials were obtained from commercially available CDs, and the sampling frequency was 44.1 kHz.
  • In Table 1, the name column shows the song title or the type of piece, and the description column shows the form of performance. "Artificial" in the mixing column indicates that the material underwent mixing processing, and "Natural" indicates that it did not.
  • The length column shows the playback time.
  • An overlap-add method using the modified discrete cosine transform (MDCT) was employed.
  • The Kaiser-Bessel window was used as the window function for dividing the audio signal into a plurality of frames.
  • The frame length was 2048 points, which means that 1024 frequency points are obtained in the MDCT.
  • The frequency points were grouped into 23 subbands, as shown in Table 2. With reference to the MPEG-2 AAC standard, these subbands were formed by merging every three consecutive bands of the 69 bands defined for a long FFT (Fast Fourier Transform) at 48 kHz. Twenty-four frames were taken as one block; since the sampling frequency was 44.1 kHz, the block length corresponds to 0.58 seconds.
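Those frame and block figures are mutually consistent, assuming the standard 50% MDCT overlap (a hop of half a frame):

```python
# 2048-point frames with a hop of half a frame give 1024 MDCT frequency
# points; a 24-frame block then spans (24 - 1) * 1024 + 2048 samples.
frame_len, hop, frames_per_block, fs = 2048, 1024, 24, 44100
freq_points = frame_len // 2
block_samples = (frames_per_block - 1) * hop + frame_len
block_seconds = block_samples / fs
print(freq_points, block_samples, round(block_seconds, 2))   # 1024 25600 0.58
```

At 44.1 kHz, 25600 samples indeed come out to about 0.58 seconds, matching the text.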
  • Table 3 shows cross-correlation coefficients of the original sound, the coherent component, and the field component.
  • The coherent component showed higher cross-correlation than the original sound.
  • Such a coherent component produces a sound-field image narrower than that of the original sound.
  • The field component showed a negative cross-correlation for all but one material ("Quiet Night"). If a field component with negative cross-correlation is reproduced by speakers installed at the sides or rear, a good ambience effect is obtained, making it possible to reproduce sound with a strong sense of presence.
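The evaluation metric above, a zero-lag cross-correlation coefficient between the two channels, can be sketched as follows (a standard definition, assumed rather than quoted from the text):

```python
import numpy as np

def cross_correlation(left, right):
    """Zero-lag cross-correlation coefficient between two equal-length channels."""
    l = left - left.mean()
    r = right - right.mean()
    return float(l @ r / np.sqrt((l @ l) * (r @ r)))
```

The coefficient ranges from -1 to 1; identical channels give 1, and channels in opposite phase, like the negatively correlated field components above, approach -1.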
  • As described above, the dividing unit 12 estimates the coherent component of a given target channel using the audio signals of channels other than the target channel.
  • Alternatively, the dividing unit may estimate the coherent component of the target channel using the audio signals of the other channels together with at least one of the past audio signal of the target channel and the past audio signals of the other channels.
  • Here, a "past audio signal" is the audio signal of a block temporally preceding the block being processed.
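A minimal sketch of this variant: the regressor matrix from which the target channel's coherent component would be estimated may stack the other channels' current block with past blocks. The function name and shapes are illustrative assumptions, not from the text.

```python
import numpy as np

def build_regressors(current_others, past_blocks):
    """current_others: (samples, channels - 1) block of the other channels;
    past_blocks: same-shaped arrays from temporally preceding blocks.
    The target channel's coherent component is then estimated from these columns."""
    return np.hstack([current_others] + list(past_blocks))
```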
  • The procedure of the audio signal processing method executed by at least one processor is not limited to the examples in the above embodiment.
  • The audio signal processing apparatus may omit some of the steps (processes) described above or execute the steps in a different order. It may also combine any two or more of the steps, or modify or delete part of a step. Alternatively, it may execute other steps in addition to the above.
  • When comparing the magnitudes of two values, the audio signal processing apparatus may use either of the two criteria "greater than" and "equal to or greater than", and either of the two criteria "less than" and "equal to or less than".
  • The choice between such criteria does not change the technical significance of the process of comparing the magnitudes of two values.


Abstract

An audio signal processing device according to one embodiment is provided with an acceptance unit for accepting audio signals of a plurality of channels, a division unit for dividing the audio signal of each channel into a coherent component and a field component, and an output unit for outputting the coherent component and the field component of each channel. In the division process, the estimated signal having the highest correlation with the audio signal of the channel being processed, among estimated signals calculated using at least the audio signals of the other channels, is extracted as the coherent component of the channel being processed. The difference between the audio signal of the channel being processed and its coherent component is then extracted as the field component.

Description

Audio signal processing apparatus, audio signal processing method, and audio signal processing program
One aspect of the present invention relates to an audio signal processing device, an audio signal processing method, and an audio signal processing program.
Methods for changing the number of channels of an audio signal are conventionally known. Specifically, there is a method called upmixing, which converts an M-channel audio signal into an N-channel audio signal (where N > M), and a method called downmixing, which converts an N-channel audio signal into an M-channel audio signal. For example, conversion from a 2-channel (left and right channels) audio signal to a 5.1-channel audio signal is an example of upmixing, and conversion from a 5.1-channel audio signal to a 2-channel audio signal is an example of downmixing.
For example, Patent Document 1 below describes a surround playback device that renders a stereo broadcast of a live sports television or radio program with a powerful sense of presence and easy-to-hear announcements. The device has front left/right channel signal creation means, front center channel signal creation means, and rear left/right surround channel signal creation means. The front left/right channel signal creation means selectively adds reverberation to the front left/right channel audio signals obtained by matrix processing of the 2-channel audio input, adjusts the front volume, and outputs the results as the front left/right channel audio signals. The front center channel signal creation means extracts the in-phase component from the 2-channel audio input, adjusts the center volume without adding reverberation, and outputs the result as the front center channel audio signal. The rear left/right surround channel signal creation means adds reverberation to the front left/right channel audio signals obtained by matrix processing, adjusts the rear volume, and outputs the results as the rear left/right channel audio signals.
Non-Patent Documents 1 and 2 below both describe upmixing techniques. Non-Patent Document 1 describes a method of band-splitting a stereo signal, dividing the stereo signal in each band into a main signal and an ambience signal, and reproducing the ambience signal from the rear channels of a 5.1-channel system. Non-Patent Document 2 describes a method of band-splitting a stereo signal, dividing it into a direct-sound component and a reverberant-sound component, and reproducing the reverberant component from the sides.
Non-Patent Documents 3 and 4 below each disclose a method of generating audio signals of three or more channels by dividing a multichannel audio signal into pairs of 2-channel audio signals.
Patent Document 1: JP 2007-28065 A
Since the surround playback apparatus described in Patent Document 1 adds reverberation to the original sound, the atmosphere (for example, the timbre) of the reproduced sound is changed from or impaired relative to the original sound. The methods described in Non-Patent Documents 1 and 2, by contrast, do not add reverberation, but in principle they can be applied only to 2-channel audio signals (that is, stereo signals).
In the methods described in Non-Patent Documents 3 and 4, a component having high correlation between the audio signals of two channels is extracted as a coherent component, so the acquired information corresponds to sound located near the midpoint of two speakers. Therefore, in an audio system with three or more channels, only information on sound near the midpoint of any two speakers can be extracted as a coherent component, and information on sound located in the central portion of the region surrounded by all the speakers cannot be extracted.
Therefore, there is a demand for a technique that maintains the atmosphere of the original sound as much as possible when the number of channels of an audio signal is changed, regardless of the number of channels of the original sound.
An audio signal processing device according to one aspect of the present invention includes: a receiving unit that receives audio signals of a plurality of channels; a dividing unit that executes, for each channel, a division process of dividing the audio signal into a coherent component and a field component, the division process including, when one channel subject to the division process is set as the target channel, a step of extracting, as the coherent component of the target channel, the estimated signal having the highest correlation with the audio signal of the target channel among estimated signals calculated using at least the audio signals of channels other than the target channel, and a step of extracting the difference between the audio signal of the target channel and the coherent component of the target channel as the field component of the target channel; and an output unit that outputs the coherent component and the field component of each channel extracted by the dividing unit.
An audio signal processing method according to one aspect of the present invention includes: an accepting step in which an audio signal processing device accepts audio signals of a plurality of channels; a dividing step in which the audio signal processing device executes, for each channel, a division process of dividing the audio signal into a coherent component and a field component, the division process including, when one channel subject to the division process is set as the target channel, a step of extracting, as the coherent component of the target channel, the estimated signal having the highest correlation with the audio signal of the target channel among estimated signals calculated using at least the audio signals of channels other than the target channel, and a step of extracting the difference between the audio signal of the target channel and the coherent component of the target channel as the field component of the target channel; and an output step in which the audio signal processing device outputs the coherent component and the field component of each channel extracted in the dividing step.
An audio signal processing program according to one aspect of the present invention causes a computer to execute: an accepting step of accepting audio signals of a plurality of channels; a dividing step of executing, for each channel, a division process of dividing the audio signal into a coherent component and a field component, the division process including, when one channel subject to the division process is set as the target channel, a step of extracting, as the coherent component of the target channel, the estimated signal having the highest correlation with the audio signal of the target channel among estimated signals calculated using at least the audio signals of channels other than the target channel, and a step of extracting the difference between the audio signal of the target channel and the coherent component of the target channel as the field component of the target channel; and an output step of outputting the coherent component and the field component of each channel extracted in the dividing step.
In these aspects, a signal that is estimated using the audio signals of channels other than the target channel and that has the highest correlation with the actual audio signal of the target channel is extracted as the coherent component of the target channel, and the difference between the actual audio signal of the target channel and its coherent component is extracted as the field component of the target channel. These coherent and field components are obtained for every channel. By determining each channel's coherent and field components using only the original audio signals, without adding any sound, the atmosphere of the original sound can be maintained as much as possible. In addition, since coherent and field components are obtained for as many channels as the original signal has, this method can be applied regardless of the number of channels of the original sound.
According to one aspect of the present invention, the atmosphere of the original sound can be maintained as much as possible when the number of channels of an audio signal is changed, regardless of the number of channels of the original sound.
FIG. 1 is a diagram showing an example of audio signal processing according to the embodiment. FIG. 2 is a diagram showing the hardware configuration of a computer functioning as the audio signal processing apparatus according to the embodiment. FIG. 3 is a diagram showing the functional configuration of the audio signal processing apparatus according to the embodiment. FIG. 4 is a diagram showing a block, which is a unit in which an audio signal is processed. FIG. 5 is a diagram showing the processing for one channel. FIG. 6 is a flowchart showing the operation of the audio signal processing apparatus according to the embodiment. FIG. 7 is a flowchart showing the details of the coherent-component extraction shown in FIG. 6. FIG. 8 is a diagram showing the configuration of the audio signal processing program according to the embodiment. FIG. 9 is a diagram showing an example of coherent-component extraction in a conventional method. FIG. 10 is a diagram showing an example of coherent-component extraction in the embodiment.
Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. In the description of the drawings, identical or equivalent elements are given the same reference numerals, and redundant descriptions are omitted.
The function and configuration of the audio signal processing apparatus 10 according to the embodiment are described with reference to FIGS. 1 to 5. The audio signal processing apparatus 10 is a computer that divides each of a plurality of channels' audio signals into a coherent component and a field component. An audio signal is a digital signal containing sound in the frequency band audible to humans (generally about 20 Hz to 20,000 Hz), and is converted into an analog signal as necessary. Examples of sound represented by an audio signal include, but are not limited to, voice, music, the sound of video, natural sound, or any combination of these.
FIG. 1 shows an example of audio signal processing by the audio signal processing apparatus 10, and more specifically shows the processing of a 2-channel (L-channel and R-channel), that is, stereo, audio signal. The audio signal processing apparatus 10 divides the signal of each channel into a coherent component and a field component.
The coherent component of a given channel is a component having high correlation with the audio signals of the other channels. The field component of a channel is the difference between that channel's audio signal (that is, the original signal) and its coherent component; in other words, the field component is obtained by subtracting the coherent component from the audio signal. Whereas the coherent component is sound with a clear direction, the field component is diffuse, ambient sound that surrounds the listener. Hereinafter, the sound corresponding to the field component is also called "field sound".
FIG. 1 shows the audio signal processing apparatus 10 dividing the L-channel audio signal into an L-channel coherent component Lγ and field component Lφ, and the R-channel audio signal into an R-channel coherent component Rγ and field component Rφ. The coherent component Lγ has high correlation with the R-channel audio signal, and the coherent component Rγ has high correlation with the L-channel audio signal.
Although FIG. 1 shows the processing of a 2-channel audio signal, the audio signal processing apparatus 10 may process any number of channels. It may process audio signals of three or more channels; for example, it may process the 22.2-channel audio signal used for 8K Super Hi-Vision.
To realize stereophonic effects that reproduce the direction, distance, and spread of sound in three-dimensional space, multichannel audio signals are recorded by a plurality of microphones distributed in the space. The signals of the channels are recorded with multiple object sounds mixed with one another and with field sound. Because the distance from a sound source generally differs from microphone to microphone, the time at which a given sound arrives differs between microphones, and as a result the coherence of the recorded audio signals becomes low. If the coherent component can be extracted from each channel's audio signal, the clarity of the sound and the apparent source width (ASW) can be improved. Furthermore, extracting the field component and using it for upmixing makes it possible to produce a good ambience effect (the feeling that sound surrounds the listener). In general, the coherent component corresponds to object sounds emitted from the main sound sources (for example, a singing voice, instrument sounds, or sound from a loudspeaker), and the field component corresponds to sound whose direction is not clear (for example, echoes or beats).
 N個のチャネルのうちl番目のチャネルのオーディオ信号をx(n)とすると、このオーディオ信号x(n)はM個の目的音qlm(n)(m=1,…,M)とフィールド音v(n)とから成る。すなわち、オーディオ信号x(n)は式(1)で示される。
Figure JPOXMLDOC01-appb-M000001
Assuming that the audio signal of the l-th channel among the N channels is x l (n), the audio signal x l (n) is M target sounds q lm (n) (m = 1,..., M). And field sound v l (n). That is, the audio signal x l (n) is expressed by the equation (1).
Figure JPOXMLDOC01-appb-M000001
In equation (1), the object sounds and the field sound can be regarded as statistically independent of one another. The coherent component $\gamma_l(n)$ of the audio signal $x_l(n)$ is expressed by equation (2):

$$\gamma_l(n) = \sum_{m=1}^{M} q_{lm}(n) \qquad (2)$$
The field component $\varphi_l(n)$ of the audio signal $x_l(n)$ is expressed by equation (3):

$$\varphi_l(n) = x_l(n) - \gamma_l(n) = v_l(n) \qquad (3)$$
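A small numeric illustration of this decomposition, taking the coherent component to be the object-sound sum (an assumption consistent with the surrounding definitions): the channel signal is the sum of the object sounds and the field sound, and subtracting the coherent component leaves exactly the field sound.

```python
import numpy as np

rng = np.random.default_rng(1)
M, n = 3, 500
q = rng.standard_normal((M, n))    # object sounds q_{lm}(n), m = 1..M
v = rng.standard_normal(n)         # field sound v_l(n)
x = q.sum(axis=0) + v              # channel signal: object sounds plus field sound
gamma = q.sum(axis=0)              # coherent component, taken as the object-sound sum
phi = x - gamma                    # field component: original minus coherent
assert np.allclose(phi, v)         # the remainder is exactly the field sound
```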
The specific implementation of the audio signal processing apparatus 10 is not limited. For example, the audio signal processing apparatus 10 may be realized by installing a predetermined program (for example, the audio signal processing program P1 described later) on a computer such as a personal computer, a server, or a mobile terminal. Alternatively, audio equipment such as an amplifier may function as the audio signal processing apparatus 10.
FIG. 2 shows the general hardware configuration of a computer 100 functioning as the audio signal processing apparatus 10. The computer 100 includes a processor (for example, a CPU) 101 that executes an operating system, application programs, and the like; a main storage unit 102 composed of ROM and RAM; an auxiliary storage unit 103 composed of a hard disk, flash memory, or the like; a communication control unit 104 composed of a network card or a wireless communication module; an input device 105 such as a keyboard and a mouse; and an output device 106 such as a monitor.
Each functional element of the audio signal processing apparatus 10 is realized by loading predetermined software (for example, the audio signal processing program P1 described later) onto the processor 101 or the main storage unit 102 and executing that software. Following the software, the processor 101 operates the communication control unit 104, the input device 105, or the output device 106, and reads and writes data in the main storage unit 102 or the auxiliary storage unit 103. The data and databases required for processing are stored in the main storage unit 102 or the auxiliary storage unit 103.
The audio signal processing apparatus 10 may be composed of a single computer or of a plurality of computers. When a plurality of computers are used, they are connected via a communication network such as the Internet or an intranet so that one audio signal processing apparatus 10 is logically constructed.
FIG. 3 shows the functional configuration of the audio signal processing apparatus 10. As shown in FIG. 3, the audio signal processing apparatus 10 includes a receiving unit 11, a dividing unit 12, and an output unit 13 as functional components.
The receiving unit 11 is a functional element that receives audio signals of a plurality of channels. "Receiving an audio signal" means that the audio signal processing apparatus 10 acquires the audio signal by any method; in other words, it means that the audio signal is input to the audio signal processing apparatus 10. The specific way in which each channel's audio signal is received is not limited. For example, the receiving unit 11 may receive an audio signal by accessing a database or another device and reading out the audio signal's data file, or it may receive an audio signal sent from another device over a communication network, or it may acquire an audio signal input on the audio signal processing apparatus 10 itself. In any case, the receiving unit 11 outputs the received audio signal of each channel to the dividing unit 12.
 The dividing unit 12 is a functional element that divides the audio signal of each channel into a coherent component and a field component. The following description assumes that the dividing unit 12 processes the N-channel audio signal shown in Eq. (4):

 { x_l(n) | l = 1, …, N }   (4)
 First, the dividing unit 12 divides the audio signal of each channel into signals of a plurality of time sections. Specifically, the dividing unit 12 uses a window function (for example, a Kaiser-Bessel window) to cut the audio signal into short time intervals called "frames." For example, if 1024 frequency points are used in the modified discrete cosine transform (MDCT) described later, the dividing unit 12 divides the audio signal into frames using a Kaiser-Bessel window with a length of 2048 points. The number of samples in one frame is usually chosen to give an appropriate frequency resolution, but that number of samples is not sufficient for estimating the coherent component. The dividing unit 12 therefore groups a plurality of consecutive frames (for example, 24 frames) into one time section called a "block." FIG. 4 illustrates this block generation; more specifically, it shows the process of dividing each of the two-channel (L-channel and R-channel) audio signals into a plurality of blocks.
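As a rough sketch of this framing and blocking step (in Python with NumPy; the function name, the Kaiser β value, and the 50% hop are illustrative assumptions, while the 2048-point window and 24-frame blocks follow the text):

```python
import numpy as np

def split_into_blocks(x, frame_len=2048, frames_per_block=24, beta=5.0):
    """Cut a signal into 50%-overlapping windowed frames, then group
    consecutive frames into blocks (illustrative sketch)."""
    hop = frame_len // 2                    # 50% overlap, as in MDCT framing
    window = np.kaiser(frame_len, beta)     # np.kaiser stands in for the Kaiser-Bessel window
    n_frames = (len(x) - frame_len) // hop + 1
    frames = np.stack([window * x[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    n_blocks = n_frames // frames_per_block  # e.g. 24 frames form one block
    return frames[: n_blocks * frames_per_block].reshape(
        n_blocks, frames_per_block, frame_len)
```

Trailing frames that do not fill a complete block are simply dropped here; a real implementation would need a policy for them.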
 After dividing the audio signal of each channel into a plurality of blocks, the dividing unit 12 executes the following processing for each block of each channel. In this specification, the channel whose audio signal is to be divided into a coherent component and a field component (that is, the channel subject to the division processing) is called the "target channel." The processing for one target channel is described below.
 The dividing unit 12 first extracts the coherent component of the target channel and then extracts the field component of that channel. FIG. 5 illustrates the extraction of the coherent component, which corresponds to the first half of this sequence. Using a filter bank, the dividing unit 12 divides the audio signal x_l(n) of the l-th channel (the target channel) into signals of K frequency bands (subbands), called "subband signals." In each subband, the dividing unit 12 then extracts the coherent component γ_l^(k)(n) (k = 1, …, K) using the audio signals of the channels other than the target channel, applying the least squares method to this extraction. The dividing unit 12 obtains the coherent component γ_l(n) of the target channel by adding the coherent components of all subbands. Finally, the dividing unit 12 extracts the field component φ_l(n) by subtracting the coherent component γ_l(n) from the original audio signal x_l(n).
 The dividing unit 12 executes the following processing for each block of the audio signal of the target channel.
 Using a filter bank, the dividing unit 12 divides the audio signal x_l(n) of each channel into K subband signals x_l^(k)(n). This decomposition is expressed by Eq. (5):

 x_l(n) = Σ_{k=1}^{K} x_l^(k)(n)   (5)
 Note that the subband signals x_l^(k)(n) in Eq. (5) are signals in the time domain, that is, time-domain subband signals. Unlike the methods of Non-Patent Documents 1 to 4 above, which use signals in the frequency domain, the audio signal processing apparatus 10 uses time-domain subband signals, so the estimation interval can be lengthened by treating an arbitrary number of consecutive frames as one block signal. As a result, the audio signal of each channel can be processed without impairing the sound quality of the obtained coherent component.
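The additivity of Eq. (5) can be illustrated with a simple band-partition filter bank (an FFT bin partition stands in for the patent's MDCT-based bank; K and the band edges are arbitrary assumptions):

```python
import numpy as np

def subband_split(x, K=4):
    """Split x into K time-domain subband signals x^(k)(n) that sum
    exactly back to x, as in Eq. (5) (FFT partition, illustrative)."""
    X = np.fft.rfft(x)
    edges = np.linspace(0, len(X), K + 1).astype(int)  # contiguous band edges
    subbands = []
    for k in range(K):
        Xk = np.zeros_like(X)
        Xk[edges[k]:edges[k + 1]] = X[edges[k]:edges[k + 1]]
        subbands.append(np.fft.irfft(Xk, n=len(x)))    # time-domain subband signal
    return np.array(subbands)
```

Because the bands partition the spectrum, the K subband signals add back to the original signal to within floating-point error.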
 Next, the dividing unit 12 estimates the subband signal x_l^(k)(n) from a linear combination of the same-band subband signals {x_m^(k)(n) | m = 1, …, l−1, l+1, …, N} of the N−1 channels other than the target channel. For one block, this linear combination is expressed by Eq. (6):

 x̂_l^(k)(n) = Σ_{m≠l} a_m^(k) x_m^(k)(n)   (6)
 The estimated signal x̂_l^(k)(n) can be regarded as the component that is highly correlated with the same-band signals of the other channels (the N−1 channels other than the target channel). The estimation error e_l^(k)(n) between the subband signal of the target channel and this estimated signal is expressed by Eq. (7):

 e_l^(k)(n) = x_l^(k)(n) − x̂_l^(k)(n)   (7)
 The dividing unit 12 obtains the coefficients {a_m^(k) | m = 1, …, l−1, l+1, …, N} that minimize this estimation error by the least squares method. The error function to be minimized is expressed by Eq. (8):

 E_l^(k) = Σ_n [e_l^(k)(n)]²   (8)
 Here, setting the partial derivatives ∂E_l^(k)/∂a_m^(k) to zero, the optimal coefficients {â_m^(k)} satisfy Eq. (9):

 Σ_n x_m^(k)(n) ( x_l^(k)(n) − Σ_{m'≠l} â_{m'}^(k) x_{m'}^(k)(n) ) = 0   (9)
 Writing Eq. (9) simultaneously for m = 1, …, l−1, l+1, …, N yields Eq. (10):

 R_l^(k) â_l^(k) = r_l^(k)   (10)

 where R_l^(k) is the (N−1) × (N−1) correlation matrix whose (m, m') entry is Σ_n x_m^(k)(n) x_{m'}^(k)(n), â_l^(k) = (â_1^(k), …, â_{l−1}^(k), â_{l+1}^(k), …, â_N^(k))^T is the coefficient vector, and r_l^(k) is the vector whose m-th entry is Σ_n x_m^(k)(n) x_l^(k)(n), with m, m' ≠ l.
 The coefficient vector â_l^(k) of the target channel in the k-th subband is obtained by Eq. (11):

 â_l^(k) = (R_l^(k))^{−1} r_l^(k)   (11)
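Equations (10) and (11) are the normal equations of an ordinary least-squares fit, so in practice the coefficient vector can be computed directly (the function and variable names are illustrative):

```python
import numpy as np

def ls_coefficients(target, others):
    """Solve Eq. (11): the vector a^ minimizing ||target - others @ a||^2.
    `others` holds one column per non-target channel (samples x channels)."""
    R = others.T @ others         # correlation matrix of Eq. (10)
    r = others.T @ target         # cross-correlation vector of Eq. (10)
    return np.linalg.solve(R, r)  # a^ = R^{-1} r, Eq. (11)
```

When R is close to singular, `np.linalg.lstsq(others, target, rcond=None)` is the numerically safer way to obtain the same coefficients.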
 The coherent component γ_l^(k)(n) of the target channel in the k-th subband is obtained by Eq. (12). This coherent component γ_l^(k)(n) corresponds to the estimated signal that, among the estimated signals calculated using the audio signals of the channels other than the target channel, has the highest correlation with the audio signal of the target channel.

 γ_l^(k)(n) = Σ_{m≠l} â_m^(k) x_m^(k)(n)   (12)
 The dividing unit 12 obtains the coherent component for every subband, and then obtains the coherent component of the target channel by adding the coherent components of all subbands. This processing is expressed by Eq. (13):

 γ_l(n) = Σ_{k=1}^{K} γ_l^(k)(n)   (13)
 Further, the dividing unit 12 obtains the field component of the target channel by subtracting its coherent component from the original audio signal of the target channel. This processing is expressed by Eq. (3) above.
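Putting Eqs. (6) to (13) and Eq. (3) together for a single block, one possible sketch of the whole split is the following (an FFT bin partition again stands in for the patent's filter bank; all names and the choice of K are illustrative assumptions):

```python
import numpy as np

def coherent_field_split(x, l, K=4):
    """x: (channels, samples) array for one block. Returns the coherent
    and field components of target channel l per Eqs. (6)-(13) and (3)."""
    N, n = x.shape
    X = np.fft.rfft(x, axis=1)
    edges = np.linspace(0, X.shape[1], K + 1).astype(int)
    others = [m for m in range(N) if m != l]
    coherent = np.zeros(n)
    for k in range(K):
        Xk = np.zeros_like(X)
        Xk[:, edges[k]:edges[k + 1]] = X[:, edges[k]:edges[k + 1]]
        sub = np.fft.irfft(Xk, n=n, axis=1)             # time-domain subband signals
        A = sub[others].T                               # regressors, Eq. (6)
        a, *_ = np.linalg.lstsq(A, sub[l], rcond=None)  # Eq. (11)
        coherent += A @ a                               # Eqs. (12) and (13)
    field = x[l] - coherent                             # Eq. (3)
    return coherent, field
```

If the target channel happens to be an exact linear combination of the other channels, the field component vanishes, which is a convenient sanity check.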
 Alternatively, the dividing unit 12 may obtain a field component in each subband by subtracting the coherent component from the subband signal, and then obtain the field component of the target channel by adding the field components of all subbands. Specifically, the field component φ_l^(k)(n) of the target channel in the k-th subband is obtained by Eq. (14), and the field component φ_l(n) of the target channel by Eq. (15):

 φ_l^(k)(n) = x_l^(k)(n) − γ_l^(k)(n)   (14)

 φ_l(n) = Σ_{k=1}^{K} φ_l^(k)(n)   (15)
 The dividing unit 12 executes the above processing for each block of the audio signal of the target channel. It then extracts the coherent component of the target channel by concatenating the coherent components of all blocks, and generates the field component of the target channel by concatenating the field components of all blocks.
 By setting each of the plurality of channels as the target channel in turn and executing the above processing, the dividing unit 12 generates a coherent component and a field component for every channel. The dividing unit 12 then outputs the coherent components and field components of all channels to the output unit 13.
 In this way, the dividing unit 12 divides the audio signal of each channel into a coherent component and a field component without adding any other signal to the audio signal of each channel (that is, without adding any other sound to the original sound).
 The output unit 13 is a functional element that outputs, as the processing result, the coherent component and field component of each channel generated by the dividing unit 12. This result can be regarded as an upmix from N channels to 2N channels. The method of outputting the result is not limited in any way. For example, the output unit 13 may store the result in a storage device such as a memory or a database, or may transmit it to another apparatus via a communication network. Alternatively, the output unit 13 may output the coherent component and field component of each channel to corresponding loudspeakers. In any case, the processing result of the audio signal processing apparatus 10 makes it possible to use existing audio material to produce content with a larger number of channels, or to reproduce it on an audio system having a larger number of channels.
 The audio signal processing apparatus 10 may upmix an N-channel audio signal to more than 2N channels. Specifically, the audio signal processing apparatus 10 decorrelates the plurality of extracted field components by the technique described in the reference below, thereby generating signals whose inter-channel correlations differ from one another. More than N field components are thereby obtained. For example, stereo audio material can be converted into 5.1-channel audio material, or reproduced with a greater sense of presence on a 5.1-channel audio system. Likewise, 5.1-channel audio material can be converted into 22.2-channel audio material, or reproduced with a greater sense of presence on a 22.2-channel audio system.
 (Reference) J. Breebaart and C. Faller, "Spatial Audio Processing: MPEG Surround and Other Applications," Wiley, 2007.
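The reference describes several decorrelation techniques; as a toy stand-in, an all-pass filter realized by phase randomization keeps the magnitude spectrum of a field component while making the copy nearly uncorrelated with the original (the function and its parameters are illustrative, not the method of the reference):

```python
import numpy as np

def allpass_decorrelate(x, seed=0):
    """Return a copy of x with the same magnitude spectrum but
    randomized phase -- a crude decorrelator sketch."""
    rng = np.random.default_rng(seed)
    X = np.fft.rfft(x)
    phase = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, size=X.shape))
    phase[0] = 1.0                 # keep the DC bin real
    if len(x) % 2 == 0:
        phase[-1] = 1.0            # keep the Nyquist bin real
    return np.fft.irfft(X * phase, n=len(x))
```

Applying filters with different seeds to copies of the same field component yields mutually decorrelated signals with an unchanged spectral envelope.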
 The audio signal processing apparatus 10 may also upmix an N-channel audio signal to J channels, where J is smaller than 2N (and J > N). Specifically, the audio signal processing apparatus 10 realizes the upmix from N channels to J channels by mixing the N field components.
 The processing result of the audio signal processing apparatus 10 can be used not only for upmixing but also for downmixing.
 Next, the operation of the audio signal processing apparatus 10 and the audio signal processing method according to the present embodiment are described with reference to FIGS. 6 and 7. In the audio signal processing apparatus 10, the reception unit 11 first receives the audio signals of a plurality of channels (reception step). The dividing unit 12 then executes, for each channel, the division processing that divides the audio signal into a coherent component and a field component (division step). Finally, the output unit 13 outputs the coherent component and field component of each channel (output step). The particularly important processing of the dividing unit 12 (the division step) is described in detail below.
 FIG. 6 shows the process of generating the coherent component and field component of one target channel.
 First, the dividing unit 12 divides the audio signal of each channel into a plurality of blocks (step S11). If the audio signals of each channel and each block divided in step S11 are stored, step S11 can be omitted when processing the second and subsequent target channels.
 Next, the dividing unit 12 sets one of the plurality of blocks of the target channel as the processing target (step S12). The dividing unit 12 then extracts, as the coherent component of the target channel, the estimated signal that has the highest correlation with the audio signal of the target channel among the estimated signals calculated using the audio signals of the channels other than the target channel (step S13). The dividing unit 12 then extracts the difference between the audio signal of the target channel and its coherent component as the field component of the target channel (step S14). Through this processing, the dividing unit 12 obtains the coherent component and field component of one block of the target channel.
 After processing one block, the dividing unit 12 moves on to the next block (see step S15). That is, the dividing unit 12 sets the next block as the processing target (step S12) and generates the coherent component and field component of that block (steps S13 and S14). The dividing unit 12 executes steps S12 to S14 for every block, generating the coherent components and field components of all blocks (YES in step S15). The dividing unit 12 then obtains the final coherent component of the target channel by concatenating the coherent components of all blocks, and the final field component of the target channel by concatenating the field components of all blocks.
 FIG. 7 shows the details of step S13 in FIG. 6, that is, the processing that generates the coherent component of the target channel. The processing shown in FIG. 7 is executed for each block of the audio signal of the target channel.
 First, for each channel (the target channel and all other channels), the dividing unit 12 generates a plurality of subband signals by dividing the block signal into a plurality of subbands (step S131). Next, the dividing unit 12 sets one of the subbands as the processing target (step S132). The dividing unit 12 then extracts, as the coherent component of the target channel in the subband being processed, the estimated signal that has the highest correlation with the subband signal of the target channel among the estimated signals calculated using the subband signals of the channels other than the target channel (step S133). The dividing unit 12 executes steps S132 and S133 for every subband (see step S134). When the coherent components of all subbands have been generated for the target channel (YES in step S134), the dividing unit 12 adds those coherent components to generate the coherent component of the target channel (more specifically, the coherent component for one block) (step S135).
 Next, an audio signal processing program P1 for causing a computer to function as the audio signal processing apparatus 10 is described with reference to FIG. 8.
 The audio signal processing program P1 includes a main module P10, a reception module P11, a division module P12, and an output module P13. The main module P10 supervises the overall processing of the audio signals. The functions realized by executing the reception module P11, the division module P12, and the output module P13 are the same as those of the reception unit 11, the dividing unit 12, and the output unit 13 described above, respectively.
 The audio signal processing program P1 may be provided after being fixedly recorded on a tangible recording medium such as a CD-ROM, a DVD-ROM, or a semiconductor memory. Alternatively, the audio signal processing program P1 may be provided via a communication network as a data signal superimposed on a carrier wave.
 As described above, an audio signal processing device according to one aspect of the present invention comprises: a reception unit that receives audio signals of a plurality of channels; a dividing unit that executes, for each channel, division processing that divides the audio signal into a coherent component and a field component, the division processing including, where one channel subject to the division processing is the target channel, a step of extracting, as the coherent component of the target channel, the estimated signal that has the highest correlation with the audio signal of the target channel among estimated signals calculated using at least the audio signals of channels other than the target channel, and a step of extracting the difference between the audio signal of the target channel and the coherent component of the target channel as the field component of the target channel; and an output unit that outputs the coherent component and field component of each channel extracted by the dividing unit.
 An audio signal processing method according to one aspect of the present invention comprises: a reception step in which an audio signal processing device receives audio signals of a plurality of channels; a division step in which the audio signal processing device executes, for each channel, division processing that divides the audio signal into a coherent component and a field component, the division processing including, where one channel subject to the division processing is the target channel, a step of extracting, as the coherent component of the target channel, the estimated signal that has the highest correlation with the audio signal of the target channel among estimated signals calculated using at least the audio signals of channels other than the target channel, and a step of extracting the difference between the audio signal of the target channel and the coherent component of the target channel as the field component of the target channel; and an output step in which the audio signal processing device outputs the coherent component and field component of each channel extracted in the division step.
 An audio signal processing program according to one aspect of the present invention causes a computer to execute: a reception step of receiving audio signals of a plurality of channels; a division step of executing, for each channel, division processing that divides the audio signal into a coherent component and a field component, the division processing including, where one channel subject to the division processing is the target channel, a step of extracting, as the coherent component of the target channel, the estimated signal that has the highest correlation with the audio signal of the target channel among estimated signals calculated using at least the audio signals of channels other than the target channel, and a step of extracting the difference between the audio signal of the target channel and the coherent component of the target channel as the field component of the target channel; and an output step of outputting the coherent component and field component of each channel extracted in the division step.
 In these aspects, the signal that is estimated using the audio signals of channels other than the target channel and that has the highest correlation with the actual audio signal of the target channel is extracted as the coherent component of the target channel. The difference between the actual audio signal of the target channel and its coherent component is extracted as the field component of the target channel. These coherent and field components are obtained for every channel. By obtaining the coherent component and field component of each channel using only the original audio signals, without adding any sound, the atmosphere of the original sound (for example, its original timbre) can be maintained as far as possible, or completely. In addition, since coherent components and field components can be obtained for as many channels as the original signal has, the technique can be applied regardless of the number of channels of the original sound. For example, one aspect of the present invention can be applied to audio signals with any number of channels, such as 2 channels, 3 channels, 5.1 channels, or 22.2 channels.
 The advantage of this aspect is explained with reference to FIGS. 9 and 10. FIG. 9 illustrates the extraction of coherent components by a conventional method, and FIG. 10 illustrates the extraction of coherent components in the above aspect. Both figures show an example in which audio signals are output from three loudspeakers 90 arranged in a triangle, that is, a three-channel audio system.
 As shown in FIG. 9, the methods described in Non-Patent Documents 3 and 4 above extract, as a coherent component 91, the component that is highly correlated between the audio signals of two channels (the broken lines 92 indicate field components). Such conventional methods can therefore obtain only the information of sounds located in the intermediate region 93 between two loudspeakers (channels) 90, and cannot extract the information of sounds located in the central region 94 surrounded by the three loudspeakers (channels) 90.
 In the above aspect, by contrast, the coherent component of one loudspeaker (channel) 90 is estimated from the signals of the other loudspeakers (channels) 90. As shown in FIG. 10, the information of sounds located in the central region 95 surrounded by the three loudspeakers (channels) 90 can therefore be extracted. This central region 95 can correspond to the union of the regions 93 and 94 in FIG. 9.
 In an audio signal processing device according to another aspect, the division processing may include a step of executing, for each channel, processing that cuts the audio signal into a plurality of frames using a window function; a step of executing, for each channel, processing that generates a plurality of blocks by grouping at least two consecutive frames into one block over the whole set of frames; and a step of extracting the coherent component of the target channel in each of the blocks.
 By adopting blocks composed of a plurality of frames, the number of samples available for estimating the coherent component increases, so the coherent component can be extracted more accurately.
 In an audio signal processing device according to another aspect, the dividing unit may perform a step of generating a plurality of subband signals for each channel by dividing the audio signal of each channel into a plurality of subbands; a step of extracting the coherent component of the target channel in each of the subbands; and a step of extracting the coherent component of the target channel by adding the coherent components of the subbands.
 In audio processing, some frequencies are generally more important than others. Processing each subband separately makes it possible to extract the coherent component with the accuracy required in each frequency band, and hence to extract the coherent component and field component accurately.
 The present invention is described below concretely on the basis of examples, but the present invention is in no way limited to them.
 表1に示される7個のステレオ音声素材(すなわち、2チャネルのオーディオ信号)を用意した。いずれの音声素材も市販のCDから入手したものであり、サンプリング周波数は44.1kHzであった。表1の名前欄は曲名または楽曲の種類を示し、説明欄は演奏の形態を示す。ミキシング欄における「Artifical」はミキシング処理が施された素材であることを示し、「Natural」はミキシング処理が施されていない素材であることを示す。長さ欄は再生時間を示す。
Figure JPOXMLDOC01-appb-T000022
Seven stereo sound materials (that is, 2-channel audio signals) shown in Table 1 were prepared. All audio materials were obtained from commercially available CDs, and the sampling frequency was 44.1 kHz. The name column in Table 1 shows the song name or the type of song, and the explanation column shows the form of performance. “Artifical” in the mixing column indicates that the material has been subjected to mixing processing, and “Natural” indicates that the material has not been subjected to mixing processing. The length column shows the playback time.
 オーディオ信号を完全に再構築できるフィルタバンクを構築するために、変形離散コサイン変換(MDCT)を用いた重畳加算法を採用した。オーディオ信号を複数のフレームに分割するための窓関数としてカイザー・ベッセル窓を用いた。フレーム長は2048点とし、これは、MDCTにおいて1024個の周波数点が得られることを意味する。その周波数点を表2に示すように23個のサブバンドにまとめた。これらのサブバンドは、MPEG-2 AAC標準を参考に、48kHz long FFT(高速フーリエ変換)における69個のサブバンドを三つの連続するサブバンド毎に一つにまとめたものである。24個のフレームを1ブロックとした。サンプリング周波数が44.1kHzであれば、ブロック長は0.58秒に相当するものであった。
Figure JPOXMLDOC01-appb-T000023
In order to construct a filter bank that can perfectly reconstruct the audio signal, an overlap-add method based on the modified discrete cosine transform (MDCT) was employed. A Kaiser-Bessel window was used as the window function for dividing the audio signal into frames. The frame length was 2048 points, which means that 1024 frequency points are obtained per MDCT frame. As shown in Table 2, these frequency points were grouped into 23 subbands; with reference to the MPEG-2 AAC standard, they were formed by merging every three consecutive subbands of the 69 subbands of the 48 kHz long FFT into one. Twenty-four frames formed one block; at a sampling frequency of 44.1 kHz, the block length corresponds to 0.58 seconds.
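The analysis/synthesis filter bank described above can be sketched in Python as follows. The sketch is hedged: the text names a Kaiser-Bessel window, and here a Kaiser-Bessel-derived (KBD) window with α = 4 is used, since KBD is the common variant satisfying the perfect-reconstruction (Princen-Bradley) condition for a 50%-overlap MDCT filter bank; the α value, the 50% hop, and the parameterized frame length (rather than a fixed 2048 points) are assumptions of the sketch.

```python
import numpy as np

def kbd_window(n, alpha=4.0):
    # Kaiser-Bessel-derived window of even length n; satisfies the
    # Princen-Bradley condition w[k]^2 + w[k + n/2]^2 = 1 required for
    # perfect reconstruction with a 50%-overlap MDCT filter bank.
    kb = np.kaiser(n // 2 + 1, np.pi * alpha)
    csum = np.cumsum(kb)
    half = np.sqrt(csum[:-1] / csum[-1])
    return np.concatenate([half, half[::-1]])

def mdct(frame, window):
    # Forward MDCT of one frame of length 2N -> N coefficients.
    n = len(frame) // 2
    t = np.arange(2 * n) + 0.5 + n / 2.0
    k = np.arange(n) + 0.5
    basis = np.cos(np.pi / n * np.outer(k, t))
    return basis @ (window * frame)

def imdct(coeffs, window):
    # Inverse MDCT with synthesis windowing; overlap-adding consecutive
    # outputs cancels the time-domain aliasing.
    n = len(coeffs)
    t = np.arange(2 * n) + 0.5 + n / 2.0
    k = np.arange(n) + 0.5
    basis = np.cos(np.pi / n * np.outer(t, k))
    return (2.0 / n) * window * (basis @ coeffs)

def analyze_synthesize(signal, n):
    # Frame with hop N (50% overlap), transform, inverse-transform, and
    # overlap-add; returns the reconstructed signal.
    w = kbd_window(2 * n)
    padded = np.concatenate([np.zeros(n), signal, np.zeros(n)])
    out = np.zeros_like(padded)
    for start in range(0, len(padded) - 2 * n + 1, n):
        out[start:start + 2 * n] += imdct(mdct(padded[start:start + 2 * n], w), w)
    return out[n:n + len(signal)]
```

With a frame length of 2048 this yields the 1024 frequency points per frame mentioned in the text; grouping those points by index ranges then gives the 23 subbands of Table 2.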
 実験結果をチャネル間の相互相関係数で評価した。原音、コヒーレント成分、およびフィールド成分の相互相関係数を表3に示す。コヒーレント成分は原音よりも高い相互相関を示した。このようなコヒーレント成分は原音よりも狭い音場の雰囲気をもたらす。一方、フィールド成分は、一個の素材(“Quiet Night”)を除いて負の相互相関を示した。負の相互相関を示すフィールド成分を側方もしくは後方に設置したスピーカで再生すれば、良好なアンビエンス効果が得られる。その結果として、臨場感の高い音を再生することができる。
Figure JPOXMLDOC01-appb-T000024
The experimental results were evaluated using the cross-correlation coefficient between channels. Table 3 shows the cross-correlation coefficients of the original sound, the coherent component, and the field component. The coherent component showed a higher cross-correlation than the original sound; such a coherent component produces a sound-field impression narrower than that of the original. The field component, by contrast, showed a negative cross-correlation for every material except one ("Quiet Night"). If a field component with negative cross-correlation is reproduced through loudspeakers placed to the side of or behind the listener, a good ambience effect is obtained; as a result, sound with a strong sense of presence can be reproduced.
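The evaluation measure used above can be reproduced as follows. The text does not state the exact definition, so this sketch assumes the zero-lag, mean-removed (Pearson) form of the interchannel cross-correlation coefficient.

```python
import numpy as np

def interchannel_correlation(left, right):
    # Zero-lag, mean-removed (Pearson) cross-correlation coefficient
    # between two channels; the exact definition used in the experiment
    # is not given in the text, so this choice is an assumption.
    l = left - np.mean(left)
    r = right - np.mean(right)
    return float(np.dot(l, r) / np.sqrt(np.dot(l, l) * np.dot(r, r)))
```

The coefficient ranges from −1 to +1; values near +1 correspond to the highly correlated coherent components in Table 3, while the negative values correspond to the field components suited to side or rear loudspeakers.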
 以上、本発明をその実施形態に基づいて詳細に説明した。しかし、本発明は上記実施形態に限定されるものではない。本発明は、その要旨を逸脱しない範囲で様々な変形が可能である。 The present invention has been described in detail above on the basis of its embodiments. However, the present invention is not limited to those embodiments; various modifications are possible without departing from the gist of the invention.
 上記実施形態では、分割部12が、ある一つの対象チャネルのコヒーレント成分を、該対象チャネル以外のチャネルのオーディオ信号を用いて推定した。この変形例として、分割部は、当該他チャネルのオーディオ信号と、対象チャネルの過去のオーディオ信号および当該他チャネルの過去のオーディオ信号の少なくとも一方とを用いて、該対象チャネルのコヒーレント成分を推定してもよい。ここで、「過去のオーディオ信号」とは、処理対象のブロックより時間的に前のブロックのオーディオ信号である。対象チャネルおよび他チャネルのうちの一方または双方の過去のオーディオ信号も用いて、処理対象のブロックにおける対象チャネルのオーディオ信号を推定することで、コヒーレント成分をより精度良く抽出することが期待できる。 In the above embodiment, the dividing unit 12 estimates the coherent component of a given target channel using the audio signals of channels other than that target channel. As a modification, the dividing unit may estimate the coherent component of the target channel using the audio signal of the other channel together with at least one of the past audio signal of the target channel and the past audio signal of the other channel. Here, a "past audio signal" is the audio signal of a block temporally preceding the block being processed. By also using the past audio signals of one or both of the target channel and the other channel to estimate the audio signal of the target channel in the block being processed, the coherent component can be expected to be extracted with higher accuracy.
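The modification above amounts to estimating the target channel's block from several regressor signals — the other channel's current block, and optionally past blocks of either channel. A hedged sketch under the assumption that a linear least-squares estimate is used (the least-squares projection is one way to realize the "highest correlation" selection rule among linear combinations of the regressors):

```python
import numpy as np

def coherent_from_regressors(target, regressors):
    # Hypothetical sketch of the modification: estimate the target
    # channel's block from several regressors (the other channel's
    # current block and, optionally, past blocks of either channel),
    # here by linear least squares. Among linear combinations of the
    # regressors, the least-squares projection is one way to obtain
    # the estimate with the highest correlation with the target.
    A = np.column_stack(regressors)
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    coherent = A @ coef
    field = target - coherent  # residual = field component
    return coherent, field
```

Adding past blocks simply adds columns to the regressor matrix, which is why the modification can be expected to extract the coherent component more accurately: the estimate can only fit the target at least as well as before.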
 少なくとも一つのプロセッサにより実行されるオーディオ信号処理方法の手順は上記実施形態での例に限定されない。例えば、オーディオ信号処理装置は上述したステップ(処理)の一部を省略してもよいし、別の順序で各ステップを実行してもよい。また、上述したステップのうちの任意の2以上のステップが組み合わされてもよいし、ステップの一部が修正又は削除されてもよい。あるいは、オーディオ信号処理装置は上記の各ステップに加えて他のステップを実行してもよい。 The procedure of the audio signal processing method executed by at least one processor is not limited to the example in the above embodiment. For example, the audio signal processing apparatus may omit some of the steps (processes) described above, or may execute the steps in a different order. Also, any two or more of the steps described above may be combined, or a part of the steps may be corrected or deleted. Alternatively, the audio signal processing apparatus may execute other steps in addition to the above steps.
 オーディオ信号処理装置は、二つの数値の大小関係を比較する際に、「以上」および「よりも大きい」という二つの基準のどちらを用いてもよく、「以下」および「未満」の二つの基準のうちのどちらを用いてもよい。このような基準の選択は、二つの数値の大小関係を比較する処理についての技術的意義を変更するものではない。 When comparing the magnitudes of two numerical values, the audio signal processing device may use either of the two criteria "greater than or equal to" and "greater than", and may use either of the two criteria "less than or equal to" and "less than". The choice between such criteria does not change the technical significance of the process of comparing the magnitudes of the two values.
 10…オーディオ信号処理装置、11…受付部、12…分割部、13…出力部、el…推定誤差、P1…オーディオ信号処理プログラム、P10…メインモジュール、P11…受付モジュール、P12…分割モジュール、P13…出力モジュール。 Reference Signs List: 10: audio signal processing device; 11: reception unit; 12: dividing unit; 13: output unit; el: estimation error; P1: audio signal processing program; P10: main module; P11: reception module; P12: dividing module; P13: output module.

Claims (5)

  1.  複数のチャネルのオーディオ信号を受け付ける受付部と、
     前記オーディオ信号をコヒーレント成分とフィールド成分とに分割する分割処理を各チャネルについて実行する分割部であって、前記分割処理が、
      前記分割処理の対象となる一つの前記チャネルを対象チャネルとした場合に、該対象チャネル以外のチャネルの前記オーディオ信号を少なくとも用いて算出される推定信号のうち該対象チャネルの前記オーディオ信号との相関が最も高い推定信号を該対象チャネルの前記コヒーレント成分として抽出するステップと、
      前記対象チャネルの前記オーディオ信号と該対象チャネルの前記コヒーレント成分との差分を該対象チャネルの前記フィールド成分として抽出するステップと
    を含む、該分割部と、
     前記分割部により抽出された各チャネルの前記コヒーレント成分および前記フィールド成分を出力する出力部と
    を備えるオーディオ信号処理装置。
    An audio signal processing device comprising:
    a reception unit that receives audio signals of a plurality of channels;
    a dividing unit that performs, for each channel, a division process of dividing the audio signal into a coherent component and a field component, the division process including:
      a step of, where one of the channels subject to the division process is taken as a target channel, extracting, as the coherent component of the target channel, the estimated signal having the highest correlation with the audio signal of the target channel among estimated signals calculated using at least the audio signals of channels other than the target channel, and
      a step of extracting the difference between the audio signal of the target channel and the coherent component of the target channel as the field component of the target channel; and
    an output unit that outputs the coherent component and the field component of each channel extracted by the dividing unit.
  2.  前記分割処理が、
      窓関数を用いてオーディオ信号を複数のフレームに区切る処理を各チャネルについて実行するステップと、
      連続する少なくとも二つの前記フレームを一つのブロックにまとめる処理を前記複数のフレームの全体に対して実行することで複数の前記ブロックを生成する処理を各チャネルについて実行するステップと、
      前記ブロックのそれぞれにおいて前記対象チャネルの前記コヒーレント成分を抽出するステップと
    を含む、
    請求項1に記載のオーディオ信号処理装置。
    The audio signal processing device according to claim 1, wherein the division process includes:
    a step of performing, for each channel, a process of dividing the audio signal into a plurality of frames using a window function;
    a step of performing, for each channel, a process of generating a plurality of blocks by performing, over the whole of the plurality of frames, a process of grouping at least two consecutive ones of the frames into one block; and
    a step of extracting the coherent component of the target channel in each of the blocks.
  3.  前記分割部が、
      各チャネルのオーディオ信号を複数のサブバンドに分割することで、各チャネルについて複数のサブバンド信号を生成するステップと、
      前記複数のサブバンドのそれぞれにおいて前記対象チャネルのコヒーレント成分を抽出するステップと、
      前記複数のサブバンドにおけるコヒーレント成分を加算することで前記対象チャネルのコヒーレント成分を抽出するステップと
    を含む、
    請求項1または2に記載のオーディオ信号処理装置。
    The audio signal processing device according to claim 1 or 2, wherein the dividing unit performs:
    a step of generating a plurality of subband signals for each channel by dividing the audio signal of each channel into a plurality of subbands;
    a step of extracting the coherent component of the target channel in each of the plurality of subbands; and
    a step of extracting the coherent component of the target channel by adding the coherent components in the plurality of subbands.
  4.  オーディオ信号処理装置が、複数のチャネルのオーディオ信号を受け付ける受付ステップと、
     前記オーディオ信号処理装置が、前記オーディオ信号をコヒーレント成分とフィールド成分とに分割する分割処理を各チャネルについて実行する分割ステップであって、前記分割処理が、
      前記分割処理の対象となる一つの前記チャネルを対象チャネルとした場合に、該対象チャネル以外のチャネルの前記オーディオ信号を少なくとも用いて算出される推定信号のうち該対象チャネルの前記オーディオ信号との相関が最も高い推定信号を該対象チャネルの前記コヒーレント成分として抽出するステップと、
      前記対象チャネルの前記オーディオ信号と該対象チャネルの前記コヒーレント成分との差分を該対象チャネルの前記フィールド成分として抽出するステップと
    を含む、該分割ステップと、
     前記オーディオ信号処理装置が、前記分割ステップにおいて抽出された各チャネルの前記コヒーレント成分および前記フィールド成分を出力する出力ステップと
    を含むオーディオ信号処理方法。
    An audio signal processing method comprising:
    a reception step in which an audio signal processing device receives audio signals of a plurality of channels;
    a division step in which the audio signal processing device performs, for each channel, a division process of dividing the audio signal into a coherent component and a field component, the division process including:
      a step of, where one of the channels subject to the division process is taken as a target channel, extracting, as the coherent component of the target channel, the estimated signal having the highest correlation with the audio signal of the target channel among estimated signals calculated using at least the audio signals of channels other than the target channel, and
      a step of extracting the difference between the audio signal of the target channel and the coherent component of the target channel as the field component of the target channel; and
    an output step in which the audio signal processing device outputs the coherent component and the field component of each channel extracted in the division step.
  5.  複数のチャネルのオーディオ信号を受け付ける受付ステップと、
     前記オーディオ信号をコヒーレント成分とフィールド成分とに分割する分割処理を各チャネルについて実行する分割ステップであって、前記分割処理が、
      前記分割処理の対象となる一つの前記チャネルを対象チャネルとした場合に、該対象チャネル以外のチャネルの前記オーディオ信号を少なくとも用いて算出される推定信号のうち該対象チャネルの前記オーディオ信号との相関が最も高い推定信号を該対象チャネルの前記コヒーレント成分として抽出するステップと、
      前記対象チャネルの前記オーディオ信号と該対象チャネルの前記コヒーレント成分との差分を該対象チャネルの前記フィールド成分として抽出するステップと
    を含む、該分割ステップと、
     前記分割ステップにおいて抽出された各チャネルの前記コヒーレント成分および前記フィールド成分を出力する出力ステップと
    をコンピュータに実行させるオーディオ信号処理プログラム。
    An audio signal processing program causing a computer to execute:
    a reception step of receiving audio signals of a plurality of channels;
    a division step of performing, for each channel, a division process of dividing the audio signal into a coherent component and a field component, the division process including:
      a step of, where one of the channels subject to the division process is taken as a target channel, extracting, as the coherent component of the target channel, the estimated signal having the highest correlation with the audio signal of the target channel among estimated signals calculated using at least the audio signals of channels other than the target channel, and
      a step of extracting the difference between the audio signal of the target channel and the coherent component of the target channel as the field component of the target channel; and
    an output step of outputting the coherent component and the field component of each channel extracted in the division step.
PCT/JP2017/016019 2016-04-27 2017-04-21 Audio signal processing device, audio signal processing method, and audio signal processing program WO2017188141A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2018514561A JP6846822B2 (en) 2016-04-27 2017-04-21 Audio signal processor, audio signal processing method, and audio signal processing program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2016-089417 2016-04-27
JP2016089417 2016-04-27

Publications (1)

Publication Number Publication Date
WO2017188141A1 true WO2017188141A1 (en) 2017-11-02

Family

ID=60161634

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2017/016019 WO2017188141A1 (en) 2016-04-27 2017-04-21 Audio signal processing device, audio signal processing method, and audio signal processing program

Country Status (2)

Country Link
JP (1) JP6846822B2 (en)
WO (1) WO2017188141A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008536183A * 2005-04-15 2008-09-04 Coding Technologies AB Envelope shaping of uncorrelated signals
JP2013517518A * 2010-01-15 2013-05-16 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for extracting direct/ambience signal from downmix signal and spatial parameter information
JP2016501472A * 2012-11-15 2016-01-18 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Segment-by-segment adjustments to different playback speaker settings for spatial audio signals


Also Published As

Publication number Publication date
JPWO2017188141A1 (en) 2019-03-07
JP6846822B2 (en) 2021-03-24

Similar Documents

Publication Publication Date Title
JP6637014B2 (en) Apparatus and method for multi-channel direct and environmental decomposition for audio signal processing
US8346565B2 (en) Apparatus and method for generating an ambient signal from an audio signal, apparatus and method for deriving a multi-channel audio signal from an audio signal and computer program
CN101842834B (en) Device and method for generating a multi-channel signal using voice signal processing
CA2820351C (en) Apparatus and method for decomposing an input signal using a pre-calculated reference curve
JP5379838B2 (en) Apparatus for determining spatial output multi-channel audio signals
US8817991B2 (en) Advanced encoding of multi-channel digital audio signals
CN102907120B (en) For the system and method for acoustic processing
JP6198800B2 (en) Apparatus and method for generating an output signal having at least two output channels
GB2540175A (en) Spatial audio processing apparatus
JPWO2005112002A1 (en) Audio signal encoding apparatus and audio signal decoding apparatus
US9913036B2 (en) Apparatus and method and computer program for generating a stereo output signal for providing additional output channels
WO2022014326A1 (en) Signal processing device, method, and program
WO2017188141A1 (en) Audio signal processing device, audio signal processing method, and audio signal processing program
EP4252432A1 (en) Systems and methods for audio upmixing
Kraft et al. Low-complexity stereo signal decomposition and source separation for application in stereo to 3D upmixing
JP6694755B2 (en) Channel number converter and its program
AU2015238777B2 (en) Apparatus and Method for Generating an Output Signal having at least two Output Channels
WO2013176073A1 (en) Audio signal conversion device, method, program, and recording medium
CN116643712A (en) Electronic device, system and method for audio processing, and computer-readable storage medium
AU2012252490A1 (en) Apparatus and method for generating an output signal employing a decomposer

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2018514561

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17789424

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 17789424

Country of ref document: EP

Kind code of ref document: A1