KR20130037910A - Openvg based multi-layer algorithm to determine the position of the nested part - Google Patents


Info

Publication number
KR20130037910A
Authority
KR
South Korea
Prior art keywords
correlation
normal cross
algorithm
calculation
wsola
Prior art date
Application number
KR1020110102447A
Other languages
Korean (ko)
Inventor
정준영
이성로
정민아
박재희
김진우
김준석
박선
박희만
Original Assignee
목포대학교산학협력단
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 목포대학교산학협력단 filed Critical 목포대학교산학협력단
Priority to KR1020110102447A priority Critical patent/KR20130037910A/en
Publication of KR20130037910A publication Critical patent/KR20130037910A/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/04 Time compression or expansion

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A method of determining position coordinates of an OpenVG-based multi-layer overlapping portion is disclosed. The method determines the optimal output signal through a mutual-similarity calculation and performs a time-base transformation based on the waveform similarity overlap-and-add (WSOLA) algorithm to prevent phase distortion at the junction. An initialization for the input speed and a WSOLA-based time-base transformation of the first frame are performed first. The optimal moving point of the next frame is then predicted using the estimated pitch, and a normalized cross-correlation is calculated over a small range around the predicted point. A pitch change in the speech signal between frames is detected using a normalized cross-correlation threshold: when the normalized cross-correlation value is less than or equal to the threshold, the normalized cross-correlation is calculated again with the calculation range extended to the entire range.

Description

OpenVG based multi-layer algorithm to determine the position of the nested part

The present invention relates to a method of efficiently performing an overlap-and-add-based time-base transformation in audio signal processing.

Time-base transformation, also called time-scale modification (TSM), lengthens or shortens a speech signal while minimizing changes in pitch and timbre in the processed signal. It is widely used in speech signal processing and multimedia signal processing. For example, TSM is used as a preprocessing step to improve the recognition rate in speech recognition, and in speech synthesis it is used to adjust the duration of speech so that the generated speech sounds natural. It can also be used to compress data in speech and audio coding. In multimedia signal processing, it is used to keep video and audio synchronized when video is played back faster or slower than normal.

Typical overlap-and-add (OLA)-based algorithms for time-base transformation include synchronized overlap-and-add (SOLA), pitch synchronous overlap-and-add (PSOLA), and waveform similarity overlap-and-add (WSOLA). All of these algorithms produce high-quality time-base-transformed signals. However, SOLA has the disadvantage that the length of the output is not constant, because it uses the similarity of the output signal. With PSOLA, the performance of the time-base transformation is determined by the pitch prediction algorithm: it is excellent when the pitch predictor performs well and may degrade when the pitch predictor performs poorly. A high-performance pitch prediction algorithm is therefore required, which in turn increases the amount of computation. WSOLA, by contrast, is widely used because it provides sound quality similar to the other algorithms at a relatively low computational cost.

Even so, a further reduction in computation is needed to implement WSOLA-based time-base transformation on small information devices with limited resources. Earlier proposals reduce only the mutual-similarity calculation, which accounts for most of the computation, either by regularly skipping positions at which the similarity is calculated or by arbitrarily selecting the signal samples that take part in the calculation. These conventional methods, however, cannot accurately find the position with the maximum mutual similarity.

When multiple layers are displayed on a screen in marine telematics systems for small and medium-sized vessels and leisure boats, a technique is needed to prevent noise in the overlapping parts so that the objects of each layer are displayed efficiently. As a related study, the present invention concerns time-scale modification (TSM) based on the waveform similarity overlap-and-add (WSOLA) algorithm, as a way of efficiently performing an overlap-and-add-based time-base transformation in audio signal processing. To reduce the execution time of the algorithm, it shortens the calculation interval of the normalized cross-correlation by using pitch prediction based on the normalized cross-correlation.

The WSOLA algorithm-based time-base conversion technique does not take an input signal at regular intervals, and determines the optimum output signal through mutual similarity calculation to prevent phase distortion at the junction point.

[Figure pat00001]

In the expression above, [Figure pat00002] is the input signal, [Figure pat00003] is the output signal, and [Figure pat00004] is the window signal. In addition, [Figure pat00005] indicates the starting position of each successive window, and [Figure pat00006] is the time warping function. If the windows are placed at regular intervals, [Figure pat00007] is represented by [Figure pat00008], and [Figure pat00009] is [Figure pat00010] multiplied by the playback speed. [Figure pat00011] is a value that prevents phase distortion; it is obtained by calculating, within the interval [Figure pat00012], the similarity between the reference signal [Figure pat00013] and the candidate signal [Figure pat00014].
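For orientation, the WSOLA synthesis equation as it is commonly written in the time-scale modification literature is reproduced below. The patent's own equation survives here only as an image placeholder, so every symbol in this block is an assumption taken from that literature rather than recovered from the patent: $x$ is the input, $y$ the output, $v$ the window, $L$ the spacing of regularly placed windows, $\tau^{-1}$ the inverse time warping function, $\alpha$ a constant speed factor, and $\Delta_k$ the per-frame offset chosen by the similarity search.

```latex
y(n) = \frac{\sum_{k} v(n - kL)\, x\!\left(n + \tau^{-1}(kL) - kL + \Delta_k\right)}
            {\sum_{k} v(n - kL)},
\qquad
\tau^{-1}(kL) = \alpha\, kL \quad \text{(constant speed } \alpha\text{)}
```

With $\Delta_k = 0$ for every frame, this reduces to plain OLA at fixed input positions; the similarity search exists only to pick a better $\Delta_k$.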

In addition, an initialization for the input speed and a WSOLA-based time-base transformation of the first frame are performed. The optimal moving point of the next frame is predicted using the estimated pitch, and the normalized cross-correlation is calculated over a small range around the predicted optimal moving point. A pitch change in the speech signal between frames is detected using a normalized cross-correlation threshold; when the normalized cross-correlation value is less than or equal to the threshold, the normalized cross-correlation calculation range is extended to the entire range.

Because a speech signal repeats with its pitch period, the period of the speech can be estimated from the normalized cross-correlation that is calculated when the WSOLA algorithm is applied, and the optimal moving point of the next frame can be predicted from that period. Calculating the normalized cross-correlation only over a small range around the predicted optimal moving point reduces the amount of computation in the time-base transformation. When a pitch change occurs between frames, however, the normalized cross-correlation calculation is extended to the full range. Applying the method proposed in the present invention reduces the amount of computation by 56% compared with the conventional method.

Figure 1 is a reference diagram showing the operation of the WSOLA algorithm.
Figure 2 is a flow diagram of an algorithm in accordance with one embodiment of the present invention.
Figure 3 is a table showing the rate at which the proposed algorithm is applied for each normalized cross-correlation threshold, according to an embodiment of the present invention.
Figure 4 is a table showing the set values of each parameter of the WSOLA algorithm used in the performance evaluation according to an embodiment of the present invention.
Figure 5 is a table showing the WMOPS measurement results according to an embodiment of the present invention.
Figure 6 is a graph showing the sound quality preference results for male sound sources.
Figure 7 is a graph showing the sound quality preference results for female sound sources.

The foregoing and further aspects of the present invention will become more apparent from the following detailed description of preferred embodiments with reference to the accompanying drawings. Hereinafter, the present invention is described in detail so that those skilled in the art can easily understand and reproduce it.

The WSOLA-based time-base transformation uses the similarity of the input signal to minimize the signal discontinuity at the junction, which is a problem of the OLA-based time-base transformation. That is, the OLA-based technique windows the input signal at regular intervals determined by the playback speed and overlap-adds each windowed segment with the existing output signal, which degrades sound quality because of phase distortion at the OLA junction. To solve this problem, the WSOLA-based technique does not take input segments at regular intervals; instead, it determines the optimal output signal through a mutual-similarity calculation so that phase distortion at the junction is prevented. The WSOLA algorithm is expressed as in Equation 1.
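The fixed-interval OLA baseline described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `ola_time_stretch`, the Hann window, the 50% overlap, and the default window length are all assumptions chosen for the example. Because segments are read from fixed input positions with no similarity search, neighbouring segments can meet out of phase, which is the distortion WSOLA is meant to avoid.

```python
import math
from typing import List, Sequence


def ola_time_stretch(x: Sequence[float], speed: float, win_len: int = 64) -> List[float]:
    """Plain OLA: read windows every hop*speed input samples, write them every hop samples."""
    if speed <= 0 or len(x) < win_len:
        raise ValueError("speed must be positive and the input at least one window long")
    hop = win_len // 2  # assumed 50% overlap between output windows
    # Periodic Hann window (an assumed choice of window)
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / win_len) for i in range(win_len)]
    # Number of frames whose input segment fits inside x
    n_frames = int((len(x) - win_len) / (hop * speed)) + 1
    out_len = (n_frames - 1) * hop + win_len
    y = [0.0] * out_len
    gain = [0.0] * out_len
    for k in range(n_frames):
        src = int(round(k * hop * speed))  # fixed input position, no similarity search
        dst = k * hop                      # regular output position
        for i in range(win_len):
            y[dst + i] += window[i] * x[src + i]
            gain[dst + i] += window[i]
    # Normalise by the summed window so the overlap does not change the amplitude
    return [v / g if g > 1e-9 else 0.0 for v, g in zip(y, gain)]
```

With `speed=0.5` the output is roughly twice as long as the input, and with `speed=2.0` roughly half as long; at `speed=1.0` the input is reproduced apart from the very first sample, where the window is zero.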

Figure pat00015

Here, [Figure pat00016] is the input signal, [Figure pat00017] is the output signal, and [Figure pat00018] is the window signal. Also, [Figure pat00019] indicates the starting position of each successive window, and [Figure pat00020] is the time warping function. If the windows are placed at regular intervals, [Figure pat00021] is represented by [Figure pat00022], and [Figure pat00023] is [Figure pat00024] multiplied by the playback speed. [Figure pat00025] is a value that prevents phase distortion; it is obtained by calculating, within the interval [Figure pat00026], the mutual similarity between the reference signal [Figure pat00027] and the candidate signal [Figure pat00028]. Here, [Figure pat00029] is the frame index. Either the cross-correlation or the normalized cross-correlation may be used to calculate the mutual similarity.
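As a concrete reference for the similarity measure, the following is a minimal sketch of a normalized cross-correlation between two equal-length segments. The function name and the convention of returning 0.0 for a zero-energy segment are assumptions made for this example; the patent does not give an implementation.

```python
import math
from typing import Sequence


def normalized_cross_correlation(ref: Sequence[float], cand: Sequence[float]) -> float:
    """Return the NCC of two equal-length segments, a value in [-1, 1]."""
    if len(ref) != len(cand):
        raise ValueError("segments must have the same length")
    dot = sum(r * c for r, c in zip(ref, cand))
    # Product of the two segment energies; zero if either segment is silent
    energy = sum(r * r for r in ref) * sum(c * c for c in cand)
    if energy <= 0.0:
        return 0.0  # assumed convention for silent segments
    return dot / math.sqrt(energy)
```

Unlike the plain cross-correlation, the normalized form is insensitive to the amplitude of the candidate, which is what makes a fixed threshold on its value meaningful.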

Figure 1 shows the operation of the WSOLA algorithm for the case of double speed.

As shown, [Figure pat00030] is the length of the OLA interval, and signal (1) is the output of frame [Figure pat00031]. To generate the output of frame [Figure pat00032], (1') is taken as the reference signal and its similarity to the candidate signals within the range [Figure pat00033] around (2) is calculated. Next, the signal (2') with the maximum similarity is determined as the output signal and is synthesized with the output signal (1) by OLA. To determine the output of frame [Figure pat00034], the signal located one window length [Figure pat00035] after (2') is taken as the reference signal. As before, the similarity to the candidate signals within the range [Figure pat00037] based on [Figure pat00036] is calculated to determine the output signal.

In the WSOLA-based time-base transformation, about 90% of the computation is spent on the similarity calculation. Therefore, to implement the technique on small information devices, the computation of the mutual similarity must be reduced effectively. To this end, the pitch-prediction-based computation reduction method using the normalized cross-correlation, as proposed by the present invention, is described below.
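The full-range similarity search that dominates the cost can be sketched as below. This is an illustration under stated assumptions, not the patent's code: the names `ncc` and `full_range_search`, the use of sample indices for positions, and the symmetric search range `delta_max` are all hypothetical.

```python
import math
from typing import Sequence, Tuple


def ncc(a: Sequence[float], b: Sequence[float]) -> float:
    """Normalized cross-correlation of two equal-length segments."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return dot / norm if norm > 0.0 else 0.0


def full_range_search(x: Sequence[float], ref_start: int, nominal_start: int,
                      win_len: int, delta_max: int) -> Tuple[int, float]:
    """Try every offset in [-delta_max, delta_max] around the nominal position.

    Returns the offset whose segment is most similar to the reference segment,
    together with its NCC score.
    """
    ref = x[ref_start:ref_start + win_len]
    best_delta, best_score = 0, -math.inf
    for delta in range(-delta_max, delta_max + 1):
        start = nominal_start + delta
        if start < 0 or start + win_len > len(x):
            continue  # candidate would fall outside the input
        score = ncc(ref, x[start:start + win_len])
        if score > best_score:
            best_delta, best_score = delta, score
    return best_delta, best_score
```

Each frame costs up to `2 * delta_max + 1` correlations of `win_len` samples each, which is why narrowing the range of offsets is the natural place to save computation.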

Figure 2 shows a flow chart of the proposed algorithm.

As shown in the drawing, an initialization for the input speed and a WSOLA-based time-base transformation of the first frame are performed first. The pitch of the speech signal is estimated using the normalized cross-correlation that is calculated during the time-base transformation of the first frame. Next, the optimal moving point of the next frame is predicted using the estimated pitch, and the normalized cross-correlation is calculated over a small range [Figure pat00038] around the predicted optimal moving point. Finally, a pitch change in the speech signal between frames is detected using a normalized cross-correlation threshold. If the normalized cross-correlation value is less than or equal to the threshold, the normalized cross-correlation is calculated again with the calculation range extended to the entire range. Details of each step are as follows.

Since the voice signal has a characteristic of repeating the signal according to the pitch period, when the WSOLA algorithm is operated, the period of the voice may be estimated by the distance between the reference signal and the optimal moving point as shown in Equation 2.

Figure pat00039

Here, [Figure pat00040] is the [Figure pat00043] period predicted from frames [Figure pat00041] and [Figure pat00042], [Figure pat00044] X, and [Figure pat00045] is the optimal moving point of frame [Figure pat00046].

The period obtained above is used to predict the optimal moving point [Figure pat00047] of the next frame, as in Equation 3.

Figure pat00048

Here, [Figure pat00049] is the normalized cross-correlation between the reference signal of frame [Figure pat00050] and the signal located at [Figure pat00051]. Also, [Figure pat00052] is the predicted optimal moving point of frame [Figure pat00053]. Therefore, the amount of computation can be reduced by calculating the normalized cross-correlation only over a small range [Figure pat00054] around the predicted optimal moving point.
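One plausible way to turn an estimated pitch period into a predicted moving point and a narrow search range is sketched below. Equations 2 and 3 survive only as image placeholders, so this is an assumption rather than the patent's formula: it places the prediction at the position closest to the nominal start that lies a whole number of periods from the reference. The function names, the `radius` parameter, and the clamping to the full search range are all hypothetical.

```python
from typing import Tuple


def predict_moving_point(ref_start: int, nominal_start: int, period: int) -> int:
    """Position nearest to nominal_start that is a whole number of periods from ref_start."""
    if period <= 0:
        raise ValueError("period must be positive")
    cycles = round((nominal_start - ref_start) / period)
    return ref_start + cycles * period


def narrow_search_bounds(predicted: int, radius: int, lo: int, hi: int) -> Tuple[int, int]:
    """Small search range around the prediction, clamped to the full range [lo, hi]."""
    return max(lo, predicted - radius), min(hi, predicted + radius)
```

For example, with a reference at sample 0, a nominal position of 19, and a period of 8, the prediction is sample 16, and a radius of 2 inside a full range of 15 to 23 gives 4 candidate positions instead of 9.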

If the pitch of the speech signal changes between the previous frame and the current frame, the optimal moving point of the current frame cannot be found using the pitch obtained from the previous frame. To solve this problem, the proposed method detects a pitch change by setting a normalized cross-correlation threshold, and the reduced-range search is used only when no pitch change is detected. If the normalized cross-correlation value is less than or equal to the threshold, a pitch change is assumed and the normalized cross-correlation is calculated over the entire range. So that the threshold can be chosen flexibly, the reduction in computation was examined as a function of the threshold by measuring, for each threshold value, the ratio of frames that were processed with the reduced range. The experiment used a total of 23,992 frames extracted from a 32 kHz mono male and female speech database provided by ETRI.
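The threshold test can be sketched as a narrow search with a full-range fallback, following the rule stated in the abstract and claims that a normalized cross-correlation value at or below the threshold triggers the full-range calculation. This is a minimal sketch under assumptions: the function names, the argument layout, and the use of sample indices are hypothetical, and only the default threshold of 0.8 comes from the description.

```python
import math
from typing import Sequence, Tuple


def ncc(a: Sequence[float], b: Sequence[float]) -> float:
    """Normalized cross-correlation of two equal-length segments."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return dot / norm if norm > 0.0 else 0.0


def search(x: Sequence[float], ref_start: int, center: int,
           win_len: int, radius: int) -> Tuple[int, float]:
    """Best-matching segment start within center +/- radius, with its NCC score."""
    ref = x[ref_start:ref_start + win_len]
    best_start, best_score = center, -math.inf
    for start in range(center - radius, center + radius + 1):
        if start < 0 or start + win_len > len(x):
            continue  # candidate would fall outside the input
        score = ncc(ref, x[start:start + win_len])
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


def find_moving_point(x: Sequence[float], ref_start: int, nominal: int, predicted: int,
                      win_len: int, small_radius: int, delta_max: int,
                      threshold: float = 0.8) -> Tuple[int, float, bool]:
    """Narrow search around the prediction; full-range search if the score is too low.

    The third return value reports whether the full-range fallback was used.
    """
    start, score = search(x, ref_start, predicted, win_len, small_radius)
    if score > threshold:
        return start, score, False  # no pitch change detected
    # Score at or below the threshold: treat as a pitch change and search everything
    start, score = search(x, ref_start, nominal, win_len, delta_max)
    return start, score, True
```

When the prediction is good, only `2 * small_radius + 1` correlations are computed; a bad prediction costs the narrow search plus the full one, so the saving depends on how often the fallback fires.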

Figure 3 shows the rate at which the proposed algorithm is applied for each normalized cross-correlation threshold.

As shown, the lower the threshold, the more often the proposed reduced-range search is applied, and the lower the amount of computation. In the present invention the experiments were run with the normalized cross-correlation threshold set to 0.8, but this can be adjusted to the resource environment of the device. In a resource-scarce environment, the threshold may be set low, which reduces accuracy but further reduces the amount of computation. In a resource-rich environment, the threshold may be set high to increase accuracy.
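The trade-off can be made concrete with a small helper that reports how many frames would stay on the reduced range for a given threshold. The function name is hypothetical, and the scores used in the usage note are made-up illustration values, not measurements from the patent.

```python
from typing import Sequence


def reduced_range_ratio(narrow_scores: Sequence[float], threshold: float) -> float:
    """Fraction of frames whose narrow-search NCC exceeds the threshold."""
    if not narrow_scores:
        raise ValueError("at least one frame score is required")
    kept = sum(1 for score in narrow_scores if score > threshold)
    return kept / len(narrow_scores)
```

For the illustrative scores `[0.95, 0.85, 0.75, 0.60]`, a threshold of 0.9 keeps one frame in four on the reduced range, 0.8 keeps two, and 0.7 keeps three, which matches the direction of the relationship described above.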

To evaluate the computational cost and sound quality of the proposed method, WMOPS (weighted million operations per second) measurements and a sound quality preference evaluation were carried out at 0.5x speed. The amount of computation and the sound quality of the proposed method were compared with those of the WSOLA algorithm that calculates the normalized cross-correlation over the entire range. As in the previous experiment, ten 32 kHz mono male and female speech samples provided by ETRI were used. The parameter values used for the WSOLA algorithm in the experiment are shown in Figure 4.

Figure 5 compares the computational performance of the WSOLA algorithm with that of the proposed algorithm.

As shown, applying the proposed method reduced the amount of computation by 51.8% for the male sound sources and by 60.2% for the female sound sources.
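The 56% overall reduction quoted earlier in the description is consistent with the simple mean of these two figures. The patent does not state how the overall figure was derived, so equal weighting of the male and female results is an assumption here.

```latex
\frac{51.8\% + 60.2\%}{2} = \frac{112.0\%}{2} = 56.0\%
```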

Next, we evaluated the sound quality preference of the proposed method and the WSOLA algorithm. Eighty males and females in their 20s and 30s who had no hearing problems participated in the evaluation of sound quality preference.

Figures 6 and 7 show the evaluation results for male and female sound sources, respectively.

As shown, for both the male and female sound sources, 70% or more of the responses rated the sound quality of the two algorithms as similar. Moreover, more than 10% of the responses rated the sound quality of the proposed method as better. Therefore, the sound quality remains almost the same even though the amount of computation is reduced by applying the proposed algorithm.

The present invention has been described above with a focus on its preferred embodiments. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the disclosed embodiments should be considered in an illustrative rather than a restrictive sense. The scope of the present invention is defined by the appended claims rather than by the foregoing description, and all differences within the scope of equivalents thereof should be construed as being included in the present invention.

Claims (2)

1. A method for determining position coordinates of an OpenVG-based multi-layer overlapping part, characterized by determining the optimal output signal through a mutual-similarity calculation and performing a time-base transformation based on the waveform similarity overlap-and-add (WSOLA) algorithm to prevent phase distortion at the junction.

2. The method of claim 1, characterized by performing an initialization for the input speed and the WSOLA-based time-base transformation of the first frame; predicting the optimal moving point of the next frame using the estimated pitch; calculating a normalized cross-correlation over a small range around the predicted optimal moving point; detecting a pitch change of the speech signal between frames using a normalized cross-correlation threshold; and, when the normalized cross-correlation value is less than or equal to the threshold, extending the normalized cross-correlation calculation range to the entire range.
KR1020110102447A 2011-10-07 2011-10-07 Openvg based multi-layer algorithm to determine the position of the nested part KR20130037910A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020110102447A KR20130037910A (en) 2011-10-07 2011-10-07 Openvg based multi-layer algorithm to determine the position of the nested part

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020110102447A KR20130037910A (en) 2011-10-07 2011-10-07 Openvg based multi-layer algorithm to determine the position of the nested part

Publications (1)

Publication Number Publication Date
KR20130037910A true KR20130037910A (en) 2013-04-17

Family

ID=48438740

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020110102447A KR20130037910A (en) 2011-10-07 2011-10-07 Openvg based multi-layer algorithm to determine the position of the nested part

Country Status (1)

Country Link
KR (1) KR20130037910A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3910627A4 (en) * 2019-01-10 2022-06-15 Tencent Technology (Shenzhen) Company Limited Keyword detection method and related device
US11749262B2 (en) 2019-01-10 2023-09-05 Tencent Technology (Shenzhen) Company Limited Keyword detection method and related apparatus

Legal Events

Date Code Title Description
WITN Withdrawal due to no request for examination