KR20130037910A

KR20130037910A - OpenVG-based multi-layer algorithm to determine the position of the nested part

Info

Publication number: KR20130037910A
Application number: KR1020110102447A
Authority: KR
Inventors: 정준영; 이성로; 정민아; 박재희; 김진우; 김준석; 박선; 박희만
Original assignee: 목포대학교산학협력단
Priority date: 2011-10-07
Filing date: 2011-10-07
Publication date: 2013-04-17

Abstract

A method of determining the position coordinates of an OpenVG-based multi-layer overlapping portion is disclosed. The method determines the optimal output signal through mutual similarity calculation and performs time-scale modification based on the waveform similarity overlap-and-add (WSOLA) algorithm so as to prevent phase distortion at the junction between frames. An initialization for the input playback speed and the WSOLA-based time-scale modification of the first frame are performed first. The optimal moving point of the next frame is then predicted from the estimated pitch, and the normalized cross-correlation is calculated only over a small range around the predicted point. A pitch change of the speech signal between frames is detected using a normalized cross-correlation threshold: when the normalized cross-correlation value is less than or equal to the threshold, the normalized cross-correlation is recalculated over the entire range.

Description

OpenVG-based multi-layer algorithm to determine the position of the nested part

The present invention relates to a method of efficiently performing an overlap-add-based time-scale modification technique in audio signal processing.

Time-scale modification (TSM) is a technique that lengthens or shortens a speech signal while minimizing the change in pitch and timbre of the processed speech. It is widely used in speech signal processing and multimedia signal processing. For example, TSM is used as a preprocessing step to improve the recognition rate in speech recognition, and in speech synthesis it is used to adjust the duration of speech so that the generated speech sounds natural. It can also be used to compress data in speech and audio coding. In multimedia signal processing, it is used to keep video and audio synchronized when video is played back faster or slower than normal.

Typical overlap-and-add (OLA) based algorithms for implementing time-scale modification include synchronized overlap-and-add (SOLA), pitch synchronous overlap-and-add (PSOLA), and waveform similarity overlap-and-add (WSOLA). All of these algorithms provide high-quality time-scaled signals. However, the SOLA-based technique has the disadvantage that the length of the processed sound source is not constant, because it relies on the similarity of the output signal. In the PSOLA-based technique, the performance of the time-scale modification is determined by the pitch prediction algorithm: a high-performance pitch predictor gives excellent results, while a poor one degrades them. PSOLA therefore requires a high-performance pitch prediction algorithm, which in turn increases the amount of computation.

The WSOLA algorithm, on the other hand, is widely used because it provides sound quality similar to the other algorithms at a relatively low computational cost. Nevertheless, further reduction in computation is needed to implement WSOLA-based time-scale modification on small information devices with limited resources. Previous proposals reduce only the mutual similarity calculation, which accounts for most of the computation, either by regularly skipping positions at which the similarity is calculated or by arbitrarily selecting the samples that take part in the calculation. However, these conventional methods have the disadvantage that they cannot accurately find the position with the maximum mutual similarity.
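The drawback of the skip-based reduction can be illustrated with a minimal sketch. The signal, the function names `ncc` and `search`, and all parameter values below are hypothetical and chosen only for illustration; they are not taken from the patent.

```python
import math

def ncc(a, b):
    """Normalized cross-correlation of two equal-length frames."""
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return num / den if den > 0.0 else 0.0

def search(x, ref_start, cand_start, length, delta_max, step=1):
    """Return (offset, score) of the best candidate, testing every step-th offset.

    The caller must keep cand_start - delta_max >= 0 and
    cand_start + delta_max + length <= len(x).
    """
    ref = x[ref_start:ref_start + length]
    best = (None, -2.0)
    for d in range(-delta_max, delta_max + 1, step):
        s = cand_start + d
        score = ncc(ref, x[s:s + length])
        if score > best[1]:
            best = (d, score)
    return best

# Synthetic periodic signal with a 20-sample period (illustrative data only).
x = [math.sin(2 * math.pi * n / 20) for n in range(400)]
full = search(x, 100, 205, 40, 10, step=1)     # tests every offset
skipped = search(x, 100, 205, 40, 10, step=4)  # tests every fourth offset
```

In this example the exhaustive search lands on the offset that aligns the candidate with the reference, while the skipping search can only pick the nearest tested offset and so returns a lower similarity.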

When multiple layers are displayed on a screen in marine telematics systems for small and medium-sized ships and leisure boats, a technique that prevents noise in the overlapping parts is required in order to display the objects of each layer efficiently. As a related study, the present invention concerns a method of efficiently performing overlap-add-based time-scale modification (TSM) in audio signal processing: to reduce the execution time of the TSM algorithm based on waveform similarity overlap-and-add (WSOLA), the calculation interval of the normalized cross-correlation is shortened by using pitch prediction based on the normalized cross-correlation.

The WSOLA-based time-scale modification technique does not take the input signal at regular intervals; instead, it determines the optimal output signal through mutual similarity calculation so as to prevent phase distortion at the junction point.

In the conventional WSOLA notation, this can be expressed as

$$y(n) = \frac{\sum_k v(n - L_k)\, x\left(n + \tau^{-1}(L_k) - L_k + \Delta_k\right)}{\sum_k v(n - L_k)}$$

where $x(n)$ is the input signal, $y(n)$ is the output signal, and $v(n)$ is the window signal. In addition, $L_k$ indicates the starting position of the $k$-th consecutive window and $\tau^{-1}(\cdot)$ is the time-warping function. If the windows are arranged at regular intervals of $L$, then $L_k$ is represented by $kL$ and $\tau^{-1}(L_k)$ is $L_k$ multiplied by the playback speed. $\Delta_k$ is the offset that prevents phase distortion; it is obtained by calculating the similarity between the reference signal and the candidate signals within the search interval.

In addition, an initialization for the input playback speed and the WSOLA-based time-scale modification of the first frame are performed, the optimal moving point of the next frame is predicted using the estimated pitch, and a normalized cross-correlation calculation is performed over a small range around the predicted optimal moving point. A pitch change of the speech signal between frames is detected using a normalized cross-correlation threshold, and when the normalized cross-correlation value is less than or equal to the threshold, the normalized cross-correlation is calculated with the range extended to the entire range.

Since a speech signal repeats with its pitch period, the period of the speech can be estimated from the normalized cross-correlation calculated when the WSOLA algorithm is applied, and the optimal moving point of the next frame can be predicted from that period. Calculating the normalized cross-correlation only over a small range around the predicted optimal moving point reduces the computation required for time-scale modification. When a pitch change occurs between frames, however, the normalized cross-correlation calculation is extended to the full range. Applying the method proposed in the present invention reduces the amount of computation by 56% compared with the conventional method.

FIG. 1 is a reference diagram showing the operation of the WSOLA algorithm.
FIG. 2 is a flowchart of an algorithm according to one embodiment of the present invention.
FIG. 3 is a table showing the ratio at which the proposed algorithm is applied according to the normalized cross-correlation threshold, according to an embodiment of the present invention.
FIG. 4 is a table showing the set values of each parameter of the WSOLA algorithm used in the performance evaluation according to an embodiment of the present invention.
FIG. 5 is a table showing WMOPS measurement results according to an embodiment of the present invention.
FIG. 6 is a graph showing the sound quality preference results for male sound sources.
FIG. 7 is a graph showing the sound quality preference results for female sound sources.

The foregoing and further aspects of the present invention will become more apparent from the following detailed description of preferred embodiments with reference to the accompanying drawings. Hereinafter, the present invention is described in detail so that those skilled in the art can easily understand and reproduce it.

The WSOLA-based time-scale modification technique uses the similarity of the input signal to minimize the signal discontinuity at the junction, which is the main problem of the OLA-based technique. That is, the OLA-based technique windows the input signal at regular intervals determined by the playback speed and overlap-adds each windowed segment onto the existing output signal, which degrades sound quality through phase distortion at the OLA junction. To solve this problem, the WSOLA-based technique does not take the input signal at regular intervals; instead, it determines the optimal output signal through mutual similarity calculation so as to prevent phase distortion at the junction. The WSOLA algorithm is expressed as in Equation 1.

In the conventional WSOLA notation, Equation 1 can be written as

$$y(n) = \frac{\sum_k v(n - L_k)\, x\left(n + \tau^{-1}(L_k) - L_k + \Delta_k\right)}{\sum_k v(n - L_k)}$$

Here, $x(n)$ is the input signal, $y(n)$ is the output signal, and $v(n)$ is the window signal. Also, $L_k$ indicates the starting position of the $k$-th consecutive window and $\tau^{-1}(\cdot)$ is the time-warping function. If the windows are placed at regular intervals of $L$, then $L_k$ is represented by $kL$ and $\tau^{-1}(L_k)$ is $L_k$ multiplied by the playback speed. $\Delta_k$ is the offset that prevents phase distortion; it is obtained by calculating the mutual similarity between the reference signal and the candidate signals within the search interval. Here, $k$ is the frame index. Either cross-correlation or normalized cross-correlation may be used to calculate the mutual similarity.
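The normalized cross-correlation mentioned as a similarity measure can be sketched as follows. This uses the standard textbook definition; the function name `normalized_cross_correlation` and the zero-energy convention are assumptions made for illustration, not details stated in the patent.

```python
import math

def normalized_cross_correlation(ref, cand):
    """Return the normalized cross-correlation of two equal-length frames.

    The result lies in [-1, 1]; 1 means the frames have identical shape.
    A frame with zero energy yields 0.0 (an assumed convention).
    """
    if len(ref) != len(cand):
        raise ValueError("frames must have the same length")
    num = sum(r * c for r, c in zip(ref, cand))
    den = math.sqrt(sum(r * r for r in ref) * sum(c * c for c in cand))
    return num / den if den > 0.0 else 0.0
```

Because the measure is normalized by the energy of both frames, a scaled copy of the reference still scores 1, which is why a fixed threshold such as 0.8 can be applied regardless of signal level.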

FIG. 1 shows the operation of the WSOLA algorithm for the case of double speed.

As shown, L is the length of the OLA interval, and signal (1) is the output of the previous frame.

To generate the output of the current frame, the similarity between the reference signal (1') and the candidate signals within the search range around (2) is calculated. The candidate (2') with the maximum similarity is then selected as the output signal and synthesized with output signal (1) by OLA.

To determine the output of the following frame, the signal offset from (2') by the window length is taken as the reference signal. As before, the similarity between this reference signal and the candidate signals within the search range is calculated to determine the output signal. In the WSOLA-based time-scale modification technique, about 90% of the computation is spent on the similarity calculation. Therefore, to implement time-scale modification on small information devices, the computation for mutual similarity must be reduced effectively. To this end, the pitch-prediction-based computation reduction method using normalized cross-correlation proposed by the present invention is described below.
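The operation described above can be sketched as a basic WSOLA loop with a full-range similarity search. This is a simplified illustration, not the patent's implementation: the Hann window, the 50% overlap, and the names `wsola`, `hop`, and `delta_max` are all assumptions.

```python
import math

def ncc(a, b):
    """Normalized cross-correlation of two equal-length frames."""
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return num / den if den > 0.0 else 0.0

def wsola(x, speed, hop, delta_max):
    """Time-scale x by `speed` (>1 shortens, <1 lengthens); needs len(x) >= 2*hop."""
    n_win = 2 * hop
    # Periodic Hann window: copies shifted by `hop` sum to 1 in the overlap.
    win = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_win) for n in range(n_win)]
    y = [win[n] * x[n] for n in range(n_win)]  # first frame: start of the input
    prev, k = 0, 1                             # prev = input start of last segment
    while True:
        center = round(speed * k * hop)        # nominal input position of frame k
        ref_start = prev + hop                 # natural continuation of last segment
        if max(ref_start, center + delta_max) + n_win > len(x):
            break
        ref = x[ref_start:ref_start + n_win]
        best_d, best_score = 0, -2.0
        # Full-range search: every offset in the interval is evaluated.
        for d in range(max(-delta_max, -center), delta_max + 1):
            score = ncc(ref, x[center + d:center + d + n_win])
            if score > best_score:
                best_d, best_score = d, score
        prev = center + best_d
        y.extend([0.0] * hop)
        for n in range(n_win):                 # overlap-add the chosen segment
            y[k * hop + n] += win[n] * x[prev + n]
        k += 1
    return y
```

The inner loop over `d` is the similarity search that the text identifies as the dominant cost; every frame evaluates `2 * delta_max + 1` candidates in this unoptimized form.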

FIG. 2 shows a flowchart of the proposed algorithm.

As shown in the drawing, an initialization for the input playback speed and the WSOLA-based time-scale modification of the first frame are performed first. The pitch of the speech signal is estimated using the normalized cross-correlation calculated during the time-scale modification of the first frame. Next, the optimal moving point of the next frame is predicted using the estimated pitch, and the normalized cross-correlation is calculated over a small range around the predicted optimal moving point. Finally, a pitch change of the speech signal between frames is detected using a normalized cross-correlation threshold. If the normalized cross-correlation value is less than or equal to the threshold, the normalized cross-correlation is recalculated over the entire range. Details of each step are as follows.
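The per-frame decision in this flow can be sketched as a reduced-range search with a full-range fallback. The function names, the `small` half-width parameter, and the test signal are hypothetical; only the 0.8 default threshold comes from the experiments described in this document.

```python
import math

def ncc(a, b):
    """Normalized cross-correlation of two equal-length frames."""
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return num / den if den > 0.0 else 0.0

def best_in_range(x, ref, cand_start, lo, hi):
    """Best (offset, score) among the offsets lo..hi inclusive."""
    best = (lo, -2.0)
    for d in range(lo, hi + 1):
        s = cand_start + d
        score = ncc(ref, x[s:s + len(ref)])
        if score > best[1]:
            best = (d, score)
    return best

def predicted_search(x, ref_start, cand_start, length, delta_max,
                     predicted, small, threshold=0.8):
    """Search near `predicted` first; widen to the full range on a low score.

    Returns (offset, score, used_full_range). The caller must keep all
    candidate positions inside x.
    """
    ref = x[ref_start:ref_start + length]
    lo = max(-delta_max, predicted - small)
    hi = min(delta_max, predicted + small)
    d, score = best_in_range(x, ref, cand_start, lo, hi)
    if score > threshold:
        return d, score, False   # no pitch change detected: keep reduced result
    # Score at or below the threshold: treat as a pitch change, search everything.
    d, score = best_in_range(x, ref, cand_start, -delta_max, delta_max)
    return d, score, True

# Illustrative periodic signal with a 20-sample period.
x = [math.sin(2 * math.pi * n / 20) for n in range(400)]
good = predicted_search(x, 100, 205, 40, 10, predicted=-5, small=2)
bad = predicted_search(x, 100, 205, 40, 10, predicted=5, small=2)
```

With an accurate prediction only `2 * small + 1` candidates are evaluated; with a wrong prediction the reduced search scores poorly, so the fallback recovers the correct offset at the cost of a full search.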

Since a voice signal repeats according to its pitch period, when the WSOLA algorithm is run the period of the voice can be estimated from the distance between the reference signal and the optimal moving point, as shown in Equation 2.

Here, the terms of Equation 2 are the frame index, the period predicted from the frame with that index, and the optimal moving point of that frame.

The period obtained above is used to predict the optimal moving point of the next frame, as in Equation 3.

Here, Equation 3 uses the normalized cross-correlation between the reference signal of the current frame and the signals located at the candidate positions, and its result is the predicted optimal moving point of the next frame. The amount of computation can therefore be reduced by calculating the normalized cross-correlation only over a small range around the predicted optimal moving point.
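One plausible reading of the period estimate and the prediction step is sketched below. The exact forms of Equations 2 and 3 are not reproduced in this text, so the rule used here (evaluate only positions spaced one estimated period apart from the reference and keep the best-scoring one) is an assumption, as are the names `estimate_period` and `predict_offset`.

```python
import math

def ncc(a, b):
    """Normalized cross-correlation of two equal-length frames."""
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return num / den if den > 0.0 else 0.0

def estimate_period(ref_start, chosen_start):
    """Period estimate: distance between the reference and the chosen segment."""
    return abs(chosen_start - ref_start)

def predict_offset(x, ref_start, cand_center, length, delta_max, period):
    """Predict the next optimal offset from period-spaced positions (assumed rule).

    Returns (offset, score), or None if no period-spaced position falls
    inside the search interval.
    """
    if period <= 0:
        return None
    ref = x[ref_start:ref_start + length]
    best = None
    # First multiple of the period that reaches the lower edge of the interval.
    m = math.ceil((cand_center - delta_max - ref_start) / period)
    while ref_start + m * period <= cand_center + delta_max:
        s = ref_start + m * period
        if s >= 0 and s + length <= len(x):
            score = ncc(ref, x[s:s + length])
            if best is None or score > best[1]:
                best = (s - cand_center, score)
        m += 1
    return best

# Illustrative periodic signal with a 20-sample period.
x = [math.sin(2 * math.pi * n / 20) for n in range(400)]
period = estimate_period(100, 120)
predicted = predict_offset(x, 160, 187, 40, 10, period)
```

The predicted offset would then serve as the center of the small search range, so that only a few candidates around it need a normalized cross-correlation evaluation.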

If the pitch of the voice signal changes between the previous frame and the current frame, the optimal moving point of the current frame cannot be found using the pitch obtained from the previous frame. To solve this problem, a normalized cross-correlation threshold is set to detect pitch changes, and the proposed method is applied only when no pitch change is detected. If the normalized cross-correlation coefficient is less than or equal to the threshold, the normalized cross-correlation is calculated over the entire range. Because the reduction in computation depends on the threshold, the ratio of frames processed with the reduced range was measured for each threshold value so that the threshold can be chosen flexibly. For the experiment, a total of 23,992 frames extracted from a 32 kHz mono male and female voice database provided by ETRI were tested.

FIG. 3 shows the rate at which the proposed algorithm is applied according to the normalized cross-correlation threshold.

As shown, the lower the threshold, the higher the rate at which the proposed algorithm is applied, and thus the lower the amount of computation. In the present invention the experiments were run with the normalized cross-correlation threshold set to 0.8, but this value can be adjusted to the resource environment of the device. That is, in a resource-scarce environment the threshold may be set low, which reduces accuracy but further reduces the amount of computation; in a resource-rich environment the threshold may be set high to increase accuracy.
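The relationship between the threshold and the share of frames handled by the reduced-range search can be sketched as follows. The per-frame scores are made-up illustrative values, not measurements from the patent, and `reduced_range_ratio` is a hypothetical helper.

```python
def reduced_range_ratio(peak_scores, threshold):
    """Fraction of frames whose reduced-range NCC peak exceeds the threshold.

    Frames at or below the threshold would fall back to the full-range search.
    """
    if not peak_scores:
        return 0.0
    kept = sum(1 for s in peak_scores if s > threshold)
    return kept / len(peak_scores)

# Hypothetical per-frame peak scores (illustrative only).
peaks = [0.95, 0.91, 0.62, 0.88, 0.79, 0.99, 0.45, 0.85]
ratios = {t: reduced_range_ratio(peaks, t) for t in (0.6, 0.7, 0.8, 0.9)}
```

The ratio can only stay the same or fall as the threshold rises, which matches the trend described for FIG. 3: a lower threshold keeps more frames on the cheap path at the risk of accepting a poorer match.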

To evaluate the computational load and sound quality of the proposed method, WMOPS (weighted million operations per second) was measured and a sound quality preference evaluation was carried out at 0.5x speed. The reduction in computation and the sound quality obtained with the proposed method were compared against the WSOLA algorithm that calculates the normalized cross-correlation over the entire range. As in the previous experiment, ten 32 kHz mono male and female voice samples provided by ETRI were used. The set values of each WSOLA parameter used in the experiment are shown in FIG. 4.

FIG. 5 compares the computational performance of the WSOLA algorithm with that of the proposed algorithm.

As shown, applying the proposed method reduced the amount of computation by 51.8% for the male sound source and by 60.2% for the female sound source.
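As a consistency check, the overall 56% reduction quoted elsewhere in this document matches the unweighted mean of the two per-gender figures, assuming equal weighting of the male and female sources (the text does not state how the overall figure was derived):

```latex
$$\frac{51.8\% + 60.2\%}{2} = \frac{112.0\%}{2} = 56.0\%$$
```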

Next, the sound quality preference between the proposed method and the WSOLA algorithm was evaluated. Eighty men and women in their 20s and 30s with no hearing problems participated in the evaluation.

FIGS. 6 and 7 show the evaluation results for male and female sound sources, respectively.

As shown, for both male and female sound sources, at least 70% of responses rated the sound quality of the two algorithms as similar. Moreover, more than 10% of responses rated the sound quality of the proposed method as better. Therefore, the sound quality remains almost the same even when the amount of computation is reduced by applying the proposed algorithm.

The present invention has been described above with reference to its preferred embodiments. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the disclosed embodiments should be considered in an illustrative rather than a restrictive sense. The scope of the present invention is defined by the appended claims rather than by the foregoing description, and all differences within the scope of equivalents thereof should be construed as being included in the present invention.

Claims

A method for determining the position coordinates of an OpenVG-based multi-layer overlapping part, characterized by determining the optimal output signal through mutual similarity calculation and performing time-scale modification based on the waveform similarity overlap-and-add (WSOLA) algorithm so as to prevent phase distortion at the junction.

The method of claim 1,
wherein an initialization for the input playback speed and the WSOLA-based time-scale modification of the first frame are performed; the optimal moving point of the next frame is predicted using the estimated pitch; a normalized cross-correlation calculation is performed over a small range around the predicted optimal moving point; a pitch change of the speech signal between frames is detected using a normalized cross-correlation threshold; and, when the normalized cross-correlation value is less than or equal to the threshold, the normalized cross-correlation is calculated with the calculation range extended to the entire range.