WO2010105396A1 - Apparatus and method for recognizing speech emotion change - Google Patents
Apparatus and method for recognizing speech emotion change Download PDFInfo
- Publication number
- WO2010105396A1 (PCT/CN2009/070801)
- Authority
- WO
- WIPO (PCT)
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/26—Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
Definitions
- the present invention relates to the field of speech signal processing and in particular to an apparatus and a method for recognizing a speech emotion change of a speaker from speech data of the speaker.
- the speech emotion recognition technology may be applied to the field of human-machine interaction, and thus may greatly improve friendliness and accuracy of human-machine interaction.
- the conventional solutions only focus on recognizing a speech emotion of a speaker by extracting speech emotion features such as pitch, energy and formant from speech data of the speaker.
- speech emotion features of different speakers are different and even speech emotion features of the same speaker are also different at different time periods, it is difficult to accurately recognize speech emotions of personalized speech data in the conventional solutions.
- the emotion change recognition from a speech of a speaker rather than the emotion recognition from a speech is more interesting in many applications.
- a time point at which an emotion of an actor is changed from "calm" to "exciting" in a video is an appropriate time point for inserting an advertisement into the video. Therefore, in such applications, it is enough to accurately recognize a speech emotion change of a speaker from speech data of the speaker.
- due to the inaccuracy of speech emotion recognition in the conventional solutions, it is difficult to accurately recognize speech emotion changes of personalized speech data according to the speech emotion recognition results of the conventional solutions.
- an object of the invention is to provide an apparatus and a method for recognizing a speech emotion change of a speaker from speech data of the speaker, which are capable of providing good performance on speech emotion change recognition of personalized speech data.
- an embodiment of the invention provides a method of recognizing a speech emotion change of a speaker from speech data of the speaker, which may comprise the following steps: a window dividing step of dividing the speech data of the speaker into a plurality of windows by a window width; a window speech emotion feature calculating step of calculating a speech emotion feature for each of the plurality of windows; and a speech emotion change recognizing step of recognizing the speech emotion change of the speaker for a window set consisting of at least two contiguous windows by comparing the speech emotion features of the window set with each of a plurality of speech emotion feature change templates stored in a speech emotion feature change database to find out a speech emotion feature change template which matches the speech emotion features of the window set.
- an embodiment of the invention provides an apparatus for recognizing a speech emotion change of a speaker from speech data of the speaker, which may comprise: a window dividing means for dividing the speech data of the speaker into a plurality of windows by a window width; a window speech emotion feature calculating means for calculating a speech emotion feature for each of the plurality of windows; and a speech emotion change recognizing means for recognizing the speech emotion change of the speaker for a window set consisting of at least two contiguous windows by comparing the speech emotion features of the window set with each of a plurality of speech emotion feature change templates stored in a speech emotion feature change database to find out a speech emotion feature change template which matches the speech emotion features of the window set.
- an embodiment of the invention provides a computer-readable storage medium with a computer program stored thereon, wherein said computer program, when being executed, causes a computer to execute the above method of recognizing a speech emotion change of a speaker from speech data of the speaker.
- Figure 1 is a flow chart illustrating a method of recognizing a speech emotion change of a speaker from speech data of the speaker according to an embodiment of the invention
- Figure 2 is a flow chart illustrating an implementing example of the speech emotion change recognizing step S130 of Figure 1;
- Figure 3 schematically illustrates waveform graphs of two speech segments of speaker A extracted from dialogue data between speakers A and B;
- Figure 4 schematically illustrates pitch change graphs respectively extracted from two speech segments of Figure 3
- Figure 5 schematically illustrates a pitch change graph of two windows corresponding to two speech segments of Figure 3, where the window width is the minimum length of the two speech segments and the singularities are removed;
- Figure 6 schematically illustrates a pitch change graph of many windows corresponding to two speech segments of Figure 3, where the window width is 10ms and the singularities are removed;
- Figure 7 illustrates an exemplary structure of a speech emotion feature change database employed in the embodiment of the invention
- Figure 8 is a block diagram illustrating a construction of an apparatus for recognizing a speech emotion change of a speaker from speech data of the speaker according to an embodiment of the invention
- Figure 9 is a block diagram illustrating an exemplary construction of the speech emotion change recognizing means 830 of Figure 8; and Figure 10 is a block diagram illustrating an exemplary construction of a computer in which the invention may be implemented.
- Figure 1 is a flow chart illustrating a method of recognizing a speech emotion change of a speaker from speech data of the speaker according to an embodiment of the invention.
- the speech data of the speaker may be inputted via an external device such as a sound recording device, a phone, a PDA or the like.
- the speech data of the speaker may be a whole piece of continuous speech data from the speaker, for example, an oral lecture made by a lecturer.
- the speech data of the speaker may be constituted by one or more continuous speech segments of the speaker extracted from dialogue data of a plurality of speakers comprising the speaker, for example, one or more continuous speech segments of a customer extracted from telephone conversation data between the customer and a call center agent in the application of call center.
- the discrimination of different speakers may be implemented using sndpeek or the like.
- Figure 3 schematically illustrates waveform graphs of two speech segments (a) and (b) of speaker A extracted from dialogue data between speakers A and B.
- the speech data of the speaker is constituted by two speech segments (a) and (b) of the speaker A.
- the method may include a window dividing step S110, a window speech emotion feature calculating step S120 and a speech emotion change recognizing step S130.
- in the window dividing step S110, the speech data of the speaker is divided into a plurality of windows by a window width.
- when the speech data of the speaker is a whole piece of continuous speech data, the window width may be a predetermined time width such as 10ms, 100ms, 1s or the like.
- when the speech data of the speaker is constituted by one or more continuous speech segments, the window width may be such a predetermined time width, or may be determined as the larger of the minimum length of the one or more continuous speech segments and a predetermined time width.
- the speech data of the speaker is constituted by one or more continuous speech segments of the speaker
- one window covers only one speech segment at most, and when one speech segment cannot be divided evenly, the final remainder whose length is less than the window width may be omitted.
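In code, the window dividing step could be sketched roughly as below, assuming the speech data is given as a list of per-segment sample arrays and the window width is expressed in samples (the representation and function name are illustrative, not taken from the patent):

```python
def divide_into_windows(segments, window_width):
    """Divide each speech segment into contiguous windows of window_width
    samples. A window never spans two segments; a final remainder shorter
    than window_width is omitted, as described above."""
    windows = []
    for segment in segments:
        n_full = len(segment) // window_width  # remainder is dropped
        for i in range(n_full):
            windows.append(segment[i * window_width:(i + 1) * window_width])
    return windows
```

For a whole piece of continuous speech data, the same function applies with a single segment.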
- a speech emotion feature is calculated for each of the plurality of windows.
- the speech emotion feature may comprise one or more of speech pitch, speech energy and speech speed.
- an average value of the speech emotion features of respective feature extraction intervals in the window is calculated as the speech emotion feature of the window.
- the feature extraction interval may be set to 10ms or another value depending on a specific design.
- the speech emotion feature of the window may be calculated in another manner depending on a specific design.
- speech emotion feature singularities are removed from the speech emotion features of respective feature extraction intervals in the window.
- the speech emotion feature singularities refer to those feature values equal to or approximate to zero (for example, caused by a silence period or the like), those feature values having a large fluctuation compared with their neighboring feature values (for example, caused by a noise or the like), and so on.
- when all the speech emotion features of the feature extraction intervals in a window are removed as singularities, the window itself may be removed.
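A per-window feature calculation with singularity removal might look like the following sketch; the zero tolerance and the fluctuation ratio used to detect "a large fluctuation compared with neighboring feature values" are illustrative parameters, not values fixed by the patent:

```python
def remove_singularities(values, zero_tol=1e-3, max_ratio=3.0):
    """Drop feature values that are (approximately) zero, e.g. caused by a
    silence period, or that fluctuate strongly against their neighbors,
    e.g. caused by a noise."""
    kept = []
    for i, v in enumerate(values):
        if abs(v) <= zero_tol:
            continue  # zero or near-zero singularity
        prev_v = values[i - 1] if i > 0 else v
        next_v = values[i + 1] if i < len(values) - 1 else v
        neighbour = max(abs(prev_v), abs(next_v), zero_tol)
        if abs(v) / neighbour > max_ratio:
            continue  # large-fluctuation singularity
        kept.append(v)
    return kept

def window_feature(interval_values):
    """Average the per-interval features of one window; return None when
    every interval was removed, so the window itself can be discarded."""
    kept = remove_singularities(interval_values)
    return sum(kept) / len(kept) if kept else None
```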
- the calculated speech emotion features of the respective windows are schematically shown in Figure 6, wherein one point in the time axis represents one window and those windows whose speech emotion features are equal to or approximate to zero are removed.
- the speech emotion change of the speaker for a window set consisting of at least two contiguous windows is recognized by comparing the speech emotion features of the window set with each of a plurality of speech emotion feature change templates stored in a speech emotion feature change database to find out a speech emotion feature change template which matches the speech emotion features of the window set.
- the window set may include a predetermined number of windows, and may be sequentially selected with a moving step whose window number is less than the predetermined number.
- the window set may include all the windows of two successive speech segments, and may be sequentially selected with a move step of one speech segment.
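The sequential selection of window sets with a moving step smaller than the window-set size could be sketched as follows (a hypothetical helper; the patent does not prescribe an implementation):

```python
def iter_window_sets(windows, set_size, step):
    """Yield successive sets of set_size contiguous windows, advancing by
    step windows each time; step < set_size yields overlapping sets."""
    for start in range(0, len(windows) - set_size + 1, step):
        yield windows[start:start + set_size]
```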
- one type of speech emotion change may have a predetermined number of speech emotion feature change templates, each speech emotion feature change template associates one or more representative speech emotion feature change curves (e.g., speech pitch change curve, speech energy change curve, or the like) with one type of speech emotion change, and the speech emotion feature change templates may be generated in advance through a clustering algorithm by statistical analysis of a large corpus of representative speech data from different speakers.
- each speech emotion feature change template associates one or more representative speech emotion feature change curves (e.g., speech pitch change curve, speech energy change curve, or the like) with one type of speech emotion change
- the speech emotion feature change templates may be generated in advance through a clustering algorithm by statistical analysis of a large corpus of representative speech data from different speakers.
- Figure 7 illustrates an exemplary structure of a speech emotion feature change database employed in the embodiment of the invention.
- the speech emotion feature change database includes the following two tables: a speech emotion feature change type table (a) and a speech emotion feature template table (b).
- the speech emotion feature change type table (a) in Figure 7 has two fields of "Change type ID" and "Change type name" and schematically shows four types of exemplary speech emotion changes: "Calm -> Angry", "Angry -> Calm", "Calm -> Happy", and
- the speech emotion feature template table (b) in Figure 7 has three fields of "ID", "Feature value (pitch)" and "Change type ID" and schematically shows one exemplary speech emotion feature curve associated with the speech emotion change of "Calm -> Angry".
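For illustration only, the two tables of Figure 7 might be mirrored in code roughly as below; the concrete IDs and pitch values are hypothetical placeholders:

```python
# Table (a): speech emotion feature change types.
change_types = [
    {"change_type_id": 1, "change_type_name": "Calm -> Angry"},
    {"change_type_id": 2, "change_type_name": "Angry -> Calm"},
    {"change_type_id": 3, "change_type_name": "Calm -> Happy"},
]

# Table (b): each template stores a representative feature curve (here a
# pitch curve) and refers to a change type via change_type_id.
templates = [
    {"id": 1,
     "feature_values": [180.0, 185.0, 250.0, 260.0],  # illustrative pitch curve
     "change_type_id": 1},
]
```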
- Figure 2 is a flow chart illustrating an implementing example of the speech emotion change recognizing step S130 of Figure 1.
- the speech emotion features of the window set are normalized.
- in the Euclidean distance calculating step S220, a Euclidean distance between the normalized speech emotion features of the window set and each of the plurality of speech emotion feature change templates stored in the speech emotion feature change database is calculated.
- a speech emotion feature change template whose Euclidean distance to the normalized speech emotion features of the window set is the smallest and is less than a predetermined threshold is determined as the matching speech emotion feature change template.
- the exemplary speech emotion change template in the speech emotion change template table (b) of Figure 7 is determined as the matching speech emotion feature change template of the speech data in Figure 3 through the above matching process, and thus the speech emotion feature change of the speech data in Figure 3 is recognized as "Calm -> Angry".
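The normalizing, Euclidean distance calculating and determining steps might together be sketched as below. The normalization scheme (zero mean, unit maximum deviation), the threshold, and the assumption that template curves are stored in the same normalized space are all illustrative choices, since the patent does not fix them:

```python
import math

def normalize(features):
    """Scale the window-set features to zero mean and unit maximum deviation."""
    mean = sum(features) / len(features)
    centred = [f - mean for f in features]
    scale = max(abs(c) for c in centred) or 1.0  # guard against a flat curve
    return [c / scale for c in centred]

def match_template(features, templates, threshold):
    """Return the template whose curve has the smallest Euclidean distance
    to the normalized features, or None if that distance is not below the
    predetermined threshold."""
    norm = normalize(features)
    best, best_dist = None, float("inf")
    for template in templates:
        dist = math.sqrt(sum((a - b) ** 2
                             for a, b in zip(norm, template["curve"])))
        if dist < best_dist:
            best, best_dist = template, dist
    return best if best_dist < threshold else None
```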
- the speech emotion change recognizing step S130 in Figure 1 may be performed only if any speech emotion feature change between neighboring windows in the window set exceeds a predetermined threshold.
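This optional pre-check could be sketched as follows (the threshold semantics are assumed; the patent does not fix a value):

```python
def has_significant_change(window_features, threshold):
    """True if any speech emotion feature change between neighboring
    windows in the set exceeds the threshold."""
    return any(abs(b - a) > threshold
               for a, b in zip(window_features, window_features[1:]))
```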
- the method may further comprise a speech emotion recognizing step of recognizing speech emotions of respective windows in the window set according to a recognition result of the speech emotion change in the window set. For example, when the speech emotion feature change of the speech data in Figure 3 is recognized as "Calm -> Angry", the speech emotions of respective windows of speech segment (a) may be recognized as "Calm" and those of speech segment (b) may be recognized as "Angry".
- Figure 8 is a block diagram illustrating a construction of an apparatus for recognizing a speech emotion change of a speaker from speech data of the speaker according to an embodiment of the invention.
- the apparatus 800 may include a window dividing means 810, a window speech emotion feature calculating means 820 and a speech emotion change recognizing means 830.
- the window dividing means 810 may divide the speech data of the speaker into a plurality of windows by a window width.
- the window speech emotion feature calculating means 820 may calculate a speech emotion feature for each of the plurality of windows.
- the speech emotion change recognizing means 830 may recognize the speech emotion change of the speaker for a window set consisting of at least two contiguous windows by comparing the speech emotion features of the window set with each of a plurality of speech emotion feature change templates stored in a speech emotion feature change database to find out a speech emotion feature change template which matches the speech emotion features of the window set.
- Figure 9 is a block diagram illustrating an exemplary construction of the speech emotion change recognizing means 830 of Figure 8.
- the speech emotion change recognizing means 830 may include a normalizing means 910, a Euclidean distance calculating means 920 and a determining means 930.
- the normalizing means 910 may normalize the speech emotion features of the window set.
- the Euclidean distance calculating means 920 may calculate a Euclidean distance between the normalized speech emotion features of the window set and each of the plurality of speech emotion feature change templates stored in the speech emotion feature change database.
- the determining means 930 may determine, as the matching speech emotion feature change template, a speech emotion feature change template whose Euclidean distance to the normalized speech emotion features of the window set is the smallest and is less than a predetermined threshold.
- the apparatus 800 may further comprise a speech emotion recognizing means for recognizing speech emotions of respective windows in the window set according to a recognition result of speech emotion change in the window set.
- the above apparatus and method for recognizing a speech emotion change of a speaker from speech data of the speaker may be applied to many applications.
- a speech emotion change recognition result of a customer may be provided to a call center agent in the form of speech or image during the telephone conversation between the customer and the call center agent so that the call center agent may respond to the speech emotion change of the customer appropriately and rapidly.
- the desired contents of the lecture can be extracted according to a speech emotion change recognition result of the lecturer. For example, the portions of the lecture which exhibit the speech emotion of "sad" may be filtered out so as to extract the optimistic contents of the lecture.
- the above method and apparatus may be implemented by hardware.
- Such hardware may be a single processing device or a plurality of processing devices.
- Such a processing device may be a microprocessor, a microcontroller, a digital processor, a microcomputer, a part of a central processing unit, a state machine, a logic circuit and/or any device capable of manipulating a signal.
- the above method and apparatus may be implemented by either software or firmware.
- a program that constitutes the software is installed, from a storage medium or a network, into a computer having a dedicated hardware configuration, e.g., a general-purpose personal computer 1000 as illustrated in Figure 10, which, when various programs are installed therein, becomes capable of performing various functions.
- a central processing unit (CPU) 1001 performs various processes in accordance with a program stored in a read only memory (ROM) 1002 or a program loaded from a storage section 1008 to a random access memory (RAM) 1003.
- ROM read only memory
- RAM random access memory
- data required when the CPU 1001 performs the various processes or the like is also stored as required.
- the CPU 1001, the ROM 1002 and the RAM 1003 are connected to one another via a bus 1004.
- An input/output interface 1005 is also connected to the bus 1004.
- the following components are connected to the input/output interface 1005:
- An input section 1006 including a keyboard, a mouse, or the like;
- An output section 1007 including a display such as a cathode ray tube (CRT), a liquid crystal display (LCD), or the like, and a loudspeaker or the like;
- the storage section 1008 including a hard disk or the like; and a communication section 1009 including a network interface card such as a LAN card, a modem, or the like.
- the communication section 1009 performs a communication process via the network such as the internet.
- a drive 1010 is also connected to the input/output interface 1005 as required.
- a removable medium 1011 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 1010 as required, so that a computer program read therefrom is installed into the storage section 1008 as required.
- the program that constitutes the software is installed from the network such as the internet or the storage medium such as the removable medium 1011.
- this storage medium is not limited to the removable medium 1011 having the program stored therein as illustrated in Figure 10, which is delivered separately from the device in order to provide the program to the user.
- examples of the removable medium 1011 include the magnetic disk (including a floppy disk (registered trademark)), the optical disk (including a compact disc read-only memory (CD-ROM) and a digital versatile disk (DVD)), and the magneto-optical disk
- the storage medium may be the ROM 1002, the hard disk contained in the storage section 1008, or the like, which has the program stored therein and is delivered to the user together with the device containing it.
Abstract
An apparatus and a method for recognizing a speech emotion change of a speaker from speech data of the speaker are provided, wherein the method comprises the following steps: a window dividing step (S110) of dividing the speech data of the speaker into a plurality of windows by a window width; a window speech emotion feature calculating step (S120) of calculating a speech emotion feature for each of the plurality of windows; and a speech emotion change recognizing step (S130) of recognizing the speech emotion change of the speaker for a window set consisting of at least two contiguous windows by comparing the speech emotion features of the window set with each of a plurality of speech emotion feature change templates stored in a speech emotion feature change database to find out a speech emotion feature change template which matches the speech emotion features of the window set.
Description
APPARATUS AND METHOD FOR RECOGNIZING SPEECH EMOTION CHANGE
Field of the Invention The present invention relates to the field of speech signal processing and in particular to an apparatus and a method for recognizing a speech emotion change of a speaker from speech data of the speaker.
Background of the Invention At present, it has become important to analyze speech data of a speaker to recognize a speech emotion of the speaker. For example, the speech emotion recognition technology may be applied to the field of human-machine interaction, and thus may greatly improve friendliness and accuracy of human-machine interaction.
Thus, various solutions for recognizing a speech emotion of a speaker from speech data of the speaker have been proposed in the prior art. For example, please see Japanese Patent Application Laid-Open No. 2008-076905 and Chinese Patent Application No. 200610097301.6.
The conventional solutions only focus on recognizing a speech emotion of a speaker by extracting speech emotion features such as pitch, energy and formant from speech data of the speaker. However, because speech emotion features of different speakers are different and even speech emotion features of the same speaker are also different at different time periods, it is difficult to accurately recognize speech emotions of personalized speech data in the conventional solutions.
On the other hand, the emotion change recognition from a speech of a speaker rather than the emotion recognition from a speech is more interesting in many applications. For example, in the application of video advertising, a time point at which an emotion of an actor is changed from "calm" to "exciting" in a video is an appropriate time point of inserting an advertisement into the video. Therefore, in such applications, it is enough to accurately recognize a speech emotion change of a speaker from speech data of the speaker. However, due to inaccuracy on speech emotion recognition in the conventional solutions, it is difficult to accurately recognize speech emotion changes of personalized speech data according to speech emotion recognition results of the conventional solutions.
Summary of the Invention
Summary of the invention will be given below to provide basic understanding of some aspects of the invention. It shall be appreciated that this summary is neither exhaustively descriptive of the invention nor intended to define essential or important parts or the scope of the invention, but is merely for the purpose of presenting some concepts in a simplified form and hereby acts as a preamble of detailed description which will be discussed later.
In view of the above circumstances in the prior art, an object of the invention is to provide an apparatus and a method for recognizing a speech emotion change of a speaker from speech data of the speaker, which are capable of providing good performance on speech emotion change recognition of personalized speech data.
In order to achieve the above object, an embodiment of the invention provides a method of recognizing a speech emotion change of a speaker from speech data of the speaker, which may comprise the following steps: a window dividing step of dividing the speech data of the speaker into a plurality of windows by a window width; a window speech emotion feature calculating step of calculating a speech emotion feature for each of the plurality of windows; and a speech emotion change recognizing step of recognizing the speech emotion change of the speaker for a window set consisting of at least two contiguous windows by comparing the speech emotion features of the window set with each of a plurality of speech emotion feature change templates stored in a speech emotion feature change database to find out a speech emotion feature change template which matches the speech emotion features of the window set.
Furthermore, an embodiment of the invention provides an apparatus for recognizing a speech emotion change of a speaker from speech data of the speaker, which may comprise: a window dividing means for dividing the speech data of the speaker into a plurality of windows by a window width; a window speech emotion feature calculating means for calculating a speech emotion feature for each of the plurality of windows; and a speech emotion change recognizing means for recognizing the speech emotion change of the speaker for a window set consisting of at least two contiguous windows by comparing the speech emotion features of the window set with each of a plurality of speech emotion feature change templates stored in a speech emotion feature change database to find out a speech emotion feature change template which matches the speech emotion features of the window set. Furthermore, an embodiment of the invention provides a computer-readable
storage medium with a computer program stored thereon, wherein said computer program, when being executed, causes a computer to execute the above method of recognizing a speech emotion change of a speaker from speech data of the speaker.
According to the above technical solutions of the invention, in view of the fact that a change of speech emotion such as "happy", "angry", "sad", "joy", "fearsome" and the like is always accompanied by a substantial change of speech emotion feature such as speech pitch, speech energy, speech speed or the like, by directly analyzing a speech emotion feature change in speech data of a speaker, it is possible to accurately recognize a speech emotion change of the speaker from speech data of the speaker. These and other advantages of the invention will become more apparent from the following detailed descriptions of preferred embodiments of the invention taken in conjunction with the drawings.
Brief Description of the Drawings The invention can be better understood with reference to the description given below in conjunction with the accompanying drawings, throughout which identical or like components are denoted by identical or like reference signs, and together with which the following detailed description are incorporated into and form a part of the specification and serve to further illustrate preferred embodiments of the invention and to explain principles and advantages of the invention. In the drawings:
Figure 1 is a flow chart illustrating a method of recognizing a speech emotion change of a speaker from speech data of the speaker according to an embodiment of the invention;
Figure 2 is a flow chart illustrating an implementing example of the speech emotion change recognizing step S130 of Figure 1;
Figure 3 schematically illustrates waveform graphs of two speech segments of speaker A extracted from dialogue data between speakers A and B;
Figure 4 schematically illustrates pitch change graphs respectively extracted from two speech segments of Figure 3; Figure 5 schematically illustrates a pitch change graph of two windows corresponding to two speech segments of Figure 3, where the window width is the minimum length of the two speech segments and the singularities are removed;
Figure 6 schematically illustrates a pitch change graph of many windows
corresponding to two speech segments of Figure 3, where the window width is 10ms and the singularities are removed;
Figure 7 illustrates an exemplary structure of a speech emotion feature change database employed in the embodiment of the invention; Figure 8 is a block diagram illustrating a construction of an apparatus for recognizing a speech emotion change of a speaker from speech data of the speaker according to an embodiment of the invention;
Figure 9 is a block diagram illustrating an exemplary construction of the speech emotion change recognizing means 830 of Figure 8; and Figure 10 is a block diagram illustrating an exemplary construction of a computer in which the invention may be implemented.
Detailed Description of the Invention
Exemplary embodiments of the present invention will be described in conjunction with the accompanying drawings hereinafter. For the sake of clarity and conciseness, not all the features of actual implementations are described in the specification. However, it is to be appreciated that, during developing any of such actual implementations, numerous implementation-specific decisions must be made to achieve the developer's specific goals. It shall further be noted that only device structures and/or processing steps closely relevant to solutions of the invention will be illustrated in the drawings while omitting other details less relevant to the invention so as not to obscure the invention due to those unnecessary details.
Figure 1 is a flow chart illustrating a method of recognizing a speech emotion change of a speaker from speech data of the speaker according to an embodiment of the invention. The speech data of the speaker may be inputted via an external device such as a sound recording device, a phone, a PDA or the like. Further, the speech data of the speaker may be a whole piece of continuous speech data from the speaker, for example, an oral lecture given by a lecturer. Alternatively, the speech data of the speaker may be constituted by one or more continuous speech segments of the speaker extracted from dialogue data of a plurality of speakers comprising the speaker, for example, one or more continuous speech segments of a customer extracted from telephone conversation data between the customer and a call center agent in a call center application.
Herein, distinguishing between different speakers may be implemented using a tool such as sndpeek or the like.
For example, Figure 3 schematically illustrates waveform graphs of two speech segments (a) and (b) of speaker A extracted from dialogue data between speakers A and B. In this case, the speech data of the speaker is constituted by two speech segments (a) and (b) of the speaker A.
As illustrated in Figure 1, the method may include a window dividing step S110, a window speech emotion feature calculating step S120 and a speech emotion change recognizing step S130. First, in the window dividing step S110, the speech data of the speaker is divided into a plurality of windows by a window width. When the speech data of the speaker is a whole piece of continuous speech data from the speaker, the window width may be a predetermined time width such as 10ms, 100ms, 1s or the like. When the speech data of the speaker is constituted by one or more continuous speech segments of the speaker, the window width may be a predetermined time width such as 10ms, 100ms, 1s or the like, or may be determined as the larger of the minimum length of the one or more continuous speech segments and a predetermined time width such as 10ms, 100ms, 1s or the like.
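The window dividing step can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name, the sample rate, and the policy of deriving a missing window width from the shortest segment are assumptions based on the description above.

```python
def divide_into_windows(segments, window_width=None, default_width=0.01, sample_rate=16000):
    """Divide speech segments (lists of samples) into fixed-width windows.

    When window_width (in seconds) is not given, it is taken as the larger
    of the shortest segment's length and a predetermined time width, as the
    text describes. A window never straddles two segments, and a final
    remainder shorter than the window width is omitted.
    """
    if window_width is None:
        shortest = min(len(s) for s in segments) / sample_rate
        window_width = max(shortest, default_width)
    step = int(window_width * sample_rate)  # window width in samples
    windows = []
    for seg in segments:
        for start in range(0, len(seg) - step + 1, step):
            windows.append(seg[start:start + step])
    return windows
```

For example, with two segments of 2.5s and 1.0s and no explicit window width, the sketch uses a 1.0s window, yielding two windows from the first segment (the 0.5s remainder is dropped) and one from the second.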
Generally, when the speech data of the speaker is constituted by one or more continuous speech segments of the speaker, one window covers at most one speech segment, and when one speech segment cannot be evenly divided, the final remainder, whose length is less than the window width, may be omitted.
Next, in the window speech emotion feature calculating step S120, a speech emotion feature is calculated for each of the plurality of windows. Preferably, the speech emotion feature may comprise one or more of speech pitch, speech energy and speech speed. However, those skilled in the art should appreciate that the present invention is not limited thereto and other speech emotion features, such as formants or the like, are also applicable to the present invention.
Preferably, in the window speech emotion feature calculating step S120, an average value of the speech emotion features of the respective feature extraction intervals in the window is calculated as the speech emotion feature of the window. Herein, the feature extraction interval may be set to 10ms or another value depending on the specific design. Further, those skilled in the art should appreciate that the speech emotion feature of the window may be calculated in another manner depending on the specific design.
Further preferably, in order to calculate the speech emotion feature of the window more accurately, speech emotion feature singularities are removed from the speech emotion features of the respective feature extraction intervals in the window before the above average value calculating process. Herein, speech emotion feature singularities refer to feature values that are equal or approximately equal to zero (for example, caused by a silence period or the like), feature values that fluctuate strongly compared with their neighboring feature values (for example, caused by noise or the like), and so on.
Further preferably, when the calculated speech emotion feature of a window is equal or approximately equal to zero (for example, when the window contains only a silence period), the window may be removed.
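The averaging and singularity removal described above can be sketched as follows. The function name, the near-zero cutoff, and the median-ratio rule used to judge a "large fluctuation" are illustrative assumptions; the patent does not fix a particular rule.

```python
import statistics

def window_feature(interval_features, ratio=3.0):
    """Average the per-interval feature values of one window (step S120).

    Before averaging, singularities are dropped: values at or near zero
    (e.g. from silence periods) and values fluctuating strongly against
    the rest of the window (e.g. from noise), here judged against the
    window median. A return value of 0.0 marks a window that may itself
    be removed, as the text suggests.
    """
    vals = [v for v in interval_features if v > 1e-6]  # drop near-zero values
    if not vals:
        return 0.0
    med = statistics.median(vals)
    # keep values within a factor `ratio` of the median in either direction
    kept = [v for v in vals if v <= ratio * med and med <= ratio * v]
    return sum(kept) / len(kept) if kept else 0.0
```

For instance, a pitch sequence containing a silent interval (0.0) and one noise spike (1000.0) among values near 105 Hz averages to roughly 105 after both singularities are dropped.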
For example, assuming that speech pitch is adopted as the speech emotion feature, and taking the speech data of a speaker constituted by the speech segments (a) and (b) shown in Figure 3 as an example, the pitch graphs respectively corresponding to the speech segments (a) and (b) are schematically shown in Figure 4. When the window width is set to the minimum length of the speech segments (a) and (b), the calculated speech emotion features of the light-colored window corresponding to the speech segment (a) and the dark-colored window corresponding to the speech segment (b) are schematically shown in Figure 5. When the window width is set to a predetermined time width of 10ms, the calculated speech emotion features of the respective windows are schematically shown in Figure 6, wherein one point on the time axis represents one window and those windows whose speech emotion features are equal or approximately equal to zero are removed.
Finally, in the speech emotion change recognizing step S130, the speech emotion change of the speaker for a window set consisting of at least two contiguous windows is recognized by comparing the speech emotion features of the window set with each of a plurality of speech emotion feature change templates stored in a speech emotion feature change database to find a speech emotion feature change template which matches the speech emotion features of the window set. Generally, the window set may include a predetermined number of windows, and window sets may be sequentially selected with a moving step whose window number is less than the predetermined number. Preferably, when the speech data of the speaker is constituted by at least two continuous speech segments of the speaker, the window set may include all the windows of two successive speech segments, and window sets may be sequentially selected with a moving step of one speech segment.
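The sequential selection of overlapping window sets can be sketched as below; the function name and the default sizes are assumptions for illustration.

```python
def window_sets(windows, set_size=4, move_step=2):
    """Sequentially select sets of `set_size` contiguous windows for
    step S130. Because move_step is less than set_size, successive sets
    overlap, so an emotion change is covered wherever it falls."""
    return [windows[i:i + set_size]
            for i in range(0, len(windows) - set_size + 1, move_step)]
```

With eight windows and the defaults above, this yields three overlapping sets covering windows 0-3, 2-5 and 4-7.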
Further, in a particular implementation of the speech emotion feature change database, one type of speech emotion change may have a predetermined number of speech emotion feature change templates. Each speech emotion feature change template associates one or more representative speech emotion feature change curves (e.g., a speech pitch change curve, a speech energy change curve, or the like) with one type of speech emotion change. The speech emotion feature change templates may be generated in advance through a clustering algorithm by statistical analysis of a large corpus of representative speech data from different speakers.
Figure 7 illustrates an exemplary structure of a speech emotion feature change database employed in the embodiment of the invention. As shown in Figure 7, the speech emotion feature change database includes the following two tables: a speech emotion feature change type table (a) and a speech emotion feature template table (b).
The speech emotion feature change type table (a) in Figure 7 has two fields, "Change type ID" and "Change type name", and schematically shows four types of exemplary speech emotion changes: "Calm -> Angry", "Angry -> Calm", "Calm -> Happy", and "Happy -> Calm". The speech emotion feature template table (b) in Figure 7 has three fields, "ID", "Feature value (pitch)" and "Change type ID", and schematically shows one exemplary speech emotion feature curve associated with the speech emotion change "Calm -> Angry". Those skilled in the art should appreciate that the structure of the speech emotion feature change database in Figure 7 is only exemplary, that the present invention is not limited thereto, and that another structure may be adopted for the speech emotion feature change database depending on the specific design.
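A two-table layout of the kind Figure 7 describes could be held, for instance, in SQLite. The table and column names here are assumptions chosen to mirror the figure, and the stored curve value is made up for illustration.

```python
import sqlite3

# In-memory stand-in for the database of Figure 7.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Table (a): speech emotion feature change types.
cur.execute("CREATE TABLE change_type "
            "(change_type_id INTEGER PRIMARY KEY, change_type_name TEXT)")
cur.executemany("INSERT INTO change_type VALUES (?, ?)",
                [(1, "Calm -> Angry"), (2, "Angry -> Calm"),
                 (3, "Calm -> Happy"), (4, "Happy -> Calm")])

# Table (b): feature templates; each row stores one representative
# pitch-change curve (here an invented comma-separated value list).
cur.execute("CREATE TABLE template "
            "(id INTEGER PRIMARY KEY, feature_value TEXT, "
            "change_type_id INTEGER REFERENCES change_type)")
cur.execute("INSERT INTO template VALUES (1, '0.21,0.22,0.80,0.85', 1)")
conn.commit()

# Looking up the change type name associated with template 1:
row = cur.execute("SELECT change_type_name FROM change_type "
                  "JOIN template USING (change_type_id) "
                  "WHERE template.id = 1").fetchone()
```

The join over the shared "Change type ID" field is what turns a matched template back into a human-readable change type such as "Calm -> Angry".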
Further, the process in the speech emotion change recognizing step S130 may be implemented by various matching algorithms. For example, Figure 2 is a flow chart illustrating an implementing example of the speech emotion change recognizing step S130 of Figure 1. As shown in Figure 2, at the normalizing step S210, the speech emotion features of the window set are normalized. Next, at the Euclidean distance calculating step S220, a Euclidean distance between the normalized speech emotion features of the window set and each of the plurality of speech emotion feature change templates stored in the speech emotion feature change database is calculated. Then, at the determining step S230, the speech emotion feature change template whose Euclidean distance to the normalized speech emotion features of the window set is the smallest and less than a predetermined threshold is determined as the matching speech emotion feature change template. For example, the exemplary speech emotion feature change template in the speech emotion feature template table (b) of Figure 7 is determined through the above matching process as the matching speech emotion feature change template for the speech data in Figure 3, and thus the speech emotion change of the speech data in Figure 3 is recognized as "Calm -> Angry".
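The normalize-distance-threshold matching of steps S210 to S230 can be sketched as below. The choice of min-max normalization is an assumption (the text only says the features are normalized), as are the function name and the default threshold.

```python
import math

def match_template(features, templates, threshold=1.0):
    """Steps S210-S230: min-max normalize the window-set features,
    compute the Euclidean distance to each stored template curve, and
    return the change type of the closest template if that smallest
    distance is below the predetermined threshold, else None."""
    lo, hi = min(features), max(features)
    span = (hi - lo) or 1.0               # avoid division by zero for flat input
    norm = [(v - lo) / span for v in features]
    best_name, best_dist = None, float("inf")
    for name, curve in templates.items():
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(norm, curve)))
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < threshold else None
```

For example, a pitch sequence that jumps from around 100 to around 200 normalizes to a step-shaped curve and matches a rising "Calm -> Angry" template exactly, while a sequence resembling no template closely enough falls below no threshold and yields no match.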
Preferably, in order to enhance the matching performance, the speech emotion change recognizing step S130 in Figure 1 may be performed only if at least one speech emotion feature change between neighboring windows in the window set exceeds a predetermined threshold.
Optionally, the method may further comprise a speech emotion recognizing step of recognizing the speech emotions of respective windows in the window set according to the recognition result of the speech emotion change in the window set. For example, when the speech emotion change of the speech data in Figure 3 is recognized as "Calm -> Angry", the speech emotions of the respective windows of the speech segment (a) may be recognized as "Calm" and the speech emotions of the respective windows of the speech segment (b) may be recognized as "Angry".
The method of recognizing a speech emotion change of a speaker from speech data of the speaker according to an embodiment of the invention has been detailed above with reference to the drawings. An apparatus for recognizing a speech emotion change of a speaker from speech data of the speaker according to an embodiment of the invention will be described below with reference to the drawings.
Figure 8 is a block diagram illustrating a construction of an apparatus for recognizing a speech emotion change of a speaker from speech data of the speaker according to an embodiment of the invention. As shown in Figure 8, the apparatus 800 may include a window dividing means 810, a window speech emotion feature calculating means 820 and a speech emotion change recognizing means 830.
The window dividing means 810 may divide the speech data of the speaker into a plurality of windows by a window width.
The window speech emotion feature calculating means 820 may calculate a speech emotion feature for each of the plurality of windows.
The speech emotion change recognizing means 830 may recognize the speech emotion change of the speaker for a window set consisting of at least two contiguous windows by comparing the speech emotion features of the window set with each of a plurality of speech emotion feature change templates stored in a speech emotion feature change database to find a speech emotion feature change template which matches the speech emotion features of the window set.
Figure 9 is a block diagram illustrating an exemplary construction of the
speech emotion change recognizing means 830 of Figure 8. In this example, the speech emotion change recognizing means 830 may include a normalizing means 910, a Euclidean distance calculating means 920 and a determining means 930. The normalizing means 910 may normalize the speech emotion features of the window set. The Euclidean distance calculating means 920 may calculate a Euclidean distance between the normalized speech emotion features of the window set and each of the plurality of speech emotion feature change templates stored in the speech emotion feature change database. The determining means 930 may determine the speech emotion feature change template whose Euclidean distance to the normalized speech emotion features of the window set is the smallest and less than a predetermined threshold as the matching speech emotion feature change template.
Optionally, the apparatus 800 may further comprise a speech emotion recognizing means for recognizing speech emotions of respective windows in the window set according to a recognition result of speech emotion change in the window set.
How to implement the functions of the respective components of the apparatus 800 in Figure 8 should be apparent from the foregoing descriptions of the corresponding processes, and therefore repeated descriptions thereof are omitted here. As is apparent from the above, according to the technical solution of the present invention, it is possible to accurately recognize a speech emotion change of a speaker from speech data of the speaker.
The above apparatus and method for recognizing a speech emotion change of a speaker from speech data of the speaker according to embodiments of the invention may be applied in many applications. For example, when the above apparatus and method are applied in a call center application, a speech emotion change recognition result of a customer may be provided to a call center agent in the form of speech or image during the telephone conversation between the customer and the call center agent, so that the call center agent may respond to the speech emotion change of the customer appropriately and rapidly. Furthermore, when the above apparatus and method are applied to oral lectures, desired contents of a lecture can be extracted according to a speech emotion change recognition result of the lecturer. For example, the portions of the lecture which exhibit the speech emotion "sad" may be filtered out so as to extract the optimistic contents of the lecture. The above method and apparatus may be implemented by hardware. Such
hardware may be a single processing device or a plurality of processing devices. Such a processing device may be a microprocessor, a microcontroller, a digital processor, a microcomputer, a part of a central processing unit, a state machine, a logic circuit and/or any device capable of manipulating a signal. Also, it should be noted that the above method and apparatus may be implemented by software or firmware. In the case where the above method and apparatus are implemented by software, a program that constitutes the software is installed, from a storage medium or a network, into a computer having a dedicated hardware configuration, e.g., the general-purpose personal computer 1000 illustrated in Figure 10, which, when various programs are installed therein, becomes capable of performing various functions.
In Figure 10, a central processing unit (CPU) 1001 performs various processes in accordance with a program stored in a read only memory (ROM) 1002 or a program loaded from a storage section 1008 to a random access memory (RAM) 1003. In the RAM 1003, data required when the CPU 1001 performs the various processes or the like is also stored as required.
The CPU 1001, the ROM 1002 and the RAM 1003 are connected to one another via a bus 1004. An input/output interface 1005 is also connected to the bus 1004. The following components are connected to the input/output interface 1005: an input section 1006 including a keyboard, a mouse, or the like; an output section 1007 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a loudspeaker or the like; the storage section 1008 including a hard disk or the like; and a communication section 1009 including a network interface card such as a LAN card, a modem, or the like. The communication section 1009 performs communication processes via a network such as the Internet.
A drive 1010 is also connected to the input/output interface 1005 as required.
A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 1010 as required, so that a computer program read therefrom is installed into the storage section 1008 as required.
In the case where the above-described series of processes is implemented by software, the program that constitutes the software is installed from a network such as the Internet or from a storage medium such as the removable medium 1011.
One skilled in the art should note that this storage medium is not limited to the removable medium 1011 having the program stored therein, as illustrated in Figure 10, which is delivered separately from the device in order to provide the program to the user.
Examples of the removable medium 1011 include the magnetic disk (including a floppy disk (registered trademark)), the optical disk (including a compact disk read-only memory (CD-ROM) and a digital versatile disk (DVD)), the magneto-optical disk (including a mini-disk (MD) (registered trademark)), and the semiconductor memory.
Alternatively, the storage medium may be the ROM 1002, the hard disk contained in the storage section 1008, or the like, which has the program stored therein and is delivered to the user together with the device containing it.
It should also be noted that the steps in which the above-described series of processes are performed may naturally be performed chronologically in the order of description, but need not be performed chronologically; some steps may be performed in parallel or independently of one another. Although illustrative embodiments have been described herein, it should be understood that various other changes, replacements and modifications may be effected therein by one skilled in the art without departing from the scope or spirit of the invention. Furthermore, the present application is not limited to the above-described specific embodiments of processes, devices, manufactures, structures of substances, means, methods and steps. One skilled in the art will understand from the disclosure of the present invention that it is possible to use existing processes, devices, manufactures, structures of substances, means, methods or steps, as well as those to be developed in the future, which perform substantially the same functions as the above-described embodiments or obtain substantially the same results. Therefore, the appended claims are intended to cover in their scope such processes, devices, manufactures, structures of substances, means, methods or steps.
Claims
1. A method of recognizing a speech emotion change of a speaker from speech data of the speaker, comprising the following steps: a window dividing step of dividing the speech data of the speaker into a plurality of windows by a window width; a window speech emotion feature calculating step of calculating a speech emotion feature for each of the plurality of windows; and a speech emotion change recognizing step of recognizing the speech emotion change of the speaker for a window set consisting of at least two contiguous windows by comparing the speech emotion features of the window set with each of a plurality of speech emotion feature change templates stored in a speech emotion feature change database to find out a speech emotion feature change template which matches the speech emotion features of the window set.
2. The method according to claim 1, wherein the speech data of the speaker is constituted by one or more continuous speech segments of the speaker extracted from dialogue data of a plurality of speakers comprising the speaker.
3. The method according to claim 1 or 2, wherein the window width is a predetermined time width.
4. The method according to claim 2, wherein the window width is determined by a larger one of the minimum length of the one or more continuous speech segments and a predetermined time width.
5. The method according to claim 1, wherein the speech emotion feature comprises one or more of speech pitch, speech energy and speech speed.
6. The method according to claim 1, wherein the window speech emotion feature calculating step comprises an average value calculating step of calculating an average value of speech emotion features of respective feature extraction intervals in the window as the speech emotion feature of the window.
7. The method according to claim 6, wherein the window speech emotion feature calculating step further comprises a singularity removing step of removing speech emotion feature singularities from the speech emotion features of respective feature extraction intervals in the window, before the average value calculating step.
8. The method according to claim 1, wherein the speech emotion change recognizing step further comprises the following steps: a normalizing step of normalizing the speech emotion features of the window set; a Euclidean distance calculating step of calculating a Euclidean distance between the normalized speech emotion features of the window set and each of the plurality of speech emotion feature change templates stored in the speech emotion feature change database; and a determining step of determining a speech emotion feature change template whose Euclidean distance with the normalized speech emotion features of the window set is the smallest and less than a predetermined threshold as the matching speech emotion feature change template.
9. The method according to claim 1, wherein the speech emotion change recognizing step is performed only if there is any one of speech emotion feature changes between neighboring windows in the window set exceeding a predetermined threshold.
10. The method according to claim 1, further comprising a speech emotion recognizing step of recognizing speech emotions of respective windows in the window set according to a recognition result of speech emotion change in the window set.
11. An apparatus for recognizing a speech emotion change of a speaker from speech data of the speaker, comprising: a window dividing means for dividing the speech data of the speaker into a plurality of windows by a window width; a window speech emotion feature calculating means for calculating a speech emotion feature for each of the plurality of windows; and a speech emotion change recognizing means for recognizing the speech emotion change of the speaker for a window set consisting of at least two contiguous windows by comparing the speech emotion features of the window set with each of a plurality of speech emotion feature change templates stored in a speech emotion feature change database to find out a speech emotion feature change template which matches the speech emotion features of the window set.
12. The apparatus according to claim 11, wherein the speech data of the speaker is constituted by one or more continuous speech segments of the speaker extracted from dialogue data of a plurality of speakers comprising the speaker.
13. The apparatus according to claim 11 or 12, wherein the window width is a predetermined time width.
14. The apparatus according to claim 12, wherein the window width is determined by a larger one of the minimum length of the one or more continuous speech segments and a predetermined time width.
15. The apparatus according to claim 11, wherein the speech emotion feature comprises one or more of speech pitch, speech energy and speech speed.
16. The apparatus according to claim 11, wherein the window speech emotion feature calculating means comprises an average value calculating means for calculating an average value of speech emotion features of respective feature extraction intervals in the window as the speech emotion feature of the window.
17. The apparatus according to claim 16, wherein the window speech emotion feature calculating means further comprises a singularity removing means for removing speech emotion feature singularities from the speech emotion features of respective feature extraction intervals in the window, before the process in the average value calculating means is performed.
18. The apparatus according to claim 11, wherein the speech emotion change recognizing means further comprises: a normalizing means for normalizing the speech emotion features of the window set; a Euclidean distance calculating means for calculating a Euclidean distance between the normalized speech emotion features of the window set and each of the plurality of speech emotion feature change templates stored in the speech emotion feature change database; and a determining means for determining a speech emotion feature change template whose Euclidean distance with the normalized speech emotion features of the window set is the smallest and less than a predetermined threshold as the matching speech emotion feature change template.
19. The apparatus according to claim 11, wherein the process in the speech emotion change recognizing means is performed only if there is any one of speech emotion feature changes between neighboring windows in the window set exceeding a predetermined threshold.
20. The apparatus according to claim 11, further comprising a speech emotion recognizing means for recognizing speech emotions of respective windows in the window set according to a recognition result of speech emotion change in the window set.
21. A computer-readable storage medium with a computer program stored thereon, wherein said computer program, when being executed, causes a computer to execute the method according to any of claims 1 to 10.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2009/070801 WO2010105396A1 (en) | 2009-03-16 | 2009-03-16 | Apparatus and method for recognizing speech emotion change |
CN2009801279599A CN102099853B (en) | 2009-03-16 | 2009-03-16 | Apparatus and method for recognizing speech emotion change |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2009/070801 WO2010105396A1 (en) | 2009-03-16 | 2009-03-16 | Apparatus and method for recognizing speech emotion change |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2010105396A1 true WO2010105396A1 (en) | 2010-09-23 |
Family
ID=42739098
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2009/070801 WO2010105396A1 (en) | 2009-03-16 | 2009-03-16 | Apparatus and method for recognizing speech emotion change |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN102099853B (en) |
WO (1) | WO2010105396A1 (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106971711A (en) * | 2016-01-14 | 2017-07-21 | 芋头科技(杭州)有限公司 | A kind of adaptive method for recognizing sound-groove and system |
CN106971729A (en) * | 2016-01-14 | 2017-07-21 | 芋头科技(杭州)有限公司 | A kind of method and system that Application on Voiceprint Recognition speed is improved based on sound characteristic scope |
CN107133567B (en) * | 2017-03-31 | 2020-01-31 | 北京奇艺世纪科技有限公司 | woundplast notice point selection method and device |
CN107154257B (en) * | 2017-04-18 | 2021-04-06 | 苏州工业职业技术学院 | Customer service quality evaluation method and system based on customer voice emotion |
CN109087670B (en) * | 2018-08-30 | 2021-04-20 | 西安闻泰电子科技有限公司 | Emotion analysis method, system, server and storage medium |
CN108986430A (en) * | 2018-09-13 | 2018-12-11 | 苏州工业职业技术学院 | Net based on speech recognition about vehicle safe early warning method and system |
CN111048075A (en) * | 2018-10-11 | 2020-04-21 | 上海智臻智能网络科技股份有限公司 | Intelligent customer service system and intelligent customer service robot |
CN110619894B (en) * | 2019-09-30 | 2023-06-27 | 北京淇瑀信息科技有限公司 | Emotion recognition method, device and system based on voice waveform diagram |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5812739A (en) * | 1994-09-20 | 1998-09-22 | Nec Corporation | Speech recognition system and speech recognition method with reduced response time for recognition |
WO2007017853A1 (en) * | 2005-08-08 | 2007-02-15 | Nice Systems Ltd. | Apparatus and methods for the detection of emotions in audio interactions |
CN1979491A (en) * | 2005-12-10 | 2007-06-13 | 三星电子株式会社 | Method for music mood classification and system thereof |
- 2009
- 2009-03-16 WO PCT/CN2009/070801 patent/WO2010105396A1/en active Application Filing
- 2009-03-16 CN CN2009801279599A patent/CN102099853B/en active Active
Non-Patent Citations (2)
Title |
---|
HAN, WEIJING ET AL.: "Speech emotion recognition with combined short and long term features", J TSINGHUA UNIV (SCI & TECH), vol. 48, no. S1, 2008, pages 709 - 713 *
ZHAO LASHENG ET AL.: "Survey on speech emotion recognition", APPLICATION RESEARCH OF COMPUTERS, vol. 26, no. 2, February 2009 (2009-02-01), pages 428 - 431 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8948893B2 (en) | 2011-06-06 | 2015-02-03 | International Business Machines Corporation | Audio media mood visualization method and system |
US9235918B2 (en) | 2011-06-06 | 2016-01-12 | International Business Machines Corporation | Audio media mood visualization |
US9953451B2 (en) | 2011-06-06 | 2018-04-24 | International Business Machines Corporation | Audio media mood visualization |
US10255710B2 (en) | 2011-06-06 | 2019-04-09 | International Business Machines Corporation | Audio media mood visualization |
WO2019226406A1 (en) * | 2018-05-25 | 2019-11-28 | Microsoft Technology Licensing, Llc | Dynamic extraction of contextually-coherent text blocks |
US11031003B2 (en) | 2018-05-25 | 2021-06-08 | Microsoft Technology Licensing, Llc | Dynamic extraction of contextually-coherent text blocks |
CN116578691A (en) * | 2023-07-13 | 2023-08-11 | 江西合一云数据科技股份有限公司 | Intelligent pension robot dialogue method and dialogue system thereof |
Also Published As
Publication number | Publication date |
---|---|
CN102099853A (en) | 2011-06-15 |
CN102099853B (en) | 2012-10-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2010105396A1 (en) | Apparatus and method for recognizing speech emotion change | |
CN110085251B (en) | Human voice extraction method, human voice extraction device and related products | |
US9396724B2 (en) | Method and apparatus for building a language model | |
CN107305541B (en) | Method and device for segmenting speech recognition text | |
US10068570B2 (en) | Method of voice recognition and electronic apparatus | |
CN111145756B (en) | Voice recognition method and device for voice recognition | |
WO2014190732A1 (en) | Method and apparatus for building a language model | |
WO2021179701A1 (en) | Multilingual speech recognition method and apparatus, and electronic device | |
CN107564543B (en) | Voice feature extraction method with high emotion distinguishing degree | |
CN108305618B (en) | Voice acquisition and search method, intelligent pen, search terminal and storage medium | |
CN106445915B (en) | New word discovery method and device | |
CN108039181B (en) | Method and device for analyzing emotion information of sound signal | |
JP6622681B2 (en) | Phoneme Breakdown Detection Model Learning Device, Phoneme Breakdown Interval Detection Device, Phoneme Breakdown Detection Model Learning Method, Phoneme Breakdown Interval Detection Method, Program | |
CN113823323B (en) | Audio processing method and device based on convolutional neural network and related equipment | |
US10950221B2 (en) | Keyword confirmation method and apparatus | |
CN113450771B (en) | Awakening method, model training method and device | |
CN110827853A (en) | Voice feature information extraction method, terminal and readable storage medium | |
CN113626614B (en) | Method, device, equipment and storage medium for constructing information text generation model | |
WO2024093578A1 (en) | Voice recognition method and apparatus, and electronic device, storage medium and computer program product | |
CN110475139B (en) | Video subtitle shielding method and device, storage medium and electronic equipment | |
Jia et al. | A deep learning system for sentiment analysis of service calls | |
CN115630643A (en) | Language model training method and device, electronic equipment and storage medium | |
CN113823326B (en) | Method for using training sample of high-efficiency voice keyword detector | |
CN114267342A (en) | Recognition model training method, recognition method, electronic device and storage medium | |
CN114120425A (en) | Emotion recognition method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WWE | WIPO information: entry into national phase |
Ref document number: 200980127959.9 Country of ref document: CN |
121 | EP: the EPO has been informed by WIPO that EP was designated in this application |
Ref document number: 09841681 Country of ref document: EP Kind code of ref document: A1 |
NENP | Non-entry into the national phase |
Ref country code: DE |
122 | EP: PCT application non-entry in European phase |
Ref document number: 09841681 Country of ref document: EP Kind code of ref document: A1 |