US20220277761A1 - Impression estimation apparatus, learning apparatus, methods and programs for the same - Google Patents

Impression estimation apparatus, learning apparatus, methods and programs for the same Download PDF

Info

Publication number
US20220277761A1
US20220277761A1
Authority
US
United States
Prior art keywords
feature amount
learning
impression
voice signal
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US17/630,855
Inventor
Hosana KAMIYAMA
Atsushi Ando
Satoshi KOBASHIKAWA
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp filed Critical Nippon Telegraph and Telephone Corp
Assigned to NIPPON TELEGRAPH AND TELEPHONE CORPORATION reassignment NIPPON TELEGRAPH AND TELEPHONE CORPORATION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: ANDO, ATSUSHI, KAMIYAMA, Hosana, KOBASHIKAWA, Satoshi
Publication of US20220277761A1 publication Critical patent/US20220277761A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G10L25/75 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 for modelling vocal tract parameters
    • G10L25/90 Pitch determination of speech signals

Definitions

  • As a modification, the first embodiment and the second embodiment may be combined.
  • In this case, the impression estimation device 300 includes the second section segmentation unit 121, the second feature amount extraction unit 122 and the second feature amount vector conversion unit 123 in addition to the configuration of the second embodiment.
  • The impression estimation device 300 performs S121, S122 and S123 in addition to the processing in the second embodiment.
  • Similarly, the learning device 400 includes the second section segmentation unit 221, the second feature amount extraction unit 222 and the second feature amount vector conversion unit 223 in addition to the configuration of the second embodiment.
  • The learning device 400 performs S221, S222 and S223 in addition to the processing in the second embodiment.
  • FIG. 11 illustrates experimental results for the case with no second feature amount extraction unit, the case of the first embodiment, the case of the second embodiment and the case of modification 1 of the second embodiment.
  • The first embodiment and the second embodiment, which use the long-time feature amount, show a greater effect than the case using only the first feature amount.
  • As another modification, the first embodiment and the second embodiment may be used selectively according to the language.
  • In this case, the impression estimation device receives language information indicating the kind of language as input, estimates the impression as in the first embodiment for a certain language A, and estimates the impression as in the second embodiment for another language B.
  • Which embodiment gives the higher estimation accuracy is determined beforehand for each language, and the embodiment with the higher accuracy is selected according to the language information at the time of estimation.
  • The language information may be estimated from the voice signal s(t) or may be inputted by a user.
  • The present invention is not limited to the embodiments and modifications described above.
  • The various kinds of processing described above are not only executed time-sequentially according to the description but may also be executed in parallel or individually according to the processing capability of the device which executes the processing, or as needed.
  • Appropriate changes are possible without departing from the purpose of the present invention.
  • The various kinds of processing described above can be executed by making a recording unit 2020 of the computer illustrated in FIG. 12 read a program for executing the respective steps of the method described above and making a control unit 2010, an input unit 2030, an output unit 2040 and the like perform operations.
  • The program in which the processing content is described can be recorded in a computer-readable recording medium.
  • Examples of the computer-readable recording medium are a magnetic recording device, an optical disk, a magneto-optical recording medium, a semiconductor memory and the like.
  • The program is distributed by selling, assigning or lending a portable recording medium such as a DVD or a CD-ROM in which the program is recorded, for example.
  • The program may also be distributed by storing the program in a storage of a server computer and transferring the program from the server computer to another computer via a network.
  • The computer executing such a program first stores, for example, the program recorded in the portable recording medium or the program transferred from the server computer temporarily in its own storage. Then, when executing the processing, the computer reads the program stored in its own recording medium and executes the processing according to the read program.
  • Alternatively, the computer may directly read the program from the portable recording medium and execute the processing according to the program, and further, every time the program is transferred from the server computer to the computer, the processing according to the received program may be executed successively.
  • The processing described above may also be executed by a so-called ASP (Application Service Provider) type service which achieves a processing function only by an execution instruction and result acquisition, without transferring the program from the server computer to the computer.
  • The program in the present embodiment includes information which is provided for processing by an electronic computer and which is equivalent to a program (data which is not a direct command to the computer but has a property of stipulating the processing of the computer, or the like).
  • While the present device is configured by executing a predetermined program on a computer in the present embodiment, at least part of the processing content may be realized by hardware.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

An impression estimation technique that does not require voice recognition is provided. An impression estimation device includes an estimation unit configured to estimate an impression of a voice signal s by defining p1<p2 and using a first feature amount obtained based on a first analysis time length p1 for the voice signal s and a second feature amount obtained based on a second analysis time length p2 for the voice signal s. A learning device includes a learning unit configured to learn an estimation model which estimates the impression of the voice signal by defining p1<p2 and using a first feature amount for learning obtained based on the first analysis time length p1 for a voice signal for learning sL, a second feature amount for learning obtained based on the second analysis time length p2 for the voice signal for learning sL, and an impression label imparted to the voice signal for learning sL.

Description

    TECHNICAL FIELD
  • The present invention relates to an impression estimation technique of estimating an impression that a voice signal gives to a listener.
  • BACKGROUND ART
  • An impression estimation technique capable of estimating an impression of an emergency degree or the like of a person making a phone call in an answering machine message or the like is needed. For example, when the impression of the emergency degree can be estimated using the impression estimation technique, a user can select an answering machine message with a high emergency degree without actually listening to the answering machine message.
  • As the impression estimation technique, Non-Patent Literature 1 is known. In Non-Patent Literature 1, an impression is estimated from vocal tract feature amounts such as MFCC (Mel-Frequency Cepstrum Coefficients) or PNCC (Power Normalized Cepstral Coefficients) and metrical features regarding a pitch and intensity of voice. In addition, in Non-Patent Literature 2, an impression is estimated using an average speech speed as a feature amount.
  • CITATION LIST Non-Patent Literature
  • Non-Patent Literature 1: E. Principi et al., “Acoustic template-matching for automatic emergency state detection: An ELM based algorithm”, Neurocomputing, vol. 52, No. 3, p. 1185-1194, 2011.
  • Non-Patent Literature 2: Inanoglu et al., “Emotive Alert: HMM-Based Emotion Detection In Voicemail Message”, IUI 05, 2005.
  • SUMMARY OF THE INVENTION Technical Problem
  • In the prior art, an impression is estimated using speech content or the like; however, when the estimation result depends on the speech content or the spoken language, voice recognition is needed.
  • The rhythm of speech may differ depending on the impression to be estimated. For example, when the estimation object is the impression of an emergency degree, the rhythm of speech when the emergency degree is high differs from the rhythm of speech when the emergency degree is low. A method of estimating the impression using the rhythm of speech is therefore conceivable; however, the speech speed of the voice is needed in that case. Here, in order to obtain the speech speed, voice recognition is needed.
  • However, since the voice recognition often includes recognition errors, an impression estimation technique which does not require the voice recognition is needed.
  • An object of the present invention is to provide an impression estimation technique which does not require voice recognition.
  • Means for Solving the Problem
  • In order to solve the problem described above, according to an aspect of the present invention, an impression estimation device includes an estimation unit configured to estimate an impression of a voice signal s by defining p1<p2 and using a first feature amount obtained based on a first analysis time length p1 for the voice signal s and a second feature amount obtained based on a second analysis time length p2 for the voice signal s.
  • In order to solve the problem described above, according to another aspect of the present invention, a learning device includes a learning unit configured to learn an estimation model which estimates the impression of the voice signal by defining p1<p2 and using a first feature amount for learning obtained based on a first analysis time length p1 for a voice signal for learning sL, a second feature amount for learning obtained based on a second analysis time length p2 for the voice signal for learning sL, and an impression label imparted to the voice signal for learning sL.
  • Effects of the Invention
  • According to the present invention, an effect of being capable of estimating an impression of speech without requiring voice recognition is accomplished.
  • BRIEF DESCRIPTION OF DRAWINGS
  • FIG. 1 is a functional block diagram of an impression estimation device relating to a first embodiment.
  • FIG. 2 is a diagram illustrating an example of a processing flow of the impression estimation device relating to the first embodiment.
  • FIG. 3 is a diagram illustrating an example of a feature amount F1(i).
  • FIG. 4 is a diagram illustrating a transition example of a second feature amount for which an analysis window is made long.
  • FIG. 5 is a functional block diagram of a learning device relating to the first embodiment.
  • FIG. 6 is a diagram illustrating an example of a processing flow of the learning device relating to the first embodiment.
  • FIG. 7 is a functional block diagram of the impression estimation device relating to a second embodiment.
  • FIG. 8 is a diagram illustrating an example of a processing flow of the impression estimation device relating to the second embodiment.
  • FIG. 9 is a functional block diagram of the learning device relating to the second embodiment.
  • FIG. 10 is a diagram illustrating an example of a processing flow of the learning device relating to the second embodiment.
  • FIG. 11 is a diagram illustrating an experimental result.
  • FIG. 12 is a diagram illustrating a configuration example of a computer which functions as the impression estimation device or the learning device.
  • DESCRIPTION OF EMBODIMENTS
  • Hereinafter, the embodiments of the present invention will be described. Note that, on the drawings used for the description below, same signs are noted for configuration units having the same function and steps of performing same processing, and redundant description is omitted. In the description below, the processing performed in respective element units of vectors or matrixes is applied to all elements of the vectors and the matrixes unless otherwise specified.
  • <Point of First Embodiment>
  • In the present embodiment, by using an analysis window of a long analysis time length, an overall fluctuation of voice is captured. Thus, a rhythm of the voice is extracted and an impression is estimated without using voice recognition.
  • First Embodiment
  • FIG. 1 illustrates a functional block diagram of an impression estimation device relating to the first embodiment, and FIG. 2 illustrates the processing flow.
  • An impression estimation device 100 includes a first section segmentation unit 111, a first feature amount extraction unit 112, a first feature amount vector conversion unit 113, a second section segmentation unit 121, a second feature amount extraction unit 122, a second feature amount vector conversion unit 123, a connection unit 130, and an impression estimation unit 140.
  • The impression estimation device 100 receives a voice signal s=[s(1), s(2), . . . , s(t), . . . , s(T)] as input, estimates the impression of the voice signal s, and outputs an estimated value c. In the present embodiment, the impression of the estimation object is defined as an emergency degree, and an emergency degree label, which takes c=1 when the impression of the voice signal s is estimated to be emergency and c=2 when it is estimated to be non-emergency, is used as the estimated value c. Note that T is the total number of samples of the voice signal s of the estimation object, and s(t) (t=1, 2, . . . , T) is the t-th sample included in the voice signal s of the estimation object.
  • The impression estimation device and a learning device are, for example, special devices configured by loading a special program into a known or dedicated computer including a central processing unit (CPU: Central Processing Unit), a main storage (RAM: Random Access Memory) and the like. The impression estimation device and the learning device execute each processing under control of the central processing unit. Data inputted to the impression estimation device and the learning device and data obtained in each processing are stored in the main storage, for example, and the data stored in the main storage is read out to the central processing unit as needed and utilized in other processing. The respective processing units of the impression estimation device and the learning device may be at least partially configured by hardware such as an integrated circuit. The respective storage units included in the impression estimation device and the learning device can be configured by the main storage such as a RAM (Random Access Memory) or by middleware such as a relational database or a key-value store, for example. The respective storage units do not always need to be provided inside the impression estimation device and the learning device, and may be configured by an auxiliary storage configured by a hard disk, an optical disk, or a semiconductor memory device such as a flash memory, and provided outside the impression estimation device and the learning device.
  • Hereinafter, the respective units will be described.
  • <First Section Segmentation Unit 111 and Second Section Segmentation Unit 121>
  • The first section segmentation unit 111 receives the voice signal s=[s(1), s(2), . . . , s(T)] as the input, uses analysis time length parameters p1 and s1, defines an analysis time length (analysis window width) as p1 and a shift width as s1, segments an analysis section w1(i,j) from the voice signal s (S111), and outputs it. The analysis section w1(i,j) can be expressed as follows for example.
  • w_1(i,j) = s(s_1 \cdot i + j), \quad 0 \leq i \leq \left\lfloor \frac{T - s_1}{s_1} \right\rfloor = I_1, \quad 1 \leq j \leq p_1 \qquad [Math. 1]
  • Here, i is a frame number and j is a sample index within frame i. I1 is the total number of analysis sections when segmenting the voice signal of the estimation object by the analysis time length p1 and the shift width s1. The analysis section w1(i,j) may be multiplied by a window function such as a Hamming window.
  • The second section segmentation unit 121 receives the voice signal s=[s(1), s(2), . . . , s(T)] as the input, uses analysis time length parameters p2 and s2, defines the analysis time length (analysis window width) as p2 and the shift width as s2, segments an analysis section w2(i′,j′) from the voice signal s (S121), and outputs it. The analysis section is given by
  • w_2(i',j') = s(s_2 \cdot i' + j'), \quad 0 \leq i' \leq \left\lfloor \frac{T - s_2}{s_2} \right\rfloor = I_2, \quad 1 \leq j' \leq p_2 \qquad [Math. 2]
  • Here, i′ is the frame number and j′ is the sample index within frame i′. I2 is the total number of analysis sections when segmenting the voice signal of the estimation object by the analysis time length p2 and the shift width s2.
  • Here, the analysis window width p2 is set to a value satisfying p1≠p2. When p1<p2 holds, the larger analysis window width p2 makes it easier to analyze rhythm changes of the sound because the analysis time is longer. For example, when the sampling frequency of the voice is 16000 Hz, the parameters can be set as p1=400 (0.025 second), s1=160 (0.010 second), p2=16000 (1 second), and s2=1600 (0.100 second).
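  • As an illustration of the two section segmentation units, the following is a minimal sketch, assuming the voice signal is held in a NumPy array sampled at 16000 Hz; the helper name `segment` and the synthetic test signal are not from the patent and are used only for illustration.

```python
# Hypothetical sketch of the section segmentation (S111/S121). Parameter
# values follow the example in the text: p1 = 400, s1 = 160 for the short
# sections and p2 = 16000, s2 = 1600 for the long sections.
import numpy as np

def segment(s: np.ndarray, p: int, shift: int) -> np.ndarray:
    """Cut signal s into analysis sections w(i, j) = s(shift * i + j) of
    length p; sections that would run past the end of s are discarded."""
    n_sections = (len(s) - p) // shift + 1
    if n_sections <= 0:
        return np.empty((0, p))
    idx = shift * np.arange(n_sections)[:, None] + np.arange(p)[None, :]
    return s[idx]

sr = 16000
s = np.sin(2 * np.pi * 200 * np.arange(3 * sr) / sr)   # 3 s synthetic signal
w1 = segment(s, p=400, shift=160)      # short sections: vocal tract / pitch features
w2 = segment(s, p=16000, shift=1600)   # long sections: rhythm features
# Each section may optionally be multiplied by a window, e.g. w1 * np.hamming(400).
print(w1.shape, w2.shape)              # (I1, p1) and (I2, p2)
```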
  • <First Feature Amount Extraction Unit 112 and Second Feature Amount Extraction Unit 122>
  • The first feature amount extraction unit 112 receives the analysis section w1(i,j) as the input, extracts a feature amount f1(i,k) from the analysis section w1(i,j) (S112), and outputs it. Here, k is the dimension index of the feature amount, k=1, 2, . . . , K1. An example of a feature amount F1(i)=[f1(i,1), f1(i,2), . . . , f1(i,k), . . . , f1(i,K1)] is illustrated in FIG. 3. As the feature amount, MFCC, which expresses a vocal tract characteristic of the voice, F0, which expresses the pitch of the voice, power, which expresses the volume of the voice, and the like are possible. The feature amounts may be extracted using a known method. In this example, the first feature amount extraction unit 112 extracts a feature amount regarding at least either of the vocal tract and the pitch of the voice.
  • The second feature amount extraction unit 122 receives the analysis section w2(i′,j′) as the input, extracts a feature amount f2(i′,k′) from the analysis section w2(i′,j′) (S122), and outputs it. Here, k′=1, 2, . . . , K2. Since p1<p2 holds, a feature amount which captures the overall change, such as EMS (Envelope Modulation Spectra) (Reference Literature 1), can be used as the feature amount.
  • (Reference Literature 1) J. M. Liss et al., “Discriminating Dysarthria Type From Envelope Modulation Spectra”, J Speech Lang Hear Res. A, 2010.
  • In the example, the second feature amount extraction unit 122 extracts the feature amount regarding the rhythm of the voice signal.
  • In other words, p2 of the second section segmentation unit 121 is set so as to extract the feature amount regarding the rhythm of the voice signal in the second feature amount extraction unit 122, and p1 of the first section segmentation unit 111 is set so as to extract the feature amount regarding at least either of the vocal tract and the pitch of the voice in the first feature amount extraction unit 112.
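  • The following sketch illustrates one possible choice for the two feature extractors; it is an assumption rather than the patent's exact features. The short-window feature here is frame log power (the text also names MFCC and F0, which would require a dedicated library), and the long-window feature is a rough stand-in for EMS: the low-modulation-frequency spectrum of the amplitude envelope of each long section.

```python
# Hypothetical feature extraction (S112/S122) operating on the w1/w2 arrays
# produced by the segmentation sketch above.
import numpy as np

def short_features(w1: np.ndarray) -> np.ndarray:
    """f1(i, k): here a single coefficient per short section, the log power."""
    power = np.mean(w1 ** 2, axis=1) + 1e-10
    return np.log(power)[:, None]                  # shape (I1, K1) with K1 = 1

def long_features(w2: np.ndarray, sr: int = 16000, n_bins: int = 8) -> np.ndarray:
    """f2(i', k'): magnitude of the envelope modulation spectrum in the lowest
    n_bins modulation-frequency bins (about 1 Hz per bin for 1-second sections)."""
    smooth = int(0.025 * sr)                       # 25 ms moving-average envelope
    kernel = np.ones(smooth) / smooth
    env = np.apply_along_axis(
        lambda x: np.convolve(np.abs(x), kernel, mode="same"), 1, w2)
    env = env - env.mean(axis=1, keepdims=True)    # remove DC before the FFT
    spec = np.abs(np.fft.rfft(env, axis=1))
    return spec[:, 1:n_bins + 1]                   # shape (I2, K2) with K2 = n_bins

# f1 = short_features(w1); f2 = long_features(w2)
```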
  • <First Feature Amount Vector Conversion Unit 113 and Second Feature Amount Vector Conversion Unit 123>
  • The first feature amount vector conversion unit 113 receives the feature amount f1(i,k) as the input, converts the feature amount f1(i,k) to a feature amount vector V1 which contributes to determination of the emergency degree (S113), and outputs it. Conversion to the feature amount vector is performed by a known technique, such as taking statistics (the mean, variance, and the like) of the feature amount series, or converting the time-sequential data to a vector with a neural network such as an LSTM (Long Short-Term Memory).
  • For example, in the case of taking the mean and the variance, vectorization is possible as follows.
  • V_1 = [v_1(1), v_1(2), \ldots, v_1(K_1)], \quad v_1(k) = [\mathrm{mean}(F_1(k)), \mathrm{var}(F_1(k))], \quad F_1(k) = [f_1(1,k), f_1(2,k), \ldots, f_1(I_1,k)],
    \mathrm{mean}(F_1(k)) = \frac{1}{I_1} \sum_{i=1}^{I_1} f_1(i,k), \quad \mathrm{var}(F_1(k)) = \frac{1}{I_1} \sum_{i=1}^{I_1} \bigl( f_1(i,k) - \mathrm{mean}(F_1(k)) \bigr)^2 \qquad [Math. 3]
  • The second feature amount vector conversion unit 123 similarly receives the feature amount f2(i′,k′) as the input, converts the feature amount f2(i′,k′) to a feature amount vector V2=[v2(1), v2 (2), . . . , v2 (K2)] which contributes to the determination of the emergency degree (S123), and outputs it. For a conversion method, the method similar to that of the first feature amount vector conversion unit 113 may be used or a different method may be used.
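  • A minimal sketch of the mean/variance vectorization in [Math. 3] follows; the function name `to_vector` is an assumption, and the same routine can serve as either feature amount vector conversion unit.

```python
# Convert a (num_sections, num_coefficients) feature series F into
# V = [mean(F(1)), var(F(1)), mean(F(2)), var(F(2)), ...], as in [Math. 3].
import numpy as np

def to_vector(f: np.ndarray) -> np.ndarray:
    mean = f.mean(axis=0)
    var = f.var(axis=0)            # population variance (divide by I), as in [Math. 3]
    return np.stack([mean, var], axis=1).reshape(-1)

# V1 = to_vector(f1); V2 = to_vector(f2)
```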
  • <Connection Unit 130>
  • The connection unit 130 receives the feature vectors V1 and V2 as the input, connects the feature amount vectors V1 and V2, obtains a connected vector V=[V1,V2] to be used for emergency degree determination (S130), and outputs it.
  • Other than simple vector connection, the connection unit 130 can perform connection by addition or the like when the dimensional numbers K1 and K2 are the same.
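  • A sketch of the connection unit, assuming the two vectors are NumPy arrays; plain concatenation is the default, with element-wise addition as an option when the dimensionalities match.

```python
import numpy as np

def connect(v1: np.ndarray, v2: np.ndarray, by_addition: bool = False) -> np.ndarray:
    """Connection unit S130: concatenate V1 and V2, or add them element-wise
    when they have the same dimensionality."""
    if by_addition and v1.shape == v2.shape:
        return v1 + v2
    return np.concatenate([v1, v2])

# V = connect(V1, V2)
```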
  • <Impression Estimation Unit 140>
  • The impression estimation unit 140 receives the connected vector V as the input, estimates whether the voice signal s is the emergency or the non-emergency from the connected vector V (S140), and outputs the estimated value c (emergency label). The emergency/non-emergency class is estimated by a general machine learning method such as an SVM (Support Vector Machine), Random Forest, or a neural network. While an estimation model needs to be learned beforehand for the estimation, learning data is prepared and learning is performed by a general method. The learning device which learns the estimation model will be described later. The estimation model is a model which takes the connected vector V as input and outputs the estimated value of the impression of the voice signal; in this example, the impression of the estimation object is the emergency or the non-emergency. That is, the impression estimation unit 140 gives the connected vector V to the estimation model as input and obtains the estimated value which is the output of the estimation model.
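  • As an illustration of the estimation step, the following sketch applies a pre-trained scikit-learn SVM to a connected vector; the classifier choice and the label convention (1 = emergency, 2 = non-emergency) follow the embodiment, but the concrete API usage is an assumption.

```python
# Hypothetical inference step (S140) with a classifier trained by the
# learning device described below.
import numpy as np
from sklearn.svm import SVC

def estimate_impression(model: SVC, v: np.ndarray) -> int:
    """Return the estimated emergency-degree label c for a connected vector V."""
    return int(model.predict(v.reshape(1, -1))[0])

# c = estimate_impression(trained_model, V)
```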
  • Compared to the prior art, by capturing a feature regarding the rhythm, estimation accuracy of the impression is improved.
  • In the prior art, an average speech speed of a call is obtained by voice recognition (see Non-Patent Literature 2). However, since the voice with a high emergency degree is spoken in a style of quickly conveying content while thinking, the fluctuation of the speech speed becomes large and an irregular rhythm is generated. A transition of the second feature amount (EMS), for which the analysis window is made long, is illustrated in FIG. 4. FIG. 4 shows the first principal component when principal component analysis is performed on the EMS. While the voice in the emergency changes irregularly, the voice in the non-emergency vibrates stably. It can thus be recognized that, by using the long-time analysis window, a difference in the rhythm appears in the second feature amount.
  • By obtaining the rhythm of the speech as a feature amount in the long-time analysis section of the present embodiment, in addition to the features used in the prior art, namely that the pitch of the voice becomes high and that the intensity becomes high in the case of the voice in the emergency, the impression can be estimated without obtaining the speech speed or a voice recognition result.
  • <Learning Device 200>
  • FIG. 5 illustrates a functional block diagram of the learning device relating to the first embodiment, and FIG. 6 illustrates the processing flow.
  • The learning device 200 includes a first section segmentation unit 211, a first feature amount extraction unit 212, a first feature amount vector conversion unit 213, a second section segmentation unit 221, a second feature amount extraction unit 222, a second feature amount vector conversion unit 223, a connection unit 230, and a learning unit 240.
  • The learning device 200 receives a voice signal for learning sL and an impression label for learning cL as the input, learns the estimation model which estimates the impression of the voice signal, and outputs the learned estimation model. The impression label cL may be manually imparted before learning or may be obtained beforehand from a voice signal for learning sL by some means and imparted.
  • The first section segmentation unit 211, the first feature amount extraction unit 212, the first feature amount vector conversion unit 213, the second section segmentation unit 221, the second feature amount extraction unit 222, the second feature amount vector conversion unit 223 and the connection unit 230 perform processing S211, S212, S213, S221, S222, S223 and S230 similar to the processing S111, S112, S113, S121, S122, S123 and S130 of the first section segmentation unit 111, the first feature amount extraction unit 112, the first feature amount vector conversion unit 113, the second section segmentation unit 121, the second feature amount extraction unit 122, the second feature amount vector conversion unit 123 and the connection unit 130, respectively. However, the processing is performed on the voice signal for learning sL and information originating from the voice signal for learning sL, instead of the voice signal s and information originating from the voice signal s.
  • <Learning Unit 240>
  • The learning unit 240 receives a connected vector VL and the impression label cL as the input, learns the estimation model which estimates the impression of the voice signal (S240), and outputs the learned estimation model. Note that the estimation model may be learned by a general machine learning method such as an SVM (Support Vector Machine), Random Forest, or a neural network.
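  • A minimal training sketch under the assumption that the connected vectors for learning are stacked row-wise into a matrix; an SVM is used here, but any of the classifiers named above could be substituted.

```python
# Hypothetical learning step (S240): fit an estimation model on connected
# vectors V_L and impression labels c_L (values 1 or 2).
import numpy as np
from sklearn.svm import SVC

def learn_estimation_model(V_L: np.ndarray, c_L: np.ndarray) -> SVC:
    """V_L: (num_learning_signals, dim) matrix of connected vectors."""
    model = SVC(kernel="rbf")
    model.fit(V_L, c_L)
    return model
```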
  • <Effect>
  • By the above configuration, the impression can be estimated for free speech content without the need for voice recognition.
  • <Modification>
  • The first feature amount vector conversion unit 113, the second feature amount vector conversion unit 123, the connection unit 130 and the impression estimation unit 140 of the present embodiment may be expressed by one neural network. The entire neural network may be referred to as an estimation unit. In addition, the first feature amount vector conversion unit 113, the second feature amount vector conversion unit 123, the connection unit 130 and the impression estimation unit 140 of the present embodiment may be collectively referred to as the estimation unit. In either case, the estimation unit estimates the impression of the voice signal s using the first feature amount f1(i,k) obtained based on the analysis time length p1 for the voice signal s and the second feature amount f2(i′,k′) obtained based on the analysis time length p2 for the voice signal s.
  • Similarly, the first feature amount vector conversion unit 213, the second feature amount vector conversion unit 223, the connection unit 230 and the learning unit 240 may be expressed by one neural network to perform learning. The entire neural network may be referred to as the learning unit. In addition, the first feature amount vector conversion unit 213, the second feature amount vector conversion unit 223, the connection unit 230 and the learning unit 240 of the present embodiment may be collectively referred to as the learning unit. In either case, the learning unit learns the estimation model which estimates the impression of the voice signal using the first feature amount for learning f1,L(i,k) obtained based on the first analysis time length p1 for the voice signal for learning sL, the second feature amount for learning f2,L(i′,k′) obtained based on the second analysis time length p2 for the voice signal for learning sL, and the impression label cL imparted to the voice signal for learning sL.
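  • The following PyTorch sketch illustrates this modification, expressing the vector conversion, connection, and estimation as a single network; the use of LSTMs and the layer sizes are assumptions for illustration, not the patent's specified architecture.

```python
import torch
import torch.nn as nn

class ImpressionEstimator(nn.Module):
    """Vector conversion (two LSTMs), connection (concatenation), and
    estimation (linear layer) expressed as one neural network."""
    def __init__(self, k1: int, k2: int, hidden: int = 32, n_classes: int = 2):
        super().__init__()
        self.rnn1 = nn.LSTM(k1, hidden, batch_first=True)  # short-window features f1
        self.rnn2 = nn.LSTM(k2, hidden, batch_first=True)  # long-window features f2
        self.out = nn.Linear(2 * hidden, n_classes)        # connection + estimation

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        _, (h1, _) = self.rnn1(f1)                 # f1: (batch, I1, K1)
        _, (h2, _) = self.rnn2(f2)                 # f2: (batch, I2, K2)
        v = torch.cat([h1[-1], h2[-1]], dim=1)     # connected vector V
        return self.out(v)                         # emergency / non-emergency scores

# Training the whole network with nn.CrossEntropyLoss against the impression
# labels c_L corresponds to the learning unit being part of the same network.
```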
  • Further, while the impression of the emergency degree is estimated in the present embodiment, an impression other than the emergency degree can also be the object of estimation, as long as it is an impression for which the rhythm changes depending on the impression.
  • Second Embodiment
  • The description will be given with a focus on a part different from the first embodiment.
  • In the present embodiment, the emergency degree is estimated using long-time feature amount statistics.
  • FIG. 7 illustrates a functional block diagram of the impression estimation device relating to the second embodiment, and FIG. 8 illustrates the processing flow.
  • An impression estimation device 300 includes the first section segmentation unit 111, the first feature amount extraction unit 112, the first feature amount vector conversion unit 113, a statistic calculation unit 311, a third feature amount vector conversion unit 323, the connection unit 130 and the impression estimation unit 140.
  • In the present embodiment, the second section segmentation unit 121, the second feature amount extraction unit 122 and the second feature amount vector conversion unit 123 are removed from the impression estimation device 100, and the statistic calculation unit 311 and the third feature amount vector conversion unit 323 are added. The other configuration is similar to the first embodiment.
  • <Statistic Calculation Unit 311>
  • The statistic calculation unit 311 receives the feature amount f1(i,k) as the input, calculates statistics using analysis time length parameters p3 and s3 (S311), and obtains and outputs a feature amount f3(i″,k)=[f3(i″,k,1), f3(i″,k,2), . . . , f3(i″,k,k″), . . . , f3(i″,k,K3)]. Here, k″=1, 2, . . . , K3 and 0≤i″≤I3, where i″ is an index of the statistic, p3 is the number of samples used when calculating the statistic from the feature amount f1(i,k), s3 is the shift width when calculating the statistic from the feature amount f1(i,k), and I3 is the total number of statistic calculations. A value satisfying p3>2 is set. When p3>2 holds, p3 values of the feature amount f1(i,k) are used, so the analysis time becomes s1×(p3−1)+p1, which is longer than p1, and it becomes easier to analyze the rhythm change of the sound. Here, the analysis time length s1×(p3−1)+p1 corresponds to the analysis time p2 in the first embodiment. The statistic calculation unit 311 performs the long-time window analysis and the conversion to the feature amount regarding the rhythm, similar to the first embodiment, by calculating statistics over the fixed window width s1×(p3−1)+p1 based on the feature amount f1(i,k) obtained by the short-time window analysis. As the statistics, for example, a mean ‘mean’, a standard deviation ‘std’, a maximum value ‘max’, a kurtosis ‘kurtosis’, a skewness ‘skewness’ and a mean absolute deviation ‘mad’ can be obtained, and the computation expressions are as follows.

  • f3(i″,k) = [mean(i″,F1(k)), std(i″,F1(k)), max(i″,F1(k)), kurtosis(i″,F1(k)), skewness(i″,F1(k)), mad(i″,F1(k))]
  • Note that, when MFCC is used for example, these statistics become feature amounts indicating the degree of change of the sound in the respective sections, and this degree of change is a feature amount related to the rhythm.
  • [Math. 4]
    \begin{aligned}
    \mathrm{mean}(i'', F_1(k)) &= \frac{\sum_{i=1}^{p_3} f_1(s_3 \cdot i'' + i,\, k)}{p_3} \\
    \mathrm{std}(i'', F_1(k)) &= \sqrt{\frac{\sum_{i=1}^{p_3} \bigl(f_1(s_3 \cdot i'' + i,\, k) - \mathrm{mean}(i'', F_1(k))\bigr)^2}{p_3 - 1}} \\
    \mathrm{max}(i'', F_1(k)) &= \max_{1 \le i \le p_3} f_1(s_3 \cdot i'' + i,\, k) \\
    \mathrm{kurtosis}(i'', F_1(k)) &= \frac{p_3 (p_3 + 1) \sum_{i=1}^{p_3} \bigl(f_1(s_3 \cdot i'' + i,\, k) - \mathrm{mean}(i'', F_1(k))\bigr)^4}{(p_3 - 1)(p_3 - 2)(p_3 - 3)\,\bigl(\mathrm{std}(i'', F_1(k))\bigr)^4} \\
    \mathrm{skewness}(i'', F_1(k)) &= \frac{p_3 \sum_{i=1}^{p_3} \bigl(f_1(s_3 \cdot i'' + i,\, k) - \mathrm{mean}(i'', F_1(k))\bigr)^3}{(p_3 - 1)(p_3 - 2)\,\bigl(\mathrm{std}(i'', F_1(k))\bigr)^3} \\
    \mathrm{mad}(i'', F_1(k)) &= \frac{\sum_{i=1}^{p_3} \bigl|\, f_1(s_3 \cdot i'' + i,\, k) - \mathrm{mean}(i'', F_1(k)) \,\bigr|}{p_3}
    \end{aligned}
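  • As a concrete illustration, the computation of S311 can be sketched as follows. This is a minimal sketch, not the reference implementation: the array shapes, the window parameters p3=10 and s3=5, and the definition of I3 are assumptions, and p3 must be larger than 3 for the kurtosis denominator to be non-zero.

import numpy as np

def calc_statistics(f1, p3=10, s3=5):
    # f1: (I1, K1) first feature amounts f1(i,k); returns f3 of shape (I3, K1, 6)
    I1, K1 = f1.shape
    I3 = (I1 - p3) // s3 + 1                     # assumed definition of the number of windows
    f3 = np.empty((I3, K1, 6))
    for j in range(I3):                          # j plays the role of the index i'' in the text
        w = f1[j * s3 : j * s3 + p3]             # p3 samples taken with shift width s3
        m = w.mean(axis=0)
        sd = w.std(axis=0, ddof=1)               # sample standard deviation (p3 - 1 in the denominator)
        d = w - m
        kurt = p3 * (p3 + 1) * (d ** 4).sum(axis=0) / ((p3 - 1) * (p3 - 2) * (p3 - 3) * sd ** 4)
        skew = p3 * (d ** 3).sum(axis=0) / ((p3 - 1) * (p3 - 2) * sd ** 3)
        mad = np.abs(d).mean(axis=0)
        f3[j] = np.stack([m, sd, w.max(axis=0), kurt, skew, mad], axis=-1)
    return f3

f3 = calc_statistics(np.random.randn(200, 20))   # e.g. 200 frames of a 20-dimensional f1(i,k)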
  • <Third Feature Amount Vector Conversion Unit 323>
  • The third feature amount vector conversion unit 323 receives the feature amount f3(i″,k) as input, converts it into a feature amount vector V3=[v3(1), v3(2), . . . , v3(K1)] that contributes to the determination of the emergency degree (S323), and outputs it. The vectorization can be performed by a method similar to that of the first embodiment. For example, when the mean and the variance are taken, the vectorization is as follows.
  • [Math. 5]
    \begin{aligned}
    V_3 &= [v_3(1), v_3(2), \ldots, v_3(K_1)] \\
    v_3(k) &= [\mathrm{mean}(F_3(k)), \mathrm{var}(F_3(k))] \\
    F_3(k) &= [f_3(1, k), f_3(2, k), \ldots, f_3(I_3, k)] \\
    f_3(i'', k) &= [f_3(i'', k, 1), f_3(i'', k, 2), \ldots, f_3(i'', k, K_3)] \\
    \mathrm{mean}(F_3(k)) &= [\mathrm{mean}(f_3(k, 1)), \mathrm{mean}(f_3(k, 2)), \ldots, \mathrm{mean}(f_3(k, K_3))] \\
    \mathrm{mean}(f_3(k, k'')) &= \frac{\sum_{i''=1}^{I_3} f_3(i'', k, k'')}{I_3} \\
    \mathrm{var}(F_3(k)) &= [\mathrm{var}(f_3(k, 1)), \mathrm{var}(f_3(k, 2)), \ldots, \mathrm{var}(f_3(k, K_3))] \\
    \mathrm{var}(f_3(k, k'')) &= \frac{\sum_{i''=1}^{I_3} \bigl(f_3(i'', k, k'') - \mathrm{mean}(f_3(k, k''))\bigr)^2}{I_3}
    \end{aligned}
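  • A minimal sketch of S323 under the same assumptions (the statistics f3 are given as an array of shape (I3, K1, K3), and the shapes and names are assumptions) could be:

import numpy as np

def to_feature_vector(f3):
    # f3: (I3, K1, K3) statistics from the statistic calculation unit
    mean = f3.mean(axis=0)                        # mean(F3(k)) for every k, shape (K1, K3)
    var = f3.var(axis=0)                          # var(F3(k)), divided by I3 as in [Math. 5]
    return np.concatenate([mean, var], axis=-1)   # v3(k) = [mean(F3(k)), var(F3(k))], shape (K1, 2*K3)

V3 = to_feature_vector(np.random.randn(39, 20, 6))   # e.g. I3 = 39 windows, K1 = 20 dimensions, K3 = 6 statistics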
  • Note that the connection unit 130 performs the processing S130 by using the feature amount vector V3 instead of the feature amount vector V2.
  • <Learning Device 400>
  • FIG. 9 illustrates a functional block diagram of the learning device relating to the second embodiment, and FIG. 10 illustrates the processing flow.
  • The learning device 400 includes the first section segmentation unit 211, the first feature amount extraction unit 212, the first feature amount vector conversion unit 213, a statistic calculation unit 411, a third feature amount vector conversion unit 423, the connection unit 230 and the learning unit 240.
  • The learning device 400 receives a voice signal for learning sL(t) and the impression label for learning cL as the input, learns the estimation model which estimates the impression of the voice signal, and outputs the learned estimation model.
  • The statistic calculation unit 411 and the third feature amount vector conversion unit 423 perform processing S411 and S423 similar to the processing S311 and S323 of the statistic calculation unit 311 and the third feature amount vector conversion unit 323, respectively. However, the processing is performed on the voice signal for learning sL(t) and information derived from it, instead of on the voice signal s(t) and information derived from s(t). The other configuration is as described in the first embodiment. Note that the connection unit 230 performs the processing S230 using the feature amount vector V3,L instead of the feature amount vector V2,L.
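  • As an illustration only, the learning step could be sketched as follows with scikit-learn, using one of the model families named in the claims (an SVM here). The connected vectors, their dimensionality and the label coding are stand-in assumptions, not the data of the present embodiment.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
V_L = rng.normal(size=(100, 240))          # stand-in connected vectors VL for 100 learning utterances (assumed size)
c_L = rng.integers(0, 2, size=100)         # stand-in impression labels cL (e.g. 1 = high emergency degree)

model = SVC(probability=True).fit(V_L, c_L)    # learned estimation model
# at estimation time the impression estimation device would apply, for example, model.predict(V.reshape(1, -1))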
  • <Effect>
  • With such a configuration, an effect similar to that of the first embodiment can be obtained.
  • <Modification 1>
  • The first embodiment and the second embodiment may be combined.
  • As illustrated with broken lines in FIG. 7, the impression estimation device 300 includes the second section segmentation unit 121, the second feature amount extraction unit 122 and the second feature amount vector conversion unit 123 in addition to the configuration of the second embodiment.
  • As illustrated with broken lines in FIG. 8, the impression estimation device 300 performs S121, S122 and S123 in addition to the processing in the second embodiment.
  • The connection unit 130 receives the feature amount vectors V1, V2 and V3 as the input, connects the feature amount vectors V1, V2 and V3, obtains a connected vector V=[V1,V2,V3] to be used for the emergency degree determination (S130), and outputs it.
  • Similarly, as illustrated in FIG. 9, the learning device 400 includes the second section segmentation unit 221, the second feature amount extraction unit 222 and the second feature amount vector conversion unit 223 in addition to the configuration of the second embodiment.
  • In addition, as illustrated in FIG. 10, the learning device 400 performs S221, S222 and S223 in addition to the processing in the second embodiment.
  • The connection unit 230 receives the feature amount vectors V1,L, V2,L and V3,L as the input, connects the feature amount vectors V1,L, V2,L and V3,L, obtains a connected vector VL=[V1,L,V2,L,V3,L] to be used for the emergency degree determination (S230), and outputs it.
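  • The connection itself is a plain concatenation of the three feature amount vectors; a minimal sketch (the vector sizes are assumptions) is:

import numpy as np

V1, V2, V3 = np.random.randn(40), np.random.randn(40), np.random.randn(240)   # stand-in feature amount vectors
V = np.concatenate([V1, V2, V3])   # connected vector V = [V1, V2, V3] used for the emergency degree determination
# the learning-side connected vector VL = [V1,L, V2,L, V3,L] is obtained in the same way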
  • <Effect>
  • With such a configuration, an estimation result with higher accuracy than that of the second embodiment can be obtained.
  • <Experimental Result>
  • FIG. 11 illustrates the results for the case without the second feature amount extraction unit, the case of the first embodiment, the case of the second embodiment, and the case of modification 1 of the second embodiment.
  • These results show that the long-time feature amounts introduced by the first embodiment and the second embodiment provide a greater effect than the case of using only the first feature amount.
  • <Modification 2>
  • Further, the first embodiment and the second embodiment may be used selectively according to the language.
  • For example, the impression estimation device receives language information indicating the kind of language as input, estimates the impression by the method of the first embodiment for a certain language A, and estimates it by the method of the second embodiment for another language B. Which embodiment gives the higher estimation accuracy is determined beforehand for each language, and the embodiment with the higher accuracy is selected according to the language information at the time of estimation. The language information may be estimated from the voice signal s(t) or may be input by a user.
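  • A minimal sketch of this selection follows; the language codes, the lookup table prepared beforehand and the two estimator callables are hypothetical names used only for illustration.

def estimate_impression(voice_signal, language_info, estimator_first, estimator_second):
    # which embodiment is more accurate was determined beforehand for each language (assumed table)
    better_embodiment = {"ja": "first", "en": "second"}
    if better_embodiment.get(language_info, "first") == "first":
        return estimator_first(voice_signal)    # first embodiment: short- and long-window feature amounts
    return estimator_second(voice_signal)       # second embodiment: long-time feature amount statistics

# usage: estimate_impression(s, "ja", first_embodiment_estimator, second_embodiment_estimator)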
  • <Other Modifications>
  • The present invention is not limited to the embodiments and modifications described above. For example, the various kinds of processing described above are not necessarily executed time-sequentially in the order described; they may be executed in parallel or individually depending on the processing capability of the device executing them, or as needed. In addition, appropriate changes are possible without departing from the purpose of the present invention.
  • <Program and Recording Medium>
  • The various kinds of processing described above can be executed by loading the program for executing the respective steps of the method described above into a recording unit 2020 of the computer illustrated in FIG. 12 and causing a control unit 2010, an input unit 2030, an output unit 2040 and the like to operate.
  • The program describing the processing content can be recorded in a computer-readable recording medium. Examples of the computer-readable recording medium include a magnetic recording device, an optical disk, a magneto-optical recording medium and a semiconductor memory.
  • In addition, the program is distributed by selling, assigning or lending a portable recording medium such as a DVD or a CD-ROM in which the program is recorded, for example. Further, the program may be distributed by storing the program in a storage of a server computer and transferring the program from the server computer to another computer via a network.
  • The computer that executes such a program first stores, for example, the program recorded in the portable recording medium or the program transferred from the server computer in its own storage. When executing the processing, the computer reads the program stored in its own recording medium and executes the processing according to the read program. As another execution form, the computer may read the program directly from the portable recording medium and execute the processing according to it; furthermore, every time the program is transferred from the server computer, the computer may successively execute the processing according to the received program. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service which achieves the processing function only through execution instructions and result acquisition, without transferring the program from the server computer to the computer. Note that the program in the present embodiment includes information which is provided for processing by an electronic computer and which is equivalent to a program (such as data which is not a direct command to the computer but has a property of stipulating the processing of the computer).
  • In addition, while the present device is configured in the present embodiment by executing a predetermined program on the computer, at least part of the processing content may be implemented by hardware.

Claims (22)

1. An impression estimation device comprising circuitry configured to execute a method comprising:
estimating an impression of a voice signal s by defining p1<p2 and using a first feature amount obtained based on a first analysis time length p1 for the voice signal s and a second feature amount obtained based on a second analysis time length p2 for the voice signal s.
2. The impression estimation device according to claim 1,
wherein the first feature amount is a feature amount regarding at least either of a vocal tract and a voice pitch and the second feature amount is a feature amount regarding a rhythm of voice.
3. The impression estimation device according to claim 1,
wherein the second feature amount is a statistic calculated for the second analysis time length based on the first feature amount.
4. A learning device comprising circuitry configured to execute a method comprising:
learning an estimation model which estimates an impression of a voice signal by defining p1<p2 and using a first feature amount for learning obtained based on a first analysis time length p1 for a voice signal for learning sL, a second feature amount for learning obtained based on a second analysis time length p2 for the voice signal for learning sL, and an impression label imparted to the voice signal for learning sL.
5. (canceled)
6. A learning method comprising
learning an estimation model which estimates an impression of a voice signal by defining p1<p2 and using a first feature amount for learning obtained based on a first analysis time length p1 for a voice signal for learning sL, a second feature amount for learning obtained based on a second analysis time length p2 for the voice signal for learning sL, and an impression label imparted to the voice signal for learning sL.
7. (canceled)
8. The impression estimation device according to claim 1, wherein the impression corresponds to emergency.
9. The impression estimation device according to claim 1, wherein the impression corresponds to non-emergency.
10. The impression estimation device according to claim 1, wherein the first feature amount indicates a vocal tract characteristic of a voice based on Mel-Frequency Cepstrum Coefficients.
11. The impression estimation device according to claim 1, wherein the estimating excludes recognizing speed of a voice associated with the voice signal s.
12. The learning device according to claim 4, wherein the first feature amount is a feature amount regarding at least either of a vocal tract and a voice pitch and the second feature amount is a feature amount regarding a rhythm of voice.
13. The learning device according to claim 4, wherein the second feature amount is a statistic calculated for the second analysis time length based on the first feature amount.
14. The learning device according to claim 4, wherein the impression corresponds to emergency.
15. The learning device according to claim 4, wherein the impression corresponds to non-emergency.
16. The learning device according to claim 4, wherein the first feature amount indicates a vocal tract characteristic of a voice based on Mel-Frequency Cepstrum Coefficients.
17. The learning device according to claim 4, wherein the learning an estimation model uses at least one of a Support Vector Machine, a Random Forest, or a neural network.
18. The learning method according to claim 6, wherein the first feature amount is a feature amount regarding at least either of a vocal tract and a voice pitch and the second feature amount is a feature amount regarding a rhythm of voice.
19. The learning method according to claim 6, wherein the second feature amount is a statistic calculated for the second analysis time length based on the first feature amount.
20. The learning method according to claim 6, wherein the impression corresponds to emergency.
21. The learning method according to claim 6, wherein the first feature amount indicates a vocal tract characteristic of a voice based on Mel-Frequency Cepstrum Coefficients.
22. The learning method according to claim 6, wherein the learning an estimation model uses at least one of a Support Vector Machine, a Random Forest, or a neural network.
US17/630,855 2019-07-29 2019-07-29 Impression estimation apparatus, learning apparatus, methods and programs for the same Pending US20220277761A1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2019/029666 WO2021019643A1 (en) 2019-07-29 2019-07-29 Impression inference device, learning device, and method and program therefor

Publications (1)

Publication Number Publication Date
US20220277761A1 true US20220277761A1 (en) 2022-09-01

Family

ID=74228380

Family Applications (1)

Application Number Title Priority Date Filing Date
US17/630,855 Pending US20220277761A1 (en) 2019-07-29 2019-07-29 Impression estimation apparatus, learning apparatus, methods and programs for the same

Country Status (3)

Country Link
US (1) US20220277761A1 (en)
JP (1) JPWO2021019643A1 (en)
WO (1) WO2021019643A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023119675A1 (en) * 2021-12-24 2023-06-29 日本電信電話株式会社 Estimation method, estimation device, and estimation program

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018180334A (en) * 2017-04-14 2018-11-15 岩崎通信機株式会社 Emotion recognition device, method and program
JP6982792B2 (en) * 2017-09-22 2021-12-17 株式会社村田製作所 Voice analysis system, voice analysis method, and voice analysis program
JP7000773B2 (en) * 2017-09-27 2022-01-19 富士通株式会社 Speech processing program, speech processing method and speech processing device
JP6856503B2 (en) * 2017-11-21 2021-04-07 日本電信電話株式会社 Impression estimation model learning device, impression estimation device, impression estimation model learning method, impression estimation method, and program
JP6996570B2 (en) * 2017-11-29 2022-01-17 日本電信電話株式会社 Urgency estimation device, urgency estimation method, program

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080046241A1 (en) * 2006-02-20 2008-02-21 Andrew Osburn Method and system for detecting speaker change in a voice transaction
US20090326947A1 (en) * 2008-06-27 2009-12-31 James Arnold System and method for spoken topic or criterion recognition in digital media and contextual advertising
US20110191101A1 (en) * 2008-08-05 2011-08-04 Christian Uhle Apparatus and Method for Processing an Audio Signal for Speech Enhancement Using a Feature Extraction
US20120089396A1 (en) * 2009-06-16 2012-04-12 University Of Florida Research Foundation, Inc. Apparatus and method for speech analysis
US20170147292A1 (en) * 2014-06-27 2017-05-25 Siemens Aktiengesellschaft System For Improved Parallelization Of Program Code
US20160045180A1 (en) * 2014-08-18 2016-02-18 Michael Kelm Computer-Aided Analysis of Medical Images
US20160294722A1 (en) * 2015-03-31 2016-10-06 Alcatel-Lucent Usa Inc. Method And Apparatus For Provisioning Resources Using Clustering
US20170069310A1 (en) * 2015-09-04 2017-03-09 Microsoft Technology Licensing, Llc Clustering user utterance intents with semantic parsing
US20170230844A1 (en) * 2016-02-10 2017-08-10 Samsung Electronics Co., Ltd FRAMEWORK FOR COMPREHENSIVE MONITORING AND LEARNING CONTEXT OF VoLTE CALL
US20190131016A1 (en) * 2016-04-01 2019-05-02 20/20 Genesystems Inc. Methods and compositions for aiding in distinguishing between benign and maligannt radiographically apparent pulmonary nodules
US20170372725A1 (en) * 2016-06-28 2017-12-28 Pindrop Security, Inc. System and method for cluster-based audio event detection
US10796715B1 (en) * 2016-09-01 2020-10-06 Arizona Board Of Regents On Behalf Of Arizona State University Speech analysis algorithmic system and method for objective evaluation and/or disease detection
US20180240538A1 (en) * 2017-02-18 2018-08-23 Mmodal Ip Llc Computer-Automated Scribe Tools
US20180247447A1 (en) * 2017-02-27 2018-08-30 Trimble Ab Enhanced three-dimensional point cloud rendering
US10529357B2 (en) * 2017-12-07 2020-01-07 Lena Foundation Systems and methods for automatic determination of infant cry and discrimination of cry from fussiness
US20190051299A1 (en) * 2018-06-25 2019-02-14 Intel Corporation Method and system of audio false keyphrase rejection using speaker recognition

Non-Patent Citations (12)

* Cited by examiner, † Cited by third party
Title
Bhargava, Mayank, et al., "Improving automatic emotion recognition from speech using rhythm and temporal feature." arXiv preprint arXiv:1303.1761 (2013), pp. 139-147 (Year: 2013) *
Carbonell, Kathy M., et al. "Discriminating simulated vocal tremor source using amplitude modulation spectra." Journal of Voice 29.2 (2015), pp. 140-147 (Year: 2015) *
Chetouani, Mohamed, et al. "Time-scale feature extractions for emotional speech characterization: applied to human centered interaction analysis." Cognitive Computation 1 (2009): pp. 194-201. (Year: 2009) *
Cummins, Nicholas, et al. "An image-based deep spectrum feature representation for the recognition of emotional speech." Proceedings of the 25th ACM international conference on Multimedia. 2017, pp. 478-484 (Year: 2017) *
Felcyn, Jan, et al. "Automatic differentiation between normal and disordered speech." Energy 3 (2015), pp. 1-5. (Year: 2015) *
Koolagudi, Shashidhar G., et al. "Emotion recognition from speech using sub-syllabic and pitch synchronous spectral features." International Journal of Speech Technology 15 (2012): pp. 495-511. (Year: 2012) *
Lefter, Iulia, et al. "Automatic Stress Detection in Emergency (Telephone) Calls," Int’l Journal of Intelligent Defence Support Systems (2011), pp. 1-20 (Year: 2011) *
Luengo, Iker, et al. "Feature analysis and evaluation for automatic emotion identification in speech." IEEE Transactions on Multimedia 12.6 (2010): pp. 490-501 (Year: 2010) *
Martinez, David, et al. "Prosodic features and formant modeling for an ivector-based language recognition system." 2013 IEEE international conference on acoustics, speech and signal processing. IEEE, 2013, pp. 6847-6851 (Year: 2013) *
Nalini, N., et al. "Speech emotion recognition using MFCC and AANN." Proc. International Conference on Engineering and Technology (2013), pp. 223-227 (Year: 2013) *
Palo, Hemanta Kumar, et al. "Comparative analysis of neural networks for speech emotion recognition." Int. J. Eng. Technol 7.4 (2018), pp. 111-126 (Year: 2018) *
Yadav, Jainath, et al. "Emotion recognition using LP residual at sub-segmental, segmental and supra-segmental levels." 2015 International Conference on Communication, Information & Computing Technology (ICCICT). IEEE, 2015, pp. 1-6 (Year: 2015) *

Also Published As

Publication number Publication date
JPWO2021019643A1 (en) 2021-02-04
WO2021019643A1 (en) 2021-02-04


Legal Events

Date Code Title Description
AS Assignment. Owner name: NIPPON TELEGRAPH AND TELEPHONE CORPORATION, JAPAN. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:KAMIYAMA, HOSANA;ANDO, ATSUSHI;KOBASHIKAWA, SATOSHI;REEL/FRAME:058799/0194. Effective date: 20210129
STPP Information on status: patent application and granting procedure in general. Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION
STPP Information on status: patent application and granting procedure in general. Free format text: NON FINAL ACTION MAILED
STPP Information on status: patent application and granting procedure in general. Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER
STPP Information on status: patent application and granting procedure in general. Free format text: FINAL REJECTION MAILED
STPP Information on status: patent application and granting procedure in general. Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION
STPP Information on status: patent application and granting procedure in general. Free format text: NON FINAL ACTION MAILED
STPP Information on status: patent application and granting procedure in general. Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER
STPP Information on status: patent application and granting procedure in general. Free format text: FINAL REJECTION MAILED