WO2022239142A1 - Voice recognition device and voice recognition method

Voice recognition device and voice recognition method

Info

Publication number: WO2022239142A1
Authority: WO (WIPO (PCT))
Prior art keywords: conversation, speaker, degree, data, unit
Application number: PCT/JP2021/018019
Other languages: French (fr), Japanese (ja)
Inventor: Narumi Hosokawa (細川 なるみ)
Original Assignee: Mitsubishi Electric Corporation (三菱電機株式会社)
Application filed by Mitsubishi Electric Corporation
Priority to PCT/JP2021/018019
Publication of WO2022239142A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems

Definitions

  • The present disclosure relates to a speech recognition device and a speech recognition method.
  • Among voice recognition devices that recognize a user's utterances, some start voice recognition of the user's utterances only when the user utters a wake-up word (hereinafter referred to as a "conventional voice recognition device").
  • A wake-up word is a word that instructs the device to start speech recognition.
  • Because a conventional speech recognition apparatus starts speech recognition only when a user utters a wake-up word, it can avoid a situation in which speech recognition is erroneously started when the user utters a word other than the wake-up word.
  • However, since the user has to utter a wake-up word every time the conventional speech recognition apparatus starts speech recognition, the user may find this annoying.
  • Patent Document 1 discloses a speech recognition device that starts speech recognition of a user's utterance even if the user does not utter a wake-up word.
  • In the speech recognition device of Patent Document 1, a response determination unit calculates a likelihood score, which is an index for determining whether or not the user's utterance is directed to the speech recognition device, based on the context of the user's series of utterances. In addition, if the ID of the user who spoke last time differs from the ID of the user who is speaking this time, the likelihood score is lowered. The response determination unit determines that the user's utterance is directed to the speech recognition device if the likelihood score is greater than or equal to a threshold.
  • Thus, the speech recognition device disclosed in Patent Document 1 lowers the likelihood score whenever the ID of the user who spoke last time differs from the ID of the user who is speaking this time.
  • In practice, however, another user sometimes speaks immediately after one user speaks, and sometimes a long time passes before another user speaks after one user speaks. Lowering the likelihood score uniformly in both cases is not necessarily appropriate. With this speech recognition device it is therefore difficult to distinguish whether a user's utterance is directed to the speech recognition device or is part of a conversation between two users, and an utterance directed to the speech recognition device may be misidentified as a conversation between users.
  • The present disclosure has been made to solve the above problem, and an object thereof is to obtain a speech recognition device and a speech recognition method that can reduce the probability of misidentifying an utterance directed to the device as a conversation between users, compared with the speech recognition device disclosed in Patent Document 1.
  • A speech recognition device according to the present disclosure includes: a speaker identification unit that identifies, based on an image of a space captured by a camera or sound in the space collected by a microphone, the speaker, i.e., the user who is speaking, from among a plurality of users present in the space; and a response unit that acquires conversation degree data indicating the degree of past conversation between the speaker identified by the speaker identification unit and the users other than the speaker, determines, based on the conversation degree data, whether or not the speaker's utterance is directed to the speech recognition device, and generates response data for the speaker's utterance only when the utterance is determined to be directed to the speech recognition device.
  • FIG. 1 is a configuration diagram showing a speech recognition device 5 according to Embodiment 1.
  • FIG. 2 is a hardware configuration diagram showing hardware of the speech recognition device 5 according to Embodiment 1.
  • FIG. 3 is a hardware configuration diagram of a computer in which the speech recognition device 5 is implemented by software, firmware, or the like.
  • FIG. 4 is a flowchart showing a speech recognition method, which is the processing procedure of the speech recognition device 5 shown in FIG. 1.
  • FIG. 5A is a flowchart showing a processing procedure of the conversation degree update unit 15 in the speech recognition device 5 shown in FIG. 1.
  • FIG. 5B is a flowchart showing another processing procedure of the conversation degree update unit 15 in the speech recognition device 5 shown in FIG. 1.
  • FIG. 6 is a configuration diagram showing a speech recognition device 5 according to Embodiment 2.
  • FIG. 7 is a hardware configuration diagram showing hardware of the speech recognition device 5 according to Embodiment 2.
  • FIG. 8 is a configuration diagram showing a speech recognition device 5 according to Embodiment 3.
  • FIG. 9 is a hardware configuration diagram showing hardware of the speech recognition device 5 according to Embodiment 3.
  • Embodiment 1.
  • FIG. 1 is a configuration diagram showing the speech recognition device 5 according to Embodiment 1. FIG. 2 is a hardware configuration diagram showing hardware of the speech recognition device 5 according to Embodiment 1.
  • The camera 1 is implemented by, for example, an infrared camera, a visible light camera, or an ultraviolet camera. The camera 1 captures an image of the space and outputs video data representing the image of the space to the speech recognition device 5.
  • The microphone 2 collects sounds in the space and outputs sound data representing the sounds in the space to the speech recognition device 5.
  • Embodiment 1 will be described assuming that the space is the passenger compartment of a vehicle. Accordingly, the multiple users present in the space are the vehicle's occupants. However, this is only an example; the space may be, for example, a room within a building. When the space is a room in a building, the users present in the space are residents of the room, guests, or the like.
  • The in-vehicle sensor 3 is implemented by, for example, pressure sensors installed in each of a plurality of seats, an infrared sensor installed in the vehicle, a GPS (Global Positioning System) sensor installed in the vehicle, or a gyro sensor installed in the vehicle. When the in-vehicle sensor 3 is implemented by a plurality of pressure sensors, for example, the pressure sensor that senses the weight of an occupant among the plurality of pressure sensors outputs a sensing signal to the voice recognition device 5. When the in-vehicle sensor 3 is implemented by, for example, a GPS sensor, it outputs traveling position data indicating the position where the vehicle is traveling to the voice recognition device 5.
  • The navigation device 4 is an in-vehicle device installed in the vehicle, or a device such as a smartphone brought into the vehicle by a user. The navigation device 4 has, for example, a navigation function of guiding a route to a destination, and outputs, for example, route data indicating the route to the destination to the voice recognition device 5.
  • The speech recognition device 5 includes a speaker identification unit 11, a speaker presence/absence determination unit 14, a conversation degree update unit 15, and a response unit 18.
  • The voice recognition device 5 recognizes the voice of an occupant, who is a user, and determines whether or not the occupant's utterance is directed to the voice recognition device 5.
  • The speech recognition device 5 generates response data for the speaker's utterance only when it determines that the utterance is directed to the speech recognition device 5.
  • The voice recognition device 5 outputs the response data to the in-vehicle device 6 and the output device 7.
  • The in-vehicle device 6 is, for example, a navigation device, an air conditioner, or an audio device, and operates according to the response data output from the speech recognition device 5.
  • The output device 7 is, for example, a display, a lighting device, or a speaker, and operates according to the response data output from the speech recognition device 5.
  • The speaker identification unit 11 is implemented by, for example, a speaker identification circuit 31 shown in FIG. 2.
  • The speaker identification unit 11 includes an occupant identification unit 12 and a speaker identification processing unit 13.
  • The speaker identification unit 11 identifies the speaker, i.e., the occupant who is speaking, from among the plurality of occupants present in the vehicle, based on the image of the vehicle interior (the space) captured by the camera 1 or the sound of the vehicle interior collected by the microphone 2.
  • The speaker identification unit 11 outputs speaker data indicating the identified speaker to the speaker presence/absence determination unit 14.
  • The occupant identification unit 12 acquires a sensing signal from any pressure sensor, among the plurality of pressure sensors included in the in-vehicle sensor 3, that senses the weight of an occupant. By acquiring the sensing signal, the occupant identification unit 12 identifies which pressure sensor is outputting the sensing signal and determines that an occupant is sitting in the seat where the identified pressure sensor is installed. The occupant identification unit 12 also acquires the video data output from the camera 1 and, from the video of the vehicle interior indicated by the video data, cuts out an image showing the face of each occupant (hereinafter referred to as a "face image") from the video of the area including the seat on which that occupant is seated.
  • The occupant identification unit 12 analyzes each face image to perform personal authentication of each occupant, and outputs identification information of each occupant to the speaker identification processing unit 13.
  • The occupant identification unit 12 also outputs the face image of each occupant to the speaker identification processing unit 13.
  • The occupant identification unit 12 may also identify the position of the seat on which each occupant sits based on the video data output from the camera 1, as will be described later.
  • Here, the occupant identification unit 12 performs personal authentication of each occupant based on the video of the vehicle interior.
  • However, this is only an example; the occupant identification unit 12 may instead perform personal authentication of each occupant based on the sound in the vehicle indicated by the sound data output from the microphone 2. That is, the occupant identification unit 12 may extract the voice of each occupant from the sound inside the vehicle indicated by the sound data and perform voiceprint authentication of each voice, thereby performing personal authentication of each occupant.
  • Besides the occupants' voices, the sounds in the cabin may include the running sound of the vehicle, the sound of cold or warm air discharged from the air conditioner, noise from outside the vehicle, or the sound of music being played by the audio equipment.
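  • The following is a minimal Python sketch of this occupant identification step (occupant identification unit 12). The helper functions authenticate_face and authenticate_voiceprint are hypothetical placeholders; the patent names the cues (pressure sensors, face images, voiceprints) but not any concrete API.

    def identify_occupants(pressure_signals, face_images, voices=None):
        """Return one occupant record per seat whose pressure sensor fired.

        pressure_signals: dict seat name -> bool (sensor sensed weight)
        face_images:      dict seat name -> cropped face image
        voices:           dict seat name -> voice sample (fallback when all occupants speak)
        """
        occupants = []
        for seat, occupied in pressure_signals.items():
            if not occupied:
                continue
            if seat in face_images:
                occ_id = authenticate_face(face_images[seat])        # hypothetical helper
            elif voices and seat in voices:
                occ_id = authenticate_voiceprint(voices[seat])       # hypothetical helper
            else:
                continue
            occupants.append({"id": occ_id, "seat": seat})
        return occupants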
  • The speaker identification processing unit 13 acquires the identification information of each occupant and the face image of each occupant from the occupant identification unit 12.
  • The speaker identification processing unit 13 identifies the occupant whose mouth is moving by analyzing each face image.
  • The speaker identification processing unit 13 regards the occupant whose mouth is moving as the speaker, and outputs the identification information of the speaker to each of the speaker presence/absence determination unit 14, the conversation degree update processing unit 17, and the response propriety determination unit 19.
  • Here, the speaker identification processing unit 13 identifies the speaker based on the video of the vehicle interior.
  • However, this is only an example; the speaker identification processing unit 13 may instead identify the speaker based on the sound inside the vehicle collected by the microphone 2. That is, when a microphone 2 is installed at each seat, for example, the speaker identification processing unit 13 may identify as the speaker the occupant sitting in the seat whose microphone 2 collects the loudest voice among the plurality of microphones 2.
  • Alternatively, the speaker identification processing unit 13 may identify the speaker from the direction of arrival of the voice at the microphone 2. In these cases, the speaker identification processing unit 13 outputs sound data representing the voice of the identified speaker to the speaker presence/absence determination unit 14.
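  • A minimal sketch of this speaker identification (speaker identification processing unit 13), assuming a hypothetical mouth_is_moving image-analysis helper and occupant records shaped as in the previous sketch:

    def identify_speakers(occupants, face_images, mic_levels=None):
        """Return the occupants currently speaking (possibly several)."""
        speakers = [o for o in occupants
                    if o["seat"] in face_images
                    and mouth_is_moving(face_images[o["seat"]])]   # hypothetical helper
        if not speakers and mic_levels:
            # Fallback: attribute the utterance to the seat whose microphone
            # collects the loudest voice.
            loudest_seat = max(mic_levels, key=mic_levels.get)
            speakers = [o for o in occupants if o["seat"] == loudest_seat]
        return speakers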
  • The speaker presence/absence determination unit 14 is implemented by, for example, a speaker presence/absence determination circuit 32 shown in FIG. 2. Based on the number of speakers identified by the speaker identification unit 11, the speaker presence/absence determination unit 14 determines whether, among the plurality of occupants, there is a talker, i.e., an occupant who is conversing with the speaker identified by the speaker identification unit 11. That is, when the speaker identification processing unit 13 has acquired identification information for each of a plurality of speakers, the speaker presence/absence determination unit 14 determines that a talker exists. The speaker presence/absence determination unit 14 outputs a determination result indicating whether or not there is a talker conversing with the speaker to the conversation degree update unit 15.
  • The conversation degree update unit 15 is implemented by, for example, a conversation degree update circuit 33 shown in FIG. 2.
  • The conversation degree update unit 15 includes a conversation degree data storage unit 16 and a conversation degree update processing unit 17.
  • The conversation degree update unit 15 acquires conversation degree data indicating the speaker's degree of past conversation.
  • Here, the conversation degree update unit 15 acquires the conversation degree data from the internal conversation degree data storage unit 16. However, the conversation degree update unit 15 may instead acquire the conversation degree data from outside the speech recognition device 5.
  • The conversation degree update unit 15 updates the acquired conversation degree data so as to raise the degree of conversation when the speaker presence/absence determination unit 14 determines that a talker exists, and updates the acquired conversation degree data so as to lower the degree of conversation when the speaker presence/absence determination unit 14 determines that no talker exists.
  • The degree of conversation is, for example, the frequency of conversations between the speaker and a fellow passenger, the number of conversations, or the conversation time.
  • The conversation degree data storage unit 16 is a storage medium that stores conversation degree data indicating the degree of past conversation between a speaker and the passengers other than the speaker among the plurality of passengers present in the vehicle.
  • Since each passenger can be a speaker, past conversation degree data is stored for each passenger. That is, if there are, for example, two passengers in the vehicle, the conversation degree data storage unit 16 stores two pieces of conversation degree data; if there are, for example, three passengers in the vehicle, the conversation degree data storage unit 16 stores three pieces of conversation degree data.
  • The conversation degree update processing unit 17 acquires, from among the conversation degree data stored in the conversation degree data storage unit 16, the conversation degree data for the speaker indicated by the identification information output from the speaker identification processing unit 13. If the determination result output from the speaker presence/absence determination unit 14 indicates that a talker exists, the conversation degree update processing unit 17 updates the acquired conversation degree data so as to raise the degree of conversation. If the determination result output from the speaker presence/absence determination unit 14 indicates that no talker exists, the conversation degree update processing unit 17 updates the acquired conversation degree data so as to lower the degree of conversation. The conversation degree update processing unit 17 then stores the updated conversation degree data in the conversation degree data storage unit 16.
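  • A minimal sketch of this update logic follows. The initial value 1 and the degree change value CH = 0.1 are taken from the values given later in the description; the function shape itself is an illustrative assumption.

    CH = 0.1           # degree change value
    K_INITIAL = 1.0    # initial value of the degree of conversation K

    conversation_degrees = {}  # speaker identification information -> degree of conversation K

    def update_conversation_degree(speaker_id, talker_exists):
        k = conversation_degrees.get(speaker_id, K_INITIAL)
        if talker_exists:
            k = k + CH   # equation (1): raise the degree of conversation
        else:
            k = k - CH   # lower the degree of conversation
        conversation_degrees[speaker_id] = k
        return k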
  • The response unit 18 is implemented by, for example, a response circuit 34 shown in FIG. 2.
  • The response unit 18 includes a response propriety determination unit 19, a voice recognition unit 20, and a response data generation unit 21.
  • The response unit 18 acquires conversation degree data indicating the degree of past conversation between the speaker identified by the speaker identification unit 11 and the passengers other than the speaker among the plurality of passengers present in the vehicle.
  • The response unit 18 determines, based on the acquired conversation degree data, whether or not the speaker's utterance is directed to the speech recognition device 5.
  • The response unit 18 generates response data for the speaker's utterance only when it determines that the utterance is directed to the speech recognition device 5.
  • The response propriety determination unit 19 acquires, from among the conversation degree data stored in the conversation degree data storage unit 16, the conversation degree data for the speaker indicated by the identification information output from the speaker identification processing unit 13.
  • The response propriety determination unit 19 calculates a degree of response, which is an index for determining whether or not the speaker's utterance is directed to the speech recognition device 5, based on the video of the vehicle interior (the space) captured by the camera 1, the sound of the vehicle interior collected by the microphone 2, the traveling position data output from the in-vehicle sensor 3, and the route data output from the navigation device 4.
  • The response propriety determination unit 19 corrects the degree of response based on the degree of conversation indicated by the acquired conversation degree data.
  • The corrected degree of response decreases as the degree of conversation indicated by the conversation degree data increases. If the corrected degree of response is equal to or greater than a first threshold, the response propriety determination unit 19 determines that the speaker's utterance is directed to the speech recognition device 5. If the corrected degree of response is less than the first threshold, the response propriety determination unit 19 determines that the speaker's utterance is not directed to the speech recognition device 5.
  • The first threshold may be stored in the internal memory of the response propriety determination unit 19, or may be given from outside the speech recognition device 5.
  • Here, the response propriety determination unit 19 determines that the speaker's utterance is directed to the speech recognition device 5 if the corrected degree of response is equal to or greater than the first threshold. However, this is only an example; the response propriety determination unit 19 may instead determine that the speaker's utterance is directed to the speech recognition device 5 if the degree of conversation indicated by the acquired conversation degree data is equal to or less than a second threshold, and that the speaker's utterance is not directed to the speech recognition device 5 if the degree of conversation is greater than the second threshold.
  • The second threshold may be stored in the internal memory of the response propriety determination unit 19, or may be given from outside the speech recognition device 5.
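  • A minimal sketch of this determination (response propriety determination unit 19). The correction P' = P / K follows equation (3) given later in the description; the threshold values here are illustrative assumptions, since the patent leaves them open.

    FIRST_THRESHOLD = 0.5    # illustrative value
    SECOND_THRESHOLD = 2.0   # illustrative value for the alternative rule

    def utterance_is_for_device(p, k, use_conversation_rule=False):
        """Return True if the speaker's utterance is judged to be directed
        to the speech recognition device.

        p: degree of response P computed from video, sound, position, and route data
        k: degree of conversation K for the identified speaker (starts at 1,
           kept within limits, so the division is well defined)
        """
        if use_conversation_rule:
            # Alternative rule: decide on the degree of conversation alone.
            return k <= SECOND_THRESHOLD
        p_corrected = p / k          # equation (3): P' = P / K
        return p_corrected >= FIRST_THRESHOLD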
  • The voice recognition unit 20 performs voice recognition on the sound in the vehicle interior collected by the microphone 2, and outputs voice recognition result data indicating the voice recognition result to the response data generation unit 21.
  • The voice recognition result data may be text data or voice data. If the response propriety determination unit 19 determines that the speaker's utterance is directed to the voice recognition device 5, the response data generation unit 21 generates response data for the voice recognition result data output from the voice recognition unit 20. The response data generation unit 21 outputs the response data to the in-vehicle device 6 and the output device 7.
  • The speech recognition device 5 shown in FIG. 1 includes the speaker identification unit 11, the speaker presence/absence determination unit 14, the conversation degree update unit 15, and the response unit 18 as its components. However, any of these components may be distributed to a server device connected to a network, a mobile terminal, or the like. If any component is distributed to a server device or the like, the speech recognition device 5 must be equipped with a transmitter/receiver that transmits the data given to that component to the server device or the like, and receives the data output from that component.
  • Here, it is assumed that each of the speaker identification unit 11, the speaker presence/absence determination unit 14, the conversation degree update unit 15, and the response unit 18, which are the components of the speech recognition device 5, is implemented by dedicated hardware as shown in FIG. 2. That is, it is assumed that the speech recognition device 5 is implemented by the speaker identification circuit 31, the speaker presence/absence determination circuit 32, the conversation degree update circuit 33, and the response circuit 34.
  • Each of the speaker identification circuit 31, the speaker presence/absence determination circuit 32, the conversation degree update circuit 33, and the response circuit 34 may be, for example, a single circuit, a composite circuit, a programmed processor, a parallel-programmed processor, an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a combination thereof.
  • The components of the speech recognition device 5 are not limited to being implemented by dedicated hardware; the speech recognition device 5 may also be implemented by software, firmware, or a combination of software and firmware.
  • Software or firmware is stored as a program in the memory of a computer.
  • A computer here means hardware that executes a program, for example, a CPU (Central Processing Unit), a central processing device, a processing device, an arithmetic device, a microprocessor, a microcomputer, a processor, or a DSP (Digital Signal Processor).
  • FIG. 3 is a hardware configuration diagram of a computer when the speech recognition device 5 is implemented by software, firmware, or the like.
  • When the speech recognition device 5 is implemented by software, firmware, or the like, a program for causing a computer to execute the respective processing procedures of the speaker identification unit 11, the speaker presence/absence determination unit 14, the conversation degree update unit 15, and the response unit 18 is stored in a memory 41. A processor 42 of the computer then executes the program stored in the memory 41.
  • FIG. 2 shows an example in which each component of the speech recognition device 5 is implemented by dedicated hardware, and FIG. 3 shows an example in which the speech recognition device 5 is implemented by software, firmware, or the like. However, this is only an example; some components of the speech recognition device 5 may be implemented by dedicated hardware while the remaining components are implemented by software, firmware, or the like.
  • FIG. 4 is a flowchart showing the speech recognition method, which is the processing procedure of the speech recognition device 5 shown in FIG. 1. FIGS. 5A and 5B are flowcharts showing the processing procedure of the conversation degree update unit 15 in the speech recognition device 5 shown in FIG. 1.
  • The camera 1 captures the interior of the vehicle, and outputs video data showing the image of the vehicle interior to the occupant identification unit 12, the speaker presence/absence determination unit 14, and the response propriety determination unit 19.
  • The microphone 2 collects sounds in the vehicle interior, and outputs sound data representing those sounds to the occupant identification unit 12, the speaker presence/absence determination unit 14, the response propriety determination unit 19, and the voice recognition unit 20.
  • The in-vehicle sensor 3 includes a pressure sensor, an infrared sensor, a GPS sensor, a gyro sensor, or the like, and outputs sensor information indicating sensing results to the occupant identification unit 12 and the response propriety determination unit 19.
  • The navigation device 4 outputs setting data indicating the destination, route data indicating the route to the destination, voice guidance information, and the like to the response propriety determination unit 19.
  • The occupant identification unit 12 acquires a sensing signal from any pressure sensor, among the plurality of pressure sensors included in the in-vehicle sensor 3, that senses the weight of an occupant. By acquiring the sensing signal, the occupant identification unit 12 identifies which pressure sensor is outputting the sensing signal, and determines that an occupant is sitting in the seat where the identified pressure sensor is installed. If the identified pressure sensor is installed in, for example, the driver's seat, the occupant identification unit 12 determines that an occupant is sitting in the driver's seat. If the identified pressure sensor is installed in, for example, the front passenger seat, the occupant identification unit 12 determines that an occupant is sitting in the front passenger seat.
  • The occupant identification unit 12 acquires the video data output from the camera 1.
  • The occupant identification unit 12 cuts out a face image, i.e., an image showing the face of each occupant, from the video of the area including the seat on which that occupant is seated, within the video of the vehicle interior indicated by the video data.
  • The occupant identification unit 12 analyzes each face image to perform personal authentication of each occupant, and outputs identification information of each occupant to the speaker identification processing unit 13 (step ST1 in FIG. 4).
  • The occupant identification unit 12 also outputs the face image of each occupant to the speaker identification processing unit 13.
  • Here, the occupant identification unit 12 performs personal authentication of each occupant based on the video of the vehicle interior. However, this is only an example; if all the occupants present in the vehicle are speaking, the occupant identification unit 12 may perform personal authentication of each occupant based on the sound in the vehicle indicated by the sound data output from the microphone 2. That is, the occupant identification unit 12 may extract the voice of each occupant from the sound in the vehicle interior indicated by the sound data, and perform voiceprint authentication of each voice, thereby performing personal authentication of each occupant.
  • The speaker identification processing unit 13 acquires the identification information of each occupant and the face image of each occupant from the occupant identification unit 12.
  • The speaker identification processing unit 13 searches for the occupant whose mouth is moving by analyzing each face image, and identifies the occupant whose mouth is moving as the speaker (step ST2 in FIG. 4).
  • Here, the speaker identification processing unit 13 identifies the speaker based on the video of the vehicle interior. However, this is only an example; the speaker identification processing unit 13 may instead identify the speaker based on the sound inside the vehicle indicated by the sound data output from the microphone 2. That is, when a microphone 2 is installed at each seat, for example, the speaker identification processing unit 13 may identify as the speaker the occupant sitting in the seat whose microphone 2 collects the loudest voice among the plurality of microphones 2.
  • After identifying the speaker, the speaker identification processing unit 13 outputs the identification information of the speaker to the speaker presence/absence determination unit 14, the conversation degree update processing unit 17, and the response propriety determination unit 19. If the speaker identification processing unit 13 identifies a plurality of speakers, it outputs the identification information of each of those speakers to the speaker presence/absence determination unit 14, the conversation degree update processing unit 17, and the response propriety determination unit 19.
  • If a plurality of pieces of identification information are output from the speaker identification processing unit 13 as speaker identification information (step ST3 in FIG. 4: YES), the speaker presence/absence determination unit 14 determines that there is a talker conversing with the speaker (step ST4 in FIG. 4). If only one piece of identification information is output from the speaker identification processing unit 13 as speaker identification information (step ST3 in FIG. 4: NO), the speaker presence/absence determination unit 14 determines that there is no talker conversing with the speaker (step ST5 in FIG. 4). The speaker presence/absence determination unit 14 outputs a determination result indicating whether or not there is a talker conversing with the speaker to the conversation degree update processing unit 17.
  • The conversation degree update processing unit 17 acquires the speaker's identification information from the speaker identification processing unit 13, and acquires the determination result indicating whether or not a talker exists from the speaker presence/absence determination unit 14.
  • The conversation degree update processing unit 17 acquires the past conversation degree data for the speaker indicated by the acquired identification information from the past conversation degree data stored in the conversation degree data storage unit 16. If no past conversation degree data for the speaker indicated by the identification information is stored in the conversation degree data storage unit 16, the conversation degree update processing unit 17 initializes the degree of conversation K indicated by the conversation degree data for the speaker.
  • The initial value of the degree of conversation K is, for example, 1.
  • If the acquired determination result indicates that a talker exists, the conversation degree update processing unit 17 updates the acquired conversation degree data so as to raise the degree of conversation K (step ST6 in FIG. 4).
  • If the acquired determination result indicates that no talker exists, the conversation degree update processing unit 17 updates the acquired conversation degree data so as to lower the degree of conversation K (step ST7 in FIG. 4).
  • The conversation degree update processing unit 17 stores the updated conversation degree data in the conversation degree data storage unit 16.
  • When the conversation degree update processing unit 17 acquires the identification information of the speaker from the speaker identification processing unit 13, it causes two counters (1) and (2) (not shown) to start counting (step ST21 in FIG. 5A, step ST31 in FIG. 5B). The counters (1) and (2) may be included in the conversation degree update processing unit 17, or may be provided outside the conversation degree update processing unit 17.
  • The internal memory of the conversation degree update processing unit 17 stores a first set value and a second set value, where first set value < second set value.
  • If the conversation degree update processing unit 17 receives a determination result indicating that a talker exists before the count value of the counter (1) reaches the first set value (step ST22 in FIG. 5A: YES), it updates the conversation degree data so as to raise the degree of conversation K (step ST23 in FIG. 5A). That is, the conversation degree update processing unit 17 raises the degree of conversation K by adding a degree change value CH to the current degree of conversation K, for example, as shown in the following equation (1).
  • The degree change value CH is a preset value, such as 0.1.
  • The degree change value CH may be stored in the internal memory of the conversation degree update processing unit 17, or may be given from outside the speech recognition device 5.

    Degree of conversation K after update = current degree of conversation K + degree change value CH  (1)
  • If the conversation degree update processing unit 17 does not receive a determination result indicating that a talker exists before the count value of the counter (1) reaches the first set value (step ST22 in FIG. 5A: NO), it does not update the conversation degree data.
  • The conversation degree update processing unit 17 then resets the count value of the counter (1) to zero (step ST24 in FIG. 5A).
  • If the conversation degree update processing unit 17 receives a determination result indicating that a talker exists before the count value of the counter (2) reaches the second set value (step ST32 in FIG. 5B: NO), it does not update the conversation degree data. If no such determination result is received by the time the count value of the counter (2) reaches the second set value (step ST32 in FIG. 5B: YES), the conversation degree update processing unit 17 updates the conversation degree data so as to lower the degree of conversation K (step ST33 in FIG. 5B).
  • The conversation degree update processing unit 17 then resets the count value of the counter (2) to zero (step ST34 in FIG. 5B).
  • Even when the conversation degree update processing unit 17 would update the conversation degree data so as to raise the degree of conversation K, if the current degree of conversation K has already reached an upper limit value, the conversation degree update processing unit 17 may refrain from updating the conversation degree data.
  • Likewise, even when the count value of the counter (2) reaches the second set value without a determination result indicating that a talker exists having been received, if the current degree of conversation K has already reached a lower limit value, the conversation degree update processing unit 17 may refrain from updating the conversation degree data. This prevents the degree of conversation K from becoming too small, which could cause the corrected degree of response P′ to reach the first threshold even though a conversation partner has existed for a long period.
  • Each of the upper limit value and the lower limit value may be stored in the internal memory of the conversation degree update processing unit 17, or may be given from outside the speech recognition device 5.
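  • A minimal sketch of the counter-based update in FIGS. 5A and 5B, written as a single polling step. The counter behavior, the set values, and the limits follow the description above; the predicate talker_seen_within is a hypothetical helper, and the concrete values are illustrative assumptions.

    FIRST_SET_VALUE = 5.0     # illustrative; first set value < second set value
    SECOND_SET_VALUE = 30.0   # illustrative
    K_UPPER, K_LOWER = 5.0, 0.2  # illustrative upper/lower limits on K
    CH = 0.1                  # degree change value

    def counter_based_update(talker_seen_within, k):
        """talker_seen_within(t): hypothetical predicate, True if a
        talker-exists determination arrived within the last t counts."""
        if talker_seen_within(FIRST_SET_VALUE):
            # FIG. 5A: talker observed before counter (1) expires -> raise K,
            # unless K has already reached its upper limit.
            if k < K_UPPER:
                k += CH
            # counter (1) is then reset to zero (step ST24)
        elif not talker_seen_within(SECOND_SET_VALUE):
            # FIG. 5B: counter (2) expired with no talker -> lower K,
            # unless K has already reached its lower limit.
            if k > K_LOWER:
                k -= CH
            # counter (2) is then reset to zero (step ST34)
        return k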
  • The response propriety determination unit 19 acquires the speaker's identification information from the speaker identification processing unit 13.
  • The response propriety determination unit 19 acquires the conversation degree data of the speaker indicated by the acquired identification information from the conversation degree data stored in the conversation degree data storage unit 16.
  • Here, the response propriety determination unit 19 acquires the conversation degree data after it has been updated by the conversation degree update processing unit 17.
  • However, it suffices that the conversation degree data acquired by the response propriety determination unit 19 is conversation degree data for the speaker; the conversation degree data before being updated by the conversation degree update processing unit 17 may be acquired instead.
  • The response propriety determination unit 19 calculates the degree of response P, which is an index for determining whether or not the speaker's utterance is directed to the voice recognition device 5, based on the video of the vehicle interior (the space) captured by the camera 1, the sound of the vehicle interior collected by the microphone 2, the traveling position data output from the in-vehicle sensor 3, the voice guidance information output from the navigation device 4, and the like (step ST8 in FIG. 4). The response propriety determination unit 19 then corrects the degree of response P by dividing the degree of response P by the degree of conversation K indicated by the conversation degree data, as shown in the following equation (3) (step ST9 in FIG. 4).

    P′ = P / K  (3)

  • P′ is the corrected degree of response, and decreases as the degree of conversation K increases.
  • Any calculation process may be used for the degree of response P itself, for example, the calculation process disclosed in the above-mentioned Patent Document 1.
  • In Patent Document 1, a likelihood score corresponding to the degree of response P is calculated based on the context of a series of the user's utterances.
  • For example, the response propriety determination unit 19 calculates a large degree of response P when sound is collected by the microphone 2 within a certain period after voice guidance information is output from the navigation device 4, and calculates a degree of response P smaller than the above otherwise.
  • Also, the response propriety determination unit 19 calculates a small degree of response P when sounds are continuously collected by the microphone 2, and calculates a larger degree of response P when sounds are not continuously collected.
  • The response propriety determination unit 19 compares the corrected degree of response P′ with the first threshold. If the corrected degree of response P′ is equal to or greater than the first threshold (step ST10 in FIG. 4: YES), the response propriety determination unit 19 determines that the speaker's utterance is directed to the speech recognition device 5 (step ST11 in FIG. 4). If the corrected degree of response P′ is less than the first threshold (step ST10 in FIG. 4: NO), the response propriety determination unit 19 determines that the speaker's utterance is not directed to the speech recognition device 5 (step ST12 in FIG. 4).
  • Here, the response propriety determination unit 19 determines that the speaker's utterance is directed to the speech recognition device 5 if the corrected degree of response P′ is equal to or greater than the first threshold. However, this is only an example; the response propriety determination unit 19 may instead determine that the speaker's utterance is directed to the speech recognition device 5 if the degree of conversation K indicated by the acquired conversation degree data is equal to or less than the second threshold, and that the speaker's utterance is not directed to the speech recognition device 5 if the degree of conversation K is greater than the second threshold.
  • The voice recognition unit 20 acquires the sound data output from the microphone 2.
  • The voice recognition unit 20 performs voice recognition on the sound indicated by the acquired sound data, and outputs voice recognition result data indicating the voice recognition result to the response data generation unit 21. If the response propriety determination unit 19 determines that the speaker's utterance is directed to the voice recognition device 5, the response data generation unit 21 generates response data for the voice recognition result data output from the voice recognition unit 20 (step ST13 in FIG. 4). If the response propriety determination unit 19 determines that the speaker's utterance is not directed to the voice recognition device 5, the response data generation unit 21 does not generate response data for the voice recognition result data.
  • Since the process of generating response data for voice recognition result data is itself a known technique, a detailed description thereof is omitted.
  • If the voice recognition result data is, for example, data indicating "it's cold," the response data generation unit 21 generates response data indicating a response content such as "raise the set temperature by 1 degree" for the air conditioner, which is the in-vehicle device 6.
  • If the voice recognition result data is, for example, data indicating "the volume is low," the response data generation unit 21 generates response data indicating a response content such as "increase the playback volume" for the audio device, which is the in-vehicle device 6.
  • If the voice recognition result data is, for example, data indicating a destination, the response data generation unit 21 generates response data indicating a response content such as "set the destination to XX coast" for the navigation device, which is the in-vehicle device 6.
  • The response data generation unit 21 outputs the response data to the in-vehicle device 6 and the output device 7.
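  • A minimal sketch of this response data generation step (response data generation unit 21), assuming a simple keyword-based dispatch. The pairings of recognition results to responses mirror the examples above; the dispatch mechanism itself is an illustrative assumption.

    RESPONSE_RULES = [
        # (phrase in recognition result, target in-vehicle device, response content)
        ("cold",          "air_conditioner", "raise the set temperature by 1 degree"),
        ("volume is low", "audio_device",    "increase the playback volume"),
    ]

    def generate_response_data(recognition_text, is_for_device):
        """Return (target device, response content), or None when the utterance
        was judged to be a conversation between occupants."""
        if not is_for_device:
            return None  # no response data is generated
        for phrase, device, content in RESPONSE_RULES:
            if phrase in recognition_text:
                return (device, content)
        return None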
  • The in-vehicle device 6 acquires the response data output from the response data generation unit 21 and operates according to the response data. If the in-vehicle device 6 is an air conditioner and the response data indicates, for example, a response content such as "raise the set temperature by 1 degree," the air conditioner operates to raise the set temperature by 1 degree. If the in-vehicle device 6 is an audio device and the response data indicates, for example, a response content such as "increase the playback volume," the audio device increases the playback volume. If the in-vehicle device 6 is a navigation device and the response data indicates, for example, a response content such as "set the destination to XX coast," the navigation device operates to set the destination to XX coast.
  • The output device 7 outputs the response content indicated by the response data output from the response data generation unit 21. If the output device 7 is, for example, a display, the display shows the response content indicated by the response data. If the output device 7 is, for example, a lighting device, the lighting device changes the color of its lighting so that it can be seen that response data has been output from the speech recognition device 5. If the output device 7 is, for example, a speaker, the speaker outputs the response content indicated by the response data as sound.
  • The speaker presence/absence determination unit 14 may also analyze the motion of users other than the speaker based on the video of the vehicle interior indicated by the video data output from the camera 1, and determine from the analysis results whether or not a user other than the speaker is conversing with the speaker.
  • The determination processing of the speaker presence/absence determination unit 14 based on motion analysis results is described concretely below.
  • The speaker presence/absence determination unit 14 acquires the speaker's identification information from the speaker identification processing unit 13, and acquires the video data output from the camera 1.
  • This video data is temporally later than the video data output from the camera 1 to the occupant identification unit 12.
  • The speaker presence/absence determination unit 14 cuts out, from the video of the vehicle interior indicated by the video data, the face images of users other than the speaker indicated by the identification information, and identifies users whose mouths are moving by analyzing each face image. If there is a user whose mouth is moving, the speaker presence/absence determination unit 14 determines that there is a talker conversing with the speaker. If there is no user whose mouth is moving, the speaker presence/absence determination unit 14 determines that there is no talker conversing with the speaker.
  • Here, the speaker presence/absence determination unit 14 identifies users whose mouths are moving by analyzing the face images of users other than the speaker. However, this is only an example; the speaker presence/absence determination unit 14 may instead analyze the face images to identify users who are nodding, shaking their heads, looking at the speaker, or the like. If there is such a user, the speaker presence/absence determination unit 14 determines that there is a talker conversing with the speaker; if there is no such user, it determines that there is no talker conversing with the speaker.
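  • A minimal sketch of this motion-based determination, assuming hypothetical analysis helpers (mouth_is_moving, is_nodding, gaze_is_toward); the patent names the cues but not any concrete API.

    def talker_exists(speaker, occupants, face_images):
        """Return True if any occupant other than the speaker shows a
        conversational cue toward the speaker."""
        for o in occupants:
            if o["id"] == speaker["id"]:
                continue
            face = face_images.get(o["seat"])
            if face is None:
                continue
            if (mouth_is_moving(face) or is_nodding(face)         # hypothetical helpers
                    or gaze_is_toward(face, speaker["seat"])):
                return True
        return False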
  • Here, the response propriety determination unit 19 corrects the degree of response P based on the degree of conversation K indicated by the conversation degree data, as shown in equation (3). However, this is only an example; the response propriety determination unit 19 may instead correct the degree of response P by subtracting the degree of conversation K indicated by the conversation degree data from the degree of response P, as shown in the following equation (4).

    P′ = P − K  (4)

  • When equation (4) is used, the conversation degree update processing unit 17 sets, for example, 0 as the initial value of the degree of conversation K indicated by the speaker's conversation degree data, since 0 leaves the degree of response P unchanged under subtraction, just as the initial value 1 leaves it unchanged under the division of equation (3).
  • In the speech recognition device 5 shown in FIG. 1, every time any occupant speaks, the occupant identification unit 12 performs personal authentication of each occupant, and then the speaker identification processing unit 13 identifies the speaker.
  • However, this is only an example; if there is no change in the seating of any occupant, the occupant identification unit 12 need not repeat personal authentication of each occupant even when one of the occupants speaks,
  • and the speaker identification processing unit 13 may simply identify the speaker. That is, the occupant identification unit 12 may identify the position of the seat on which each occupant is seated based on the video data output from the camera 1, and perform personal authentication of each occupant again only when there is a change in the seating of any occupant.
  • An example of a situation in which there is a change in the seating of an occupant is when an occupant gets into or out of the vehicle.
  • As described above, the speech recognition device 5 according to Embodiment 1 is configured to include the speaker identification unit 11, which identifies, from among a plurality of users present in a space, the speaker, i.e., the user who is speaking, and the response unit 18, which acquires conversation degree data indicating the degree of past conversation between the speaker identified by the speaker identification unit 11 and the users other than the speaker, determines, based on the conversation degree data, whether or not the speaker's utterance is directed to the speech recognition device 5, and generates response data for the speaker's utterance only when the utterance is determined to be directed to the speech recognition device 5. Therefore, compared with the speech recognition device disclosed in Patent Document 1, the speech recognition device 5 can reduce the probability that an utterance directed to the speech recognition device 5 is misidentified as a conversation between users.
  • Embodiment 2. In Embodiment 2, a voice recognition device 5 having a travel purpose prediction unit 22 that predicts the travel purpose of the vehicle from the destination set in the navigation device 4 or from the travel route of the vehicle will be described.
  • FIG. 6 is a configuration diagram showing the speech recognition device 5 according to Embodiment 2. FIG. 7 is a hardware configuration diagram showing hardware of the speech recognition device 5 according to Embodiment 2.
  • In FIGS. 6 and 7, the same reference numerals as those in FIGS. 1 and 2 denote the same or corresponding parts, so descriptions thereof are omitted.
  • The travel purpose prediction unit 22 is implemented by, for example, a travel purpose prediction circuit 35 shown in FIG. 7.
  • The travel purpose prediction unit 22 acquires destination setting data indicating the destination set in the navigation device 4, or travel route data indicating the travel route of the vehicle recorded in the navigation device 4.
  • The travel purpose prediction unit 22 predicts the travel purpose of the vehicle from the destination indicated by the destination setting data or from the travel route indicated by the travel route data.
  • The travel purpose prediction unit 22 outputs travel purpose data indicating the predicted travel purpose of the vehicle to the conversation degree update unit 23.
  • Here, the travel purpose prediction unit 22 acquires the travel route data from the navigation device 4.
  • However, when the in-vehicle sensor 3 is implemented by, for example, a GPS sensor, the travel purpose prediction unit 22 may instead acquire GPS data output from the GPS sensor and identify the travel route of the vehicle from the GPS data.
  • Alternatively, when the in-vehicle sensor 3 is implemented by, for example, a gyro sensor, the travel purpose prediction unit 22 may acquire angular velocity data output from the gyro sensor and identify the travel route of the vehicle from the angular velocity data.
  • The conversation degree update unit 23 is implemented by, for example, a conversation degree update circuit 36 shown in FIG. 7.
  • The conversation degree update unit 23 includes a conversation degree data storage unit 24 and a conversation degree update processing unit 25.
  • The conversation degree update unit 23 acquires the speaker's identification information from the speaker identification unit 11, and acquires the travel purpose data from the travel purpose prediction unit 22.
  • The conversation degree update unit 23 acquires, from among a plurality of pieces of conversation degree data for each travel purpose indicating the past degree of conversation of the speaker indicated by the identification information, the conversation degree data for the travel purpose indicated by the travel purpose data.
  • Here, the conversation degree update unit 23 acquires the conversation degree data from the internal conversation degree data storage unit 24. However, the conversation degree update unit 23 may instead acquire the conversation degree data from outside the speech recognition device 5.
  • The conversation degree update unit 23 updates the acquired conversation degree data so as to raise the degree of conversation when the speaker presence/absence determination unit 14 determines that a talker exists, and updates the acquired conversation degree data so as to lower the degree of conversation when the speaker presence/absence determination unit 14 determines that no talker exists.
  • The conversation degree update processing unit 25 acquires the speaker's identification information from the speaker identification processing unit 13, and acquires the travel purpose data from the travel purpose prediction unit 22.
  • The conversation degree update processing unit 25 acquires, from among the plurality of pieces of conversation degree data for each travel purpose stored in the conversation degree data storage unit 24, the conversation degree data of the speaker indicated by the identification information output from the speaker identification processing unit 13 for the travel purpose indicated by the travel purpose data. If the determination result output from the speaker presence/absence determination unit 14 indicates that a talker exists, the conversation degree update processing unit 25 updates the acquired conversation degree data so as to raise the degree of conversation.
  • If the determination result output from the speaker presence/absence determination unit 14 indicates that no talker exists, the conversation degree update processing unit 25 updates the acquired conversation degree data so as to lower the degree of conversation.
  • The conversation degree update processing unit 25 stores the updated conversation degree data in the conversation degree data storage unit 24.
  • Here, it is assumed that each of the speaker identification unit 11, the speaker presence/absence determination unit 14, the travel purpose prediction unit 22, the conversation degree update unit 23, and the response unit 18, which are the components of the speech recognition device 5, is implemented by dedicated hardware as shown in FIG. 7. That is, it is assumed that the speech recognition device 5 is implemented by the speaker identification circuit 31, the speaker presence/absence determination circuit 32, the travel purpose prediction circuit 35, the conversation degree update circuit 36, and the response circuit 34.
  • Each of the speaker identification circuit 31, the speaker presence/absence determination circuit 32, the travel purpose prediction circuit 35, the conversation degree update circuit 36, and the response circuit 34 may be, for example, a single circuit, a composite circuit, a programmed processor, a parallel-programmed processor, an ASIC, an FPGA, or a combination thereof.
  • The components of the speech recognition device 5 are not limited to being implemented by dedicated hardware; the speech recognition device 5 may also be implemented by software, firmware, or a combination of software and firmware.
  • When the speech recognition device 5 is implemented by software or firmware, a program for causing a computer to execute the respective processing procedures of the speaker identification unit 11, the speaker presence/absence determination unit 14, the travel purpose prediction unit 22, the conversation degree update unit 23, and the response unit 18 is stored in the memory 41 shown in FIG. 3. The processor 42 shown in FIG. 3 then executes the program stored in the memory 41.
  • FIG. 7 shows an example in which each component of the speech recognition device 5 is implemented by dedicated hardware, and FIG. 3 shows an example in which the speech recognition device 5 is implemented by software, firmware, or the like. However, this is only an example; some components of the speech recognition device 5 may be implemented by dedicated hardware while the remaining components are implemented by software, firmware, or the like.
  • The navigation device 4 outputs destination setting data indicating the destination to the travel purpose prediction unit 22, and the travel purpose prediction unit 22 acquires the destination setting data from the navigation device 4.
  • The travel purpose prediction unit 22 predicts the travel purpose of the vehicle from the destination indicated by the destination setting data. If the destination is a leisure facility such as an amusement park or a ball game ground, the travel purpose prediction unit 22 predicts that the travel purpose of the vehicle is leisure. If the destination is a business facility such as an office building or a factory, the travel purpose prediction unit 22 predicts that the travel purpose of the vehicle is business. If the destination is a shopping facility such as a department store or a supermarket, the travel purpose prediction unit 22 predicts that the travel purpose of the vehicle is shopping.
  • The travel purpose prediction unit 22 outputs travel purpose data indicating the travel purpose of the vehicle to the conversation degree update unit 23.
  • the travel purpose prediction unit 22 acquires travel route data indicating the travel route of the vehicle from the navigation device 4 .
  • Alternatively, the travel purpose prediction unit 22 may acquire GPS data from the GPS sensor instead of acquiring travel route data from the navigation device 4, and identify the travel route of the vehicle from the GPS data.
  • The travel purpose prediction unit 22 may also acquire angular velocity data from the gyro sensor and identify the travel route of the vehicle from the angular velocity data.
  • The travel purpose prediction unit 22 supplies the travel route data to a learning model (not shown) and acquires, from the learning model, travel purpose data indicating the travel purpose of the vehicle.
  • The travel purpose prediction unit 22 outputs the travel purpose data to the conversation degree update unit 23.
  • The learning model learns the travel purpose of the vehicle by machine learning, using travel route data indicating travel routes of the vehicle together with teacher data indicating the corresponding travel purposes.
  • When given travel route data, the trained learning model outputs travel purpose data indicating the travel purpose of the vehicle.
  • the travel purpose prediction unit 22 predicts the travel purpose of the vehicle using a learning model. However, this is only an example, and the travel purpose prediction unit 22 may predict the travel purpose of the vehicle using, for example, a rule base.
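Where the learning model is used, the flow could resemble the following sketch. The featurization and the 1-nearest-neighbour classifier are illustrative assumptions; the patent only states that the model is trained on travel route data with teacher data indicating the travel purpose.

```python
# A minimal sketch of the learning-model route for travel purpose prediction.
# Featurization (route end point and length) and the 1-nearest-neighbour rule
# are illustrative assumptions; the patent does not specify the model.
import math

def route_features(route):
    """route: list of (lat, lon) points -> (end_lat, end_lon, n_points)."""
    end_lat, end_lon = route[-1]
    return (end_lat, end_lon, float(len(route)))

class PurposeModel:
    """1-NN over featurized training routes labelled with travel purposes."""
    def __init__(self):
        self.samples = []  # (features, purpose)

    def fit(self, routes, purposes):  # routes + teacher data
        self.samples = [(route_features(r), p) for r, p in zip(routes, purposes)]

    def predict(self, route):
        f = route_features(route)
        return min(self.samples, key=lambda s: math.dist(f, s[0]))[1]

model = PurposeModel()
model.fit([[(35.0, 139.0), (35.7, 139.8)]], ["business"])
print(model.predict([(35.1, 139.1), (35.6, 139.7)]))  # -> "business"
```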
  • The conversation degree update processing unit 25 acquires the speaker's identification information from the speaker identification processing unit 13 and acquires the travel purpose data from the travel purpose prediction unit 22.
  • From among the plurality of conversation degree data stored in the conversation degree data storage unit 24 for each travel purpose, the conversation degree update processing unit 25 acquires the conversation degree data that belongs to the speaker indicated by the identification information and that corresponds to the travel purpose indicated by the travel purpose data.
  • The conversation degree update processing unit 25 acquires, from the speaker presence/absence determination unit 14, a determination result indicating whether or not a talker conversing with the speaker is present.
  • If the determination result indicates that a talker is present, the conversation degree update processing unit 25 updates the acquired conversation degree data so as to increase the degree of conversation K.
  • If the determination result indicates that no talker is present, the conversation degree update processing unit 25 updates the acquired conversation degree data so as to decrease the degree of conversation K.
  • The update processing of the conversation degree data by the conversation degree update processing unit 25 is the same as that by the conversation degree update processing unit 17 shown in FIG. 1, so a specific description of the update processing is omitted.
  • The conversation degree update processing unit 25 stores the updated conversation degree data in the conversation degree data storage unit 24.
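A minimal sketch of the per-speaker, per-travel-purpose bookkeeping described above follows; the step size and the 0-to-1 clamp are assumptions, since the text does not specify how the degree of conversation K is scaled.

```python
# A minimal sketch of conversation degree data keyed by speaker and travel
# purpose, as maintained by the conversation degree update processing unit 25.
# The step size and the 0..1 clamp are illustrative assumptions.
STEP = 0.1

class ConversationDegreeStore:
    def __init__(self):
        self.degrees = {}  # (speaker_id, travel_purpose) -> degree K in [0, 1]

    def update(self, speaker_id, travel_purpose, talker_present):
        key = (speaker_id, travel_purpose)
        k = self.degrees.get(key, 0.0)
        k = k + STEP if talker_present else k - STEP  # raise or lower K
        self.degrees[key] = min(1.0, max(0.0, k))

store = ConversationDegreeStore()
store.update("occupant_A", "leisure", talker_present=True)
print(store.degrees[("occupant_A", "leisure")])  # -> 0.1
```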
  • In Embodiment 2, the space is the passenger compartment of the vehicle, and the plurality of users present in the space are the plurality of occupants of the vehicle; the destination is assumed to be set in the navigation device 4.
  • As described above, the speech recognition device 5 shown in FIG. 6 includes the travel purpose prediction unit 22, which predicts the travel purpose of the vehicle from the destination or from the travel route of the vehicle. The conversation degree update unit 23 of the speech recognition device 5 acquires, from among the plurality of conversation degree data for each travel purpose indicating the speaker's past degree of conversation, the conversation degree data for the travel purpose predicted by the travel purpose prediction unit 22.
  • If the speaker presence/absence determination unit 14 determines that a talker is present, the conversation degree update unit 23 updates the acquired conversation degree data so as to increase the degree of conversation; if the speaker presence/absence determination unit 14 determines that no talker is present, it updates the acquired conversation degree data so as to decrease the degree of conversation. Therefore, the speech recognition device 5 shown in FIG. 6 can reduce, to an even greater extent than the speech recognition device 5 shown in FIG. 1, the probability of misidentifying an utterance directed to the device as a conversation between users.
  • Embodiment 3. In Embodiment 3, a speech recognition device 5 including a conversation degree update unit 26 will be described; the conversation degree update unit 26 acquires, from among a plurality of conversation degree data for each seat position indicating the speaker's past degree of conversation, the conversation degree data for the seat position identified by the speaker identification unit 11.
  • FIG. 8 is a configuration diagram showing a speech recognition device 5 according to Embodiment 3.
  • FIG. 9 is a hardware configuration diagram showing hardware of the speech recognition device 5 according to the third embodiment.
  • In FIGS. 8 and 9, the same reference numerals as those in FIGS. 1 and 2 denote the same or corresponding parts, so their description is omitted.
  • The conversation degree update unit 26 is implemented by, for example, the conversation degree update circuit 37 shown in FIG. 9.
  • The conversation degree update unit 26 includes a conversation degree data storage unit 27 and a conversation degree update processing unit 28.
  • The conversation degree update unit 26 acquires seat position data indicating the seat position at which each occupant is seated, and acquires the speaker's identification information from the speaker identification unit 11.
  • From among a plurality of conversation degree data for each seat position indicating the past degree of conversation of the speaker indicated by the identification information, the conversation degree update unit 26 acquires the conversation degree data for the seat position indicated by the seat position data.
  • The conversation degree update unit 26 acquires the conversation degree data from the internal conversation degree data storage unit 27.
  • However, the conversation degree update unit 26 may instead acquire the conversation degree data from outside the speech recognition device 5.
  • If the speaker presence/absence determination unit 14 determines that a talker is present, the conversation degree update unit 26 updates the acquired conversation degree data so as to increase the degree of conversation. If the speaker presence/absence determination unit 14 determines that no talker is present, the conversation degree update unit 26 updates the acquired conversation degree data so as to decrease the degree of conversation.
  • The conversation degree data storage unit 27 is a storage medium that stores a plurality of conversation degree data for each seat position. When a plurality of occupants are present in the vehicle, each occupant can become a speaker, so a plurality of conversation degree data for each seat position are stored for each occupant. Assume, for example, that three occupants are present in the passenger compartment and that the three occupants are C1, C2, and C3.
  • The conversation degree data storage unit 27 stores, as the conversation degree data for occupant C1, the conversation degree data for pattern P1 and the conversation degree data for pattern P2.
  • The conversation degree data storage unit 27 stores, as the conversation degree data for occupant C2, the conversation degree data for pattern P1, the conversation degree data for pattern P2, the conversation degree data for pattern P3, and the conversation degree data for pattern P4.
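The storage layout just described can be pictured as a nested map, as in this sketch; the numeric degrees below are placeholders, not values from the patent.

```python
# A minimal sketch of per-occupant, per-pattern conversation degree storage in
# the conversation degree data storage unit 27. The pattern identifiers mirror
# the P1..P4 of the text; the numeric degrees are illustrative placeholders.
conversation_degree_store = {
    "C1": {"P1": 0.4, "P2": 0.1},
    "C2": {"P1": 0.2, "P2": 0.6, "P3": 0.3, "P4": 0.0},
    "C3": {},  # filled in as patterns involving C3 are observed
}

def get_degree(speaker_id: str, pattern: str) -> float:
    """Look up a speaker's past degree of conversation for a seat-position pattern."""
    return conversation_degree_store.get(speaker_id, {}).get(pattern, 0.0)

print(get_degree("C2", "P3"))  # -> 0.3
```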
  • The conversation degree update processing unit 28 acquires, from the occupant identification unit 12, seat position data indicating the seat position at which each occupant is seated, and acquires the speaker's identification information from the speaker identification processing unit 13.
  • From among the plurality of conversation degree data stored in the conversation degree data storage unit 27 for each seat position, the conversation degree update processing unit 28 acquires the conversation degree data that belongs to the speaker indicated by the identification information and that corresponds to the seat position indicated by the seat position data. If the determination result output from the speaker presence/absence determination unit 14 indicates that a talker is present, the conversation degree update processing unit 28 updates the acquired conversation degree data so as to increase the degree of conversation.
  • If the determination result indicates that no talker is present, the conversation degree update processing unit 28 updates the acquired conversation degree data so as to decrease the degree of conversation.
  • The conversation degree update processing unit 28 stores the updated conversation degree data in the conversation degree data storage unit 27.
  • The conversation degree update unit 26 may also be applied to the speech recognition device 5 shown in FIG. 6. In that case, the conversation degree data storage unit 27 stores a plurality of conversation degree data for each combination of travel purpose and seat position.
  • The conversation degree update processing unit 28 then acquires, from among the plurality of conversation degree data stored in the conversation degree data storage unit 27, the conversation degree data that belongs to the speaker indicated by the identification information output from the speaker identification processing unit 13 and that corresponds to the travel purpose indicated by the travel purpose data and the seat position indicated by the seat position data.
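A combined lookup keyed by travel purpose and seat-position pattern could then look like the following sketch; the key names and step size remain assumptions.

```python
# A minimal sketch of the combined keying described above: conversation degree
# data indexed by (speaker, travel purpose, seat-position pattern). The key
# values and step size are illustrative assumptions.
combined_store = {}  # (speaker_id, travel_purpose, pattern) -> degree K

def update_combined(speaker_id, travel_purpose, pattern, talker_present, step=0.1):
    key = (speaker_id, travel_purpose, pattern)
    k = combined_store.get(key, 0.0)
    k = k + step if talker_present else k - step
    combined_store[key] = min(1.0, max(0.0, k))

update_combined("C2", "shopping", "P3", talker_present=False)
print(combined_store[("C2", "shopping", "P3")])  # -> 0.0
```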
  • Each of the speaker identification unit 11, the speaker presence/absence determination unit 14, the conversation degree update unit 26, and the response unit 18, which are the components of the speech recognition device 5, is assumed to be implemented by dedicated hardware as shown in FIG. 9. That is, the speech recognition device 5 is assumed to be implemented by a speaker identification circuit 31, a speaker presence/absence determination circuit 32, a conversation degree update circuit 37, and a response circuit 34.
  • Each of the speaker identification circuit 31, the speaker presence/absence determination circuit 32, the conversation degree update circuit 37, and the response circuit 34 corresponds to, for example, a single circuit, a composite circuit, a programmed processor, a parallel-programmed processor, an ASIC, an FPGA, or a combination thereof.
  • The components of the speech recognition device 5 are not limited to implementations in dedicated hardware; the speech recognition device 5 may instead be implemented by software, firmware, or a combination of software and firmware.
  • When the speech recognition device 5 is implemented by software, firmware, or the like, a program for causing a computer to execute the processing procedures of the speaker identification unit 11, the speaker presence/absence determination unit 14, the conversation degree update unit 26, and the response unit 18 is stored in the memory 41 shown in FIG. 3, and the processor 42 shown in FIG. 3 executes the program stored in the memory 41.
  • FIG. 9 shows an example in which each component of the speech recognition device 5 is implemented by dedicated hardware, and FIG. 3 shows an example in which the speech recognition device 5 is implemented by software, firmware, or the like.
  • However, this is only an example; some components of the speech recognition device 5 may be implemented by dedicated hardware while the remaining components are implemented by software, firmware, or the like.
  • As in Embodiment 1, the occupant identification unit 12 identifies the seat position at which each occupant is seated.
  • The occupant identification unit 12 outputs seat position data indicating the seat position at which each occupant is seated to the conversation degree update processing unit 28.
  • The speaker identification processing unit 13 outputs the speaker's identification information to the conversation degree update processing unit 28.
  • The conversation degree update processing unit 28 acquires the seat position data from the occupant identification unit 12 and the speaker's identification information from the speaker identification processing unit 13.
  • From among the plurality of conversation degree data stored in the conversation degree data storage unit 27 for each seat position, the conversation degree update processing unit 28 acquires the conversation degree data that belongs to the speaker indicated by the identification information and that corresponds to the seat position indicated by the seat position data. Assume, for example, that three occupants are present in the passenger compartment and that the three occupants are C1, C2, and C3.
  • The conversation degree data storage unit 27 stores, as the conversation degree data for occupant C1, the conversation degree data for pattern P1 and the conversation degree data for pattern P2.
  • The conversation degree update processing unit 28 then acquires, for example, the conversation degree data for pattern P1 as the conversation degree data for occupant C1.
  • The conversation degree update processing unit 28 acquires, from the speaker presence/absence determination unit 14, a determination result indicating whether or not a talker conversing with the speaker is present.
  • If the determination result indicates that a talker is present, the conversation degree update processing unit 28 updates the acquired conversation degree data so as to increase the degree of conversation K.
  • If the determination result indicates that no talker is present, the conversation degree update processing unit 28 updates the acquired conversation degree data so as to decrease the degree of conversation K.
  • The update processing of the conversation degree data by the conversation degree update processing unit 28 is the same as that by the conversation degree update processing unit 17 shown in FIG. 1, so a specific description of the update processing is omitted.
  • The conversation degree update processing unit 28 stores the updated conversation degree data in the conversation degree data storage unit 27.
  • As described above, in Embodiment 3, the space is the passenger compartment of a vehicle, and the plurality of users present in the space are the plurality of occupants of the vehicle.
  • The respective seat positions of the speaker and the talker are identified based on the video of the vehicle interior or the sound inside the vehicle.
  • The conversation degree update unit 26 acquires, from among a plurality of conversation degree data for each seat position indicating the speaker's past degree of conversation, the conversation degree data for the seat position identified by the speaker identification unit 11; the speech recognition device 5 is configured to update the acquired conversation degree data so as to increase the degree of conversation if the speaker presence/absence determination unit 14 determines that a talker is present, and to update it so as to decrease the degree of conversation if the speaker presence/absence determination unit 14 determines that no talker is present. Therefore, the speech recognition device 5 shown in FIG. 8 can reduce, to an even greater extent than the speech recognition device 5 shown in FIG. 1, the probability of misidentifying an utterance directed to the device as a conversation between users.
  • the present disclosure is suitable for speech recognition devices and speech recognition methods.


Abstract

This voice recognition device (5) includes: a speaker identification unit (11) that identifies a speaker, i.e., the user who is speaking among a plurality of users present in a space, on the basis of video of the space captured by a camera (1) or sound in the space picked up by a microphone (2); and a response unit (18) that acquires conversation degree data indicating the degree of past conversation between the speaker identified by the speaker identification unit (11) and users other than the speaker among the plurality of users, determines on the basis of the conversation degree data whether or not the speaker's speech is directed at the voice recognition device (5), and generates response data for the speaker's speech only when determining that the speech is directed at the voice recognition device (5).

Description

Speech recognition device and speech recognition method
JP 2018-136568 A (Patent Document 1)
The present disclosure has been made to solve the above problems, and an object of the present disclosure is to provide a speech recognition device and a speech recognition method that can reduce, compared with the speech recognition device disclosed in Patent Document 1, the probability of misidentifying a user's utterance directed to the device as a conversation between users.
A speech recognition device according to the present disclosure includes: a speaker identification unit that identifies, based on video of a space captured by a camera or sound in the space collected by a microphone, a speaker, i.e., the user who is speaking among a plurality of users present in the space; and a response unit that acquires conversation degree data indicating the degree of past conversation between the speaker and users other than the speaker identified by the speaker identification unit among the plurality of users, determines, based on the conversation degree data, whether or not the speaker's utterance is directed to the speech recognition device, and generates response data for the speaker's utterance only when it determines that the utterance is directed to the speech recognition device.
According to the present disclosure, when a user's utterance is directed to the speech recognition device, the probability of misidentifying it as a conversation between users can be made lower than with the speech recognition device disclosed in Patent Document 1.
Brief description of the drawings:
FIG. 1 is a configuration diagram showing the speech recognition device 5 according to Embodiment 1.
FIG. 2 is a hardware configuration diagram showing the hardware of the speech recognition device 5 according to Embodiment 1.
FIG. 3 is a hardware configuration diagram of a computer in the case where the speech recognition device 5 is implemented by software, firmware, or the like.
FIG. 4 is a flowchart showing the speech recognition method, which is the processing procedure of the speech recognition device 5 shown in FIG. 1.
FIGS. 5A and 5B are flowcharts showing the processing procedure of the conversation degree update unit 15 in the speech recognition device 5 shown in FIG. 1.
FIG. 6 is a configuration diagram showing the speech recognition device 5 according to Embodiment 2.
FIG. 7 is a hardware configuration diagram showing the hardware of the speech recognition device 5 according to Embodiment 2.
FIG. 8 is a configuration diagram showing the speech recognition device 5 according to Embodiment 3.
FIG. 9 is a hardware configuration diagram showing the hardware of the speech recognition device 5 according to Embodiment 3.
Hereinafter, in order to describe the present disclosure in more detail, embodiments for carrying out the present disclosure will be described with reference to the accompanying drawings.
Embodiment 1.
FIG. 1 is a configuration diagram showing the speech recognition device 5 according to Embodiment 1.
FIG. 2 is a hardware configuration diagram showing the hardware of the speech recognition device 5 according to Embodiment 1.
The camera 1 is implemented by, for example, an infrared camera, a visible-light camera, or an ultraviolet camera.
The camera 1 captures the interior of the space and outputs video data representing the video of the space to the speech recognition device 5.
The microphone 2 collects sound in the space and outputs sound data representing the sound in the space to the speech recognition device 5.
In Embodiment 1, the space is assumed to be the passenger compartment of a vehicle. The plurality of users present in the space are therefore the occupants of the vehicle.
However, this is only an example; the space may instead be, for example, a room in a building. In that case, the plurality of users present in the space are the residents of the room, guests, or the like.
The in-vehicle sensor 3 is implemented by, for example, pressure sensors installed in each of a plurality of seats, an infrared sensor installed in the vehicle, a GPS (Global Positioning System) sensor installed in the vehicle, or a gyro sensor installed in the vehicle.
When the in-vehicle sensor 3 is implemented by a plurality of pressure sensors, the pressure sensor that senses the weight of an occupant outputs a sensing signal to the speech recognition device 5.
When the in-vehicle sensor 3 is implemented by, for example, a GPS sensor, the in-vehicle sensor 3 outputs travel position data indicating the position where the vehicle is traveling to the speech recognition device 5.
The navigation device 4 is an in-vehicle device installed in the vehicle or a device such as a smartphone brought into the vehicle by a user.
The navigation device 4 has, for example, a navigation function of guiding the vehicle along a route to a destination.
The navigation device 4 outputs, for example, route data indicating the route to the destination to the speech recognition device 5.
The speech recognition device 5 includes a speaker identification unit 11, a speaker presence/absence determination unit 14, a conversation degree update unit 15, and a response unit 18.
The speech recognition device 5 recognizes the speech of an occupant, who is a user, and determines whether or not the occupant's utterance is directed to the speech recognition device 5.
The speech recognition device 5 generates response data for the speaker's utterance only when it determines that the utterance is directed to the speech recognition device 5.
The speech recognition device 5 outputs the response data to each of the in-vehicle device 6 and the output device 7.
The in-vehicle device 6 is, for example, a navigation device, an air conditioner, or an audio device.
The in-vehicle device 6 operates according to the response data output from the speech recognition device 5.
The output device 7 is, for example, a display, a lighting device, or a speaker.
The output device 7 operates according to the response data output from the speech recognition device 5.
The speaker identification unit 11 is implemented by, for example, the speaker identification circuit 31 shown in FIG. 2.
The speaker identification unit 11 includes an occupant identification unit 12 and a speaker identification processing unit 13.
Based on the video of the vehicle interior, which is the space captured by the camera 1, or the sound of the vehicle interior collected by the microphone 2, the speaker identification unit 11 identifies, among the plurality of occupants present in the vehicle interior, the speaker, i.e., the occupant who is speaking.
The speaker identification unit 11 outputs speaker data indicating the identified speaker to the speaker presence/absence determination unit 14.
The occupant identification unit 12 acquires a sensing signal from the pressure sensor that is sensing an occupant's weight among the plurality of pressure sensors included in the in-vehicle sensor 3.
By acquiring the sensing signal, the occupant identification unit 12 identifies which of the plurality of pressure sensors is outputting a sensing signal and determines that an occupant is seated in the seat where the identified pressure sensor is installed.
The occupant identification unit 12 acquires the video data output from the camera 1.
From the video of the vehicle interior indicated by the video data, the occupant identification unit 12 cuts out, from the region that includes the seat in which each occupant is seated, the video in which each occupant's face appears (hereinafter referred to as a "face image").
The occupant identification unit 12 performs personal authentication of each occupant by analyzing each face image, and outputs identification information of each occupant to the speaker identification processing unit 13.
The occupant identification unit 12 also outputs each occupant's face image to the speaker identification processing unit 13.
Note that, as described later, the occupant identification unit 12 may identify the seat position at which each occupant is seated based on the video data output from the camera 1.
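As a rough illustration of the pressure-sensor occupancy step, a sketch follows; the seat names, the weight threshold, and the crop regions are hypothetical, since the text does not specify them.

```python
# A minimal sketch of seat-occupancy detection from pressure-sensor signals,
# assuming each seat's sensor reports a weight value. Sensor names, the
# threshold, and the crop regions are illustrative assumptions.
SEAT_CROP = {  # hypothetical image regions (x, y, w, h) per seat
    "driver": (0, 0, 320, 480),
    "passenger": (320, 0, 320, 480),
}
WEIGHT_THRESHOLD_KG = 10.0

def occupied_seats(sensor_readings: dict) -> list:
    """sensor_readings: seat name -> sensed weight in kg."""
    return [seat for seat, w in sensor_readings.items() if w > WEIGHT_THRESHOLD_KG]

print(occupied_seats({"driver": 62.0, "passenger": 0.3}))  # -> ['driver']
```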
In the speech recognition device 5 shown in FIG. 1, the occupant identification unit 12 performs personal authentication of each occupant based on the video of the vehicle interior. However, this is only an example; if every occupant present in the vehicle interior is speaking, the occupant identification unit 12 may perform personal authentication of each occupant based on the sound of the vehicle interior indicated by the sound data.
That is, the occupant identification unit 12 may extract each occupant's voice from the sound of the vehicle interior indicated by the sound data and perform voiceprint authentication of each voice, thereby performing personal authentication of each occupant. Besides the occupants' voices, the sound of the vehicle interior includes the running sound of the vehicle, the sound of cold or warm air blowing from the air conditioner, noise from outside the vehicle, the sound of music played by the audio device, and so on.
The speaker identification processing unit 13 acquires the identification information of each occupant and each occupant's face image from the occupant identification unit 12.
The speaker identification processing unit 13 identifies an occupant whose mouth is moving by analyzing each face image.
Taking the occupant whose mouth is moving to be the speaker, the speaker identification processing unit 13 outputs the speaker's identification information to each of the speaker presence/absence determination unit 14, the conversation degree update processing unit 17, and the response propriety determination unit 19.
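One way the mouth-movement test could be realized is sketched below; the landmark detector, the variance test, and the threshold are all assumptions, since the text only says that an occupant whose mouth is moving is taken to be the speaker.

```python
# A minimal sketch of mouth-movement detection over a sequence of face images,
# assuming a landmark detector that returns mouth-opening height in pixels.
# The detector, the threshold, and the variance test are illustrative assumptions.
import statistics

def is_speaking(face_frames, detect_mouth_height, var_threshold=4.0):
    """An occupant whose mouth height fluctuates across frames is taken to speak."""
    heights = [detect_mouth_height(f) for f in face_frames]
    return statistics.pvariance(heights) > var_threshold

# Demo with a fake detector standing in for real landmark extraction.
fake_heights = iter([2.0, 9.0, 3.0, 8.0])
print(is_speaking([None] * 4, lambda f: next(fake_heights)))  # -> True
```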
In the speech recognition device 5 shown in FIG. 1, the speaker identification processing unit 13 identifies the speaker based on the video of the vehicle interior. However, this is only an example; the speaker identification processing unit 13 may instead identify the speaker based on the sound of the vehicle interior collected by the microphones 2.
That is, when a microphone 2 is installed at each seat, for example, the speaker identification processing unit 13 may identify as the speaker the occupant seated at the seat whose microphone 2 collected the loudest voice among the plurality of microphones 2.
When, for example, a single microphone 2 is installed in the vehicle interior, the speaker identification processing unit 13 may identify the speaker from the direction of arrival of the voice at the microphone 2. In these cases, the speaker identification processing unit 13 outputs sound data representing the identified speaker's voice to the speaker presence/absence determination unit 14.
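The loudest-microphone rule admits a very small sketch, assuming one microphone per seat and RMS level as the loudness measure; the seat names and sample format are illustrative.

```python
# A minimal sketch of the loudest-microphone rule for speaker identification,
# assuming one microphone per seat and RMS level as the loudness measure.
import math

def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def identify_speaker_seat(mic_buffers: dict) -> str:
    """mic_buffers: seat name -> list of audio samples for the same time window."""
    return max(mic_buffers, key=lambda seat: rms(mic_buffers[seat]))

buffers = {"driver": [0.01, -0.02, 0.015], "rear_left": [0.2, -0.25, 0.22]}
print(identify_speaker_seat(buffers))  # -> 'rear_left'
```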
The speaker presence/absence determination unit 14 is implemented by, for example, the speaker presence/absence determination circuit 32 shown in FIG. 2.
Based on the number of speakers identified by the speaker identification unit 11, the speaker presence/absence determination unit 14 determines whether or not, among the plurality of occupants, there is a talker, i.e., an occupant conversing with the speaker identified by the speaker identification unit 11.
That is, if the speaker presence/absence determination unit 14 acquires identification information for a plurality of speakers from the speaker identification processing unit 13, it determines that a talker is present.
The speaker presence/absence determination unit 14 outputs, to the conversation degree update unit 15, a determination result indicating whether or not a talker conversing with the speaker is present.
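The determination itself reduces to counting the identified speakers, as in this sketch:

```python
# A minimal sketch of the talker presence test: a talker is judged to be
# present exactly when more than one speaker was identified in the same window.
def talker_present(speaker_ids: set) -> bool:
    return len(speaker_ids) >= 2

print(talker_present({"C1"}))        # -> False
print(talker_present({"C1", "C3"}))  # -> True
```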
The conversation degree update unit 15 is implemented by, for example, the conversation degree update circuit 33 shown in FIG. 2.
The conversation degree update unit 15 includes a conversation degree data storage unit 16 and a conversation degree update processing unit 17.
The conversation degree update unit 15 acquires conversation degree data indicating the speaker's past degree of conversation.
In the speech recognition device 5 shown in FIG. 1, the conversation degree update unit 15 acquires the conversation degree data from the internal conversation degree data storage unit 16. However, this is only an example; the conversation degree update unit 15 may instead acquire the conversation degree data from outside the speech recognition device 5.
If the speaker presence/absence determination unit 14 determines that a talker is present, the conversation degree update unit 15 updates the acquired conversation degree data so as to increase the degree of conversation.
If the speaker presence/absence determination unit 14 determines that no talker is present, the conversation degree update unit 15 updates the acquired conversation degree data so as to decrease the degree of conversation.
Here, the degree of conversation is, for example, the frequency of conversation between the speaker and fellow passengers, the number of conversations, or the conversation time.
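As one concrete, and purely illustrative, choice of measure, the degree of conversation could be kept as an exponentially decayed count of windows in which a talker was present:

```python
# A minimal sketch of one possible degree-of-conversation measure: an
# exponentially decayed count of windows in which a talker was present.
# The decay factor and window handling are illustrative assumptions.
class ConversationDegree:
    def __init__(self, decay=0.9):
        self.k = 0.0
        self.decay = decay

    def observe_window(self, talker_present: bool):
        """Called once per fixed time window while the speaker is in the car."""
        self.k = self.decay * self.k + (1.0 if talker_present else 0.0)

degree = ConversationDegree()
for present in [True, True, False]:
    degree.observe_window(present)
print(round(degree.k, 3))  # -> 1.71
```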
The conversation degree data storage unit 16 is a storage medium that stores conversation degree data indicating the degree of past conversation between the speaker and the occupants other than the speaker among the plurality of occupants present in the vehicle interior. When a plurality of occupants are present in the vehicle interior, each occupant can become a speaker, so past conversation degree data is stored for each occupant. That is, if, for example, two occupants are present in the vehicle interior, the conversation degree data storage unit 16 stores two conversation degree data, and if, for example, three occupants are present, it stores three conversation degree data.
The conversation degree update processing unit 17 acquires, from among the conversation degree data stored in the conversation degree data storage unit 16, the conversation degree data of the speaker indicated by the identification information output from the speaker identification processing unit 13.
If the determination result output from the speaker presence/absence determination unit 14 indicates that a talker is present, the conversation degree update processing unit 17 updates the acquired conversation degree data so as to increase the degree of conversation.
If the determination result output from the speaker presence/absence determination unit 14 indicates that no talker is present, the conversation degree update processing unit 17 updates the acquired conversation degree data so as to decrease the degree of conversation.
The conversation degree update processing unit 17 stores the updated conversation degree data in the conversation degree data storage unit 16.
The response unit 18 is implemented by, for example, the response circuit 34 shown in FIG. 2.
The response unit 18 includes a response propriety determination unit 19, a speech recognition unit 20, and a response data generation unit 21.
The response unit 18 acquires conversation degree data indicating the degree of past conversation between the speaker and the occupants other than the speaker identified by the speaker identification unit 11 among the plurality of occupants present in the vehicle interior.
Based on the acquired conversation degree data, the response unit 18 determines whether or not the speaker's utterance is directed to the speech recognition device 5.
The response unit 18 generates response data for the speaker's utterance only when it determines that the speaker's utterance is directed to the speech recognition device 5.
The higher the degree of past conversation between the speaker and the other occupants, the more likely the speaker's utterance is part of a conversation with those occupants and the less likely it is directed to the speech recognition device. Conversely, the lower the degree of past conversation between the speaker and the other occupants, the less likely the speaker's utterance is part of a conversation with them and the more likely it is directed to the speech recognition device.
The response propriety determination unit 19 acquires, from among the conversation degree data stored in the conversation degree data storage unit 16, the conversation degree data of the speaker indicated by the identification information output from the speaker identification processing unit 13.
Based on the video of the vehicle interior captured by the camera 1, the sound of the vehicle interior collected by the microphone 2, the travel position data output from the in-vehicle sensor 3, the route data output from the navigation device 4, and the like, the response propriety determination unit 19 calculates a response degree, which is an index for determining whether or not the speaker's utterance is directed to the speech recognition device 5.
The response propriety determination unit 19 corrects the response degree based on the degree of conversation indicated by the acquired conversation degree data. The corrected response degree becomes smaller as the degree of conversation indicated by the conversation degree data becomes larger.
If the corrected response degree is equal to or greater than a first threshold, the response propriety determination unit 19 determines that the speaker's utterance is directed to the speech recognition device 5.
If the corrected response degree is less than the first threshold, the response propriety determination unit 19 determines that the speaker's utterance is not directed to the speech recognition device 5. The first threshold may be stored in an internal memory of the response propriety determination unit 19 or may be given from outside the speech recognition device 5.
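A minimal sketch of the correction and threshold decision follows; the linear correction formula and the constants are assumptions, since the text only requires that the corrected response degree decrease as the degree of conversation increases:

```python
# A minimal sketch of the response-degree correction and threshold decision.
# The linear correction formula is an illustrative assumption; the patent only
# states that the corrected response degree decreases as the conversation
# degree K increases.
FIRST_THRESHOLD = 0.5
ALPHA = 0.4  # hypothetical weight of the conversation degree

def directed_at_device(response_degree: float, conversation_degree_k: float) -> bool:
    corrected = response_degree - ALPHA * conversation_degree_k
    return corrected >= FIRST_THRESHOLD

print(directed_at_device(0.8, 0.2))  # -> True  (corrected 0.72)
print(directed_at_device(0.8, 1.0))  # -> False (corrected 0.40)
```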
In the speech recognition device 5 shown in FIG. 1, the response propriety determination unit 19 determines that the speaker's utterance is directed to the speech recognition device 5 if the corrected response degree is equal to or greater than the first threshold. However, this is only an example; the response propriety determination unit 19 may instead determine that the speaker's utterance is directed to the speech recognition device 5 if the degree of conversation indicated by the acquired conversation degree data is equal to or less than a second threshold, and determine that the utterance is not directed to the speech recognition device 5 if the degree of conversation is greater than the second threshold. The second threshold may be stored in an internal memory of the response propriety determination unit 19 or may be given from outside the speech recognition device 5.
The speech recognition unit 20 performs speech recognition on the sound of the vehicle interior collected by the microphone 2 and outputs speech recognition result data indicating the recognition result to the response data generation unit 21. The speech recognition result data may be text data or voice data.
If the response propriety determination unit 19 has determined that the speaker's utterance is directed to the speech recognition device 5, the response data generation unit 21 generates response data for the speech recognition result data output from the speech recognition unit 20.
The response data generation unit 21 outputs the response data to each of the in-vehicle device 6 and the output device 7.
The speech recognition device 5 shown in FIG. 1 includes, as components, the speaker identification unit 11, the speaker presence/absence determination unit 14, the conversation degree update unit 15, and the response unit 18. Any of these components may be distributed to a server device connected to a network, a mobile terminal, or the like. When a component is distributed to a server device or the like, the speech recognition device 5 needs to include a transmission/reception unit that transmits the data given to that component to the server device or the like and receives the data output from that component.
In FIG. 1, each of the speaker identification unit 11, the speaker presence/absence determination unit 14, the conversation degree update unit 15, and the response unit 18, which are the components of the speech recognition device 5, is assumed to be implemented by dedicated hardware as shown in FIG. 2. That is, the speech recognition device 5 is assumed to be implemented by the speaker identification circuit 31, the speaker presence/absence determination circuit 32, the conversation degree update circuit 33, and the response circuit 34.
Each of the speaker identification circuit 31, the speaker presence/absence determination circuit 32, the conversation degree update circuit 33, and the response circuit 34 corresponds to, for example, a single circuit, a composite circuit, a programmed processor, a parallel-programmed processor, an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a combination thereof.
The components of the speech recognition device 5 are not limited to implementations in dedicated hardware; the speech recognition device 5 may instead be implemented by software, firmware, or a combination of software and firmware.
The software or firmware is stored as a program in a memory of a computer. Here, "computer" means the hardware that executes the program, and corresponds to, for example, a CPU (Central Processing Unit), a central processing device, a processing device, an arithmetic device, a microprocessor, a microcomputer, a processor, or a DSP (Digital Signal Processor).
FIG. 3 is a hardware configuration diagram of a computer in the case where the speech recognition device 5 is implemented by software, firmware, or the like.
When the speech recognition device 5 is implemented by software, firmware, or the like, a program for causing the computer to execute the processing procedures of the speaker identification unit 11, the speaker presence/absence determination unit 14, the conversation degree update unit 15, and the response unit 18 is stored in the memory 41, and the processor 42 of the computer executes the program stored in the memory 41.
FIG. 2 shows an example in which each component of the speech recognition device 5 is implemented by dedicated hardware, and FIG. 3 shows an example in which the speech recognition device 5 is implemented by software, firmware, or the like. However, this is only an example; some components of the speech recognition device 5 may be implemented by dedicated hardware while the remaining components are implemented by software, firmware, or the like.
Next, the operation of the speech recognition device 5 shown in FIG. 1 will be described.
FIG. 4 is a flowchart showing the speech recognition method, which is the processing procedure of the speech recognition device 5 shown in FIG. 1.
FIGS. 5A and 5B are flowcharts showing the processing procedure of the conversation degree update unit 15 in the speech recognition device 5 shown in FIG. 1.
The camera 1 captures the interior of the vehicle and outputs video data representing the video of the vehicle interior to each of the occupant identification unit 12, the speaker presence/absence determination unit 14, and the response propriety determination unit 19.
The microphone 2 collects the sound of the vehicle interior and outputs sound data representing that sound to each of the occupant identification unit 12, the speaker presence/absence determination unit 14, the response propriety determination unit 19, and the speech recognition unit 20.
The in-vehicle sensor 3 includes pressure sensors, an infrared sensor, a GPS sensor, a gyro sensor, or the like, and outputs sensor information indicating its sensing results to each of the occupant identification unit 12 and the response propriety determination unit 19.
The navigation device 4 outputs setting data indicating the destination, route data indicating the route to the destination, voice guidance information, and the like to the response propriety determination unit 19.
The occupant identification unit 12 acquires a sensing signal from the pressure sensor that is sensing an occupant's weight among the plurality of pressure sensors included in the in-vehicle sensor 3.
By acquiring the sensing signal, the occupant identification unit 12 identifies which of the plurality of pressure sensors is outputting a sensing signal.
The occupant identification unit 12 determines that an occupant is seated in the seat where the identified pressure sensor is installed. If the identified pressure sensor is installed, for example, in the driver's seat, the occupant identification unit 12 determines that an occupant is sitting in the driver's seat; if it is installed, for example, in the front passenger seat, the occupant identification unit 12 determines that an occupant is sitting in the front passenger seat.
Next, the occupant identification unit 12 acquires the video data output from the camera 1.
From the video of the vehicle interior indicated by the video data, the occupant identification unit 12 cuts out, from the region that includes the seat in which each occupant is seated, the face image in which each occupant's face appears.
The occupant identification unit 12 performs personal authentication of each occupant by analyzing each face image, and outputs the identification information of each occupant to the speaker identification processing unit 13 (step ST1 in FIG. 4).
The occupant identification unit 12 also outputs each occupant's face image to the speaker identification processing unit 13.
In the speech recognition device 5 shown in FIG. 1, the occupant identification unit 12 performs personal authentication of each occupant based on the video of the vehicle interior. However, this is only an example; if every occupant present in the vehicle interior is speaking, the occupant identification unit 12 may perform personal authentication of each occupant based on the sound of the vehicle interior indicated by the sound data output from the microphone 2.
That is, the occupant identification unit 12 may extract each occupant's voice from the sound of the vehicle interior indicated by the sound data and perform voiceprint authentication of each voice, thereby performing personal authentication of each occupant.
The speaker identification processing unit 13 acquires the identification information and the face image of each occupant from the occupant identification unit 12.
The speaker identification processing unit 13 searches for an occupant whose mouth is moving by analyzing each face image, and identifies the occupant whose mouth is moving as the speaker (step ST2 in FIG. 4).
In the speech recognition device 5 shown in FIG. 1, the speaker identification processing unit 13 identifies the speaker based on the video of the vehicle interior. However, this is only an example; the speaker identification processing unit 13 may identify the speaker based on the sound in the vehicle interior indicated by the sound data output from the microphone 2.
That is, when a microphone 2 is installed at each seat, for example, the speaker identification processing unit 13 may identify as the speaker the occupant sitting in the seat whose microphone 2 collected the loudest voice among the plurality of microphones 2.
When a single microphone 2 is installed in the vehicle interior, for example, the speaker identification processing unit 13 may identify the speaker from the direction of arrival of the voice at the microphone 2.
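The loudest-microphone fallback can be sketched as follows; the seat names and level values are hypothetical, and a real implementation would read the levels from the seat microphones.

```python
# A minimal sketch of the "loudest microphone" speaker identification
# described above. The per-seat levels are illustrative assumptions.

def identify_speaker_by_loudness(levels_by_seat):
    """levels_by_seat: dict mapping seat name -> measured voice level (e.g. dBFS).
    Returns the seat whose microphone collected the loudest voice."""
    return max(levels_by_seat, key=levels_by_seat.get)

# e.g. identify_speaker_by_loudness({"driver": -20.5, "front_passenger": -9.8})
# -> "front_passenger"
```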
After identifying the speaker, the speaker identification processing unit 13 outputs the identification information of the speaker to each of the talker presence/absence determination unit 14, the conversation degree update processing unit 17, and the response propriety determination unit 19.
If the speaker identification processing unit 13 identifies a plurality of speakers, it outputs the identification information of each speaker to each of the talker presence/absence determination unit 14, the conversation degree update processing unit 17, and the response propriety determination unit 19.
If a plurality of pieces of identification information are output from the speaker identification processing unit 13 as the speaker identification information (step ST3 in FIG. 4: YES), the talker presence/absence determination unit 14 determines that there is a talker conversing with the speaker (step ST4 in FIG. 4).
If a single piece of identification information is output from the speaker identification processing unit 13 as the speaker identification information (step ST3 in FIG. 4: NO), the talker presence/absence determination unit 14 determines that there is no talker conversing with the speaker (step ST5 in FIG. 4).
The talker presence/absence determination unit 14 outputs to the conversation degree update processing unit 17 a determination result indicating whether or not there is a talker conversing with the speaker.
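This determination reduces to checking whether more than one speaker was identified, as the following sketch shows; the function name is illustrative.

```python
# A minimal sketch of the talker presence/absence determination: a conversation
# partner is assumed to exist exactly when more than one speaker was identified.

def talker_exists(speaker_ids):
    """speaker_ids: identification information of the identified speakers."""
    return len(set(speaker_ids)) > 1

# e.g. talker_exists(["C1"]) -> False; talker_exists(["C1", "C2"]) -> True
```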
The conversation degree update processing unit 17 acquires the speaker's identification information from the speaker identification processing unit 13, and acquires the determination result indicating whether or not a talker exists from the talker presence/absence determination unit 14.
The conversation degree update processing unit 17 acquires, from the past conversation degree data stored in the conversation degree data storage unit 16, the past conversation degree data for the speaker indicated by the acquired identification information.
If no past conversation degree data for the speaker indicated by the identification information is stored in the conversation degree data storage unit 16, the conversation degree update processing unit 17 initializes the degree of conversation K indicated by the conversation degree data for that speaker. The initial value of the degree of conversation K is, for example, 1.
If the acquired determination result indicates that a talker exists, the conversation degree update processing unit 17 updates the acquired conversation degree data so as to increase the degree of conversation K (step ST6 in FIG. 4).
If the acquired determination result indicates that no talker exists, the conversation degree update processing unit 17 updates the acquired conversation degree data so as to lower the degree of conversation K (step ST7 in FIG. 4).
The conversation degree update processing unit 17 stores the updated conversation degree data in the conversation degree data storage unit 16.
The update processing of the conversation degree data by the conversation degree update processing unit 17 is described concretely below.
Upon acquiring the speaker's identification information from the speaker identification processing unit 13, the conversation degree update processing unit 17 causes two counters (1) and (2) (not shown) to start counting (step ST21 in FIG. 5A, step ST31 in FIG. 5B). The counters (1) and (2) may be included in the conversation degree update processing unit 17, or may be provided outside the conversation degree update processing unit 17.
The internal memory of the conversation degree update processing unit 17 stores a first set value and a second set value, where the first set value < the second set value.
If, after causing counter (1) to start counting, the conversation degree update processing unit 17 receives a determination result indicating that a talker exists before the count value of counter (1) reaches the first set value (step ST22 in FIG. 5A: YES), it updates the conversation degree data so as to increase the degree of conversation K (step ST23 in FIG. 5A).
That is, the conversation degree update processing unit 17 increases the degree of conversation K by adding a degree change value CH to the current degree of conversation K, for example, as shown in the following equation (1). The degree change value CH is a preset value, for example 0.1. The degree change value CH may be stored in the internal memory of the conversation degree update processing unit 17, or may be given from outside the speech recognition device 5.

Degree of conversation K after update = current degree of conversation K + degree change value CH   (1)
If, after the count value of counter (1) reaches the first set value, the conversation degree update processing unit 17 does not receive a determination result indicating that a talker exists even when the count value of counter (2) reaches the second set value (step ST32 in FIG. 5B: YES), it updates the conversation degree data so as to lower the degree of conversation K (step ST33 in FIG. 5B).
That is, the conversation degree update processing unit 17 lowers the degree of conversation K by subtracting the degree change value CH from the current degree of conversation K, for example, as shown in the following equation (2) (step ST34 in FIG. 5B).

Degree of conversation K after update = current degree of conversation K − degree change value CH   (2)
If the conversation degree update processing unit 17 does not receive a determination result indicating that a talker exists before the count value of counter (1) reaches the first set value (step ST22 in FIG. 5A: NO), it does not update the conversation degree data.
The conversation degree update processing unit 17 then resets the count value of counter (1) to zero (step ST24 in FIG. 5A).
If the conversation degree update processing unit 17 receives a determination result indicating that a talker exists before the count value of counter (2) reaches the second set value (step ST32 in FIG. 5B: NO), it does not update the conversation degree data.
The conversation degree update processing unit 17 then resets the count value of counter (2) to zero (step ST34 in FIG. 5B).
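The counter-based update can be summarized in the following sketch, which also folds in the optional upper and lower limits discussed in the next paragraphs. The set values, the change value CH, the clamp bounds, and the class interface are illustrative assumptions; the actual device drives counters (1) and (2) from its own clock.

```python
# A minimal sketch of the counter-based update of the degree of conversation K
# described above (equations (1) and (2)), with optional clamping. All numeric
# values and the interface are illustrative assumptions.

class ConversationDegreeUpdater:
    def __init__(self, first_set_value=2.0, second_set_value=10.0,
                 ch=0.1, k_init=1.0, k_min=0.0, k_max=2.0):
        self.first_set_value = first_set_value    # threshold for counter (1)
        self.second_set_value = second_set_value  # threshold for counter (2)
        self.ch = ch                               # degree change value CH
        self.k_min, self.k_max = k_min, k_max      # optional lower/upper limits
        self.k_init = k_init
        self.k_by_speaker = {}                     # conversation degree data storage

    def update(self, speaker_id, time_until_talker_detected):
        """time_until_talker_detected: elapsed count when a talker-exists result
        arrived, or None if none arrived before the second set value."""
        k = self.k_by_speaker.get(speaker_id, self.k_init)
        t = time_until_talker_detected
        if t is not None and t < self.first_set_value:
            k = min(self.k_max, k + self.ch)   # equation (1): raise K, clamped
        elif t is None:
            k = max(self.k_min, k - self.ch)   # equation (2): lower K, clamped
        # otherwise (talker detected between the two set values): K unchanged
        self.k_by_speaker[speaker_id] = k
        return k
```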
In the speech recognition device 5 shown in FIG. 1, after causing counters (1) and (2) to start counting, the conversation degree update processing unit 17 updates the conversation degree data so as to increase the degree of conversation K if it receives a determination result indicating that a talker exists before the count value of counter (1) reaches the first set value. However, if the current degree of conversation K has already reached an upper limit value, the conversation degree update processing unit 17 may refrain from updating the conversation degree data even when such a determination result is received before the count value of counter (1) reaches the first set value. This prevents the degree of conversation K from becoming so large that, even when no talker has been present for a long period, the corrected response degree P' described later has difficulty reaching the first threshold.
Also, in the speech recognition device 5 shown in FIG. 1, the conversation degree update processing unit 17 updates the conversation degree data so as to lower the degree of conversation K if it does not receive a determination result indicating that a talker exists even when the count value of counter (2) reaches the second set value. However, if the current degree of conversation K has already reached a lower limit value, the conversation degree update processing unit 17 may refrain from updating the conversation degree data even when no such determination result has been received by the time the count value of counter (2) reaches the second set value. This prevents the degree of conversation K from becoming so small that, even when a talker has been present for a long period, the corrected response degree P' has difficulty falling below the first threshold.
Each of the upper limit value and the lower limit value may be stored in the internal memory of the conversation degree update processing unit 17, or may be given from outside the speech recognition device 5.
The response propriety determination unit 19 acquires the speaker's identification information from the speaker identification processing unit 13.
The response propriety determination unit 19 acquires, from the conversation degree data stored in the conversation degree data storage unit 16, the conversation degree data of the speaker indicated by the acquired identification information.
In the speech recognition device 5 shown in FIG. 1, the response propriety determination unit 19 acquires the conversation degree data after it has been updated by the conversation degree update processing unit 17. However, the conversation degree data acquired by the response propriety determination unit 19 need only be conversation degree data for the speaker; the response propriety determination unit 19 may instead acquire the conversation degree data for the speaker from before it is updated by the conversation degree update processing unit 17.
Based on the video of the vehicle interior, which is the space captured by the camera 1, the sound of the vehicle interior collected by the microphone 2, the traveling position data output from the in-vehicle sensor 3, the voice guidance information output from the navigation device 4, and the like, the response propriety determination unit 19 calculates a response degree P, which is an index for determining whether or not the speaker's utterance is directed to the speech recognition device 5 (step ST8 in FIG. 4).
Then, the response propriety determination unit 19 corrects the response degree P by dividing it by the degree of conversation K indicated by the conversation degree data, as shown in the following equation (3) (step ST9 in FIG. 4).

P' = P / K   (3)

In equation (3), P' is the corrected response degree, which becomes smaller as the degree of conversation K becomes larger.
Any calculation process may be used to calculate the response degree P itself; for example, the calculation process disclosed in the above-mentioned Patent Document 1 may be used. In Patent Document 1, a likelihood score corresponding to the response degree P is calculated based on the context of a series of user utterances.
For example, a sound collected by the microphone 2 within a certain time after the navigation device 4 outputs voice guidance information is likely to be a response to the voice guidance information, and unlikely to be a response to another speaker's utterance. For this reason, when a sound is collected by the microphone 2 within the certain time after the voice guidance information is output from the navigation device 4, the response propriety determination unit 19 calculates a large response degree P.
On the other hand, when a sound is collected by the microphone 2 after the certain time has elapsed since the voice guidance information was output from the navigation device 4, it calculates a response degree P smaller than the above response degree P.
For example, when sound is being collected continuously by the microphone 2, the sound collected by the microphone 2 is likely to be sound reproduced by audio equipment, noise, or the like. When sound is being collected continuously by the microphone 2, the response propriety determination unit 19 calculates a small response degree P.
On the other hand, when sound is being collected intermittently by the microphone 2, it calculates a response degree P larger than the above response degree P.
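These heuristics might be combined as in the following sketch. The base score, bonuses, penalties, and time window are illustrative assumptions, since the calculation of the response degree P is deliberately left open.

```python
# A minimal sketch of the heuristics for the response degree P described above.
# All concrete values are illustrative assumptions.

def response_degree(seconds_since_guidance, sound_is_continuous,
                    guidance_window=5.0):
    p = 0.5  # assumed base score
    if seconds_since_guidance is not None and seconds_since_guidance <= guidance_window:
        p += 0.3  # utterance soon after voice guidance: likely addressed to the device
    if sound_is_continuous:
        p -= 0.3  # continuous sound: likely audio playback or noise
    else:
        p += 0.1  # intermittent sound: more likely speech
    return max(0.0, p)
```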
The response propriety determination unit 19 compares the corrected response degree P' with a first threshold.
If the corrected response degree P' is equal to or greater than the first threshold (step ST10 in FIG. 4: YES), the response propriety determination unit 19 determines that the speaker's utterance is directed to the speech recognition device 5 (step ST11 in FIG. 4).
If the corrected response degree P' is less than the first threshold (step ST10 in FIG. 4: NO), the response propriety determination unit 19 determines that the speaker's utterance is not directed to the speech recognition device 5 (step ST12 in FIG. 4).
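The correction by equation (3) and the threshold decision can be sketched as follows; the threshold value is an illustrative assumption.

```python
# A minimal sketch of equation (3) and the first-threshold decision described
# above. The threshold value is an assumption.

FIRST_THRESHOLD = 0.4  # assumed value

def utterance_is_for_device(p, k, threshold=FIRST_THRESHOLD):
    """p: response degree P; k: degree of conversation K (initial value 1)."""
    p_corrected = p / k  # equation (3): P' shrinks as K grows
    return p_corrected >= threshold
```

With this correction, a speaker with a high degree of conversation (say K = 2.0) needs roughly twice the raw response degree P to be judged as addressing the device.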
In the speech recognition device 5 shown in FIG. 1, the response propriety determination unit 19 determines that the speaker's utterance is directed to the speech recognition device 5 if the corrected response degree P' is equal to or greater than the first threshold. However, this is only an example; the response propriety determination unit 19 may instead determine that the speaker's utterance is directed to the speech recognition device 5 if the degree of conversation K indicated by the acquired conversation degree data is equal to or less than a second threshold, and determine that the speaker's utterance is not directed to the speech recognition device 5 if the degree of conversation K is greater than the second threshold.
The voice recognition unit 20 acquires the sound data output from the microphone 2.
The voice recognition unit 20 performs voice recognition on the sound indicated by the acquired sound data, and outputs voice recognition result data indicating the recognition result to the response data generation unit 21.
If the response propriety determination unit 19 has determined that the speaker's utterance is directed to the speech recognition device 5, the response data generation unit 21 generates response data for the voice recognition result data output from the voice recognition unit 20 (step ST13 in FIG. 4).
If the response propriety determination unit 19 has determined that the speaker's utterance is not directed to the speech recognition device 5, the response data generation unit 21 does not generate response data for the voice recognition result data.
This reduces the possibility that the response data generation unit 21 generates response data after misrecognizing a conversation between users as an utterance directed to the speech recognition device 5. It also reduces the possibility that the response data generation unit 21 fails to generate response data after misrecognizing an utterance directed to the speech recognition device 5 as a conversation between users.
The process itself of generating response data for voice recognition result data is a known technique, so a detailed description is omitted.
If the voice recognition result data indicates, for example, "it's cold", the response data generation unit 21 generates response data whose response content instructs the air conditioner, which is the in-vehicle device 6, to, for example, "raise the set temperature by one degree".
If the voice recognition result data indicates, for example, "the volume is low", the response data generation unit 21 generates response data whose response content instructs the audio equipment, which is the in-vehicle device 6, to, for example, "raise the playback volume".
If the voice recognition result data indicates, for example, "set the destination to XX coast", the response data generation unit 21 generates response data whose response content instructs the navigation device, which is the in-vehicle device 6, to, for example, "set the destination to XX coast".
The response data generation unit 21 outputs the response data to each of the in-vehicle device 6 and the output device 7.
The in-vehicle device 6 acquires the response data output from the response data generation unit 21.
The in-vehicle device 6 operates according to the response data. If the in-vehicle device 6 is an air conditioner and the response data indicates, for example, the response content "raise the set temperature by one degree", the air conditioner as the in-vehicle device 6 operates so as to raise the set temperature by one degree.
If the in-vehicle device 6 is audio equipment and the response data indicates, for example, the response content "raise the playback volume", the audio equipment as the in-vehicle device 6 operates so as to raise the playback volume.
If the in-vehicle device 6 is a navigation device and the response data indicates, for example, the response content "set the destination to XX coast", the navigation device as the in-vehicle device 6 operates so as to set the destination to XX coast.
The output device 7 outputs the response content indicated by the response data output from the response data generation unit 21.
If the output device 7 is, for example, a display, the display as the output device 7 displays the response content indicated by the response data. If the output device 7 is, for example, a lighting device, the lighting device as the output device 7 changes the color of the lighting or the like so that it can be seen that response data has been output from the speech recognition device 5.
If the output device 7 is, for example, a speaker, the speaker as the output device 7 outputs the response content indicated by the response data as voice.
In the speech recognition device 5 shown in FIG. 1, the talker presence/absence determination unit 14 determines whether or not there is a talker conversing with the speaker based on the speaker identification information output from the speaker identification processing unit 13. However, this is only an example; the talker presence/absence determination unit 14 may instead analyze the movements of users other than the speaker based on the video of the vehicle interior indicated by the video data output from the camera 1, and determine, based on the analysis results, whether or not a user other than the speaker is conversing with the speaker.
The determination processing of the talker presence/absence determination unit 14 based on movement analysis results is described concretely below.
The talker presence/absence determination unit 14 acquires the speaker's identification information from the speaker identification processing unit 13, and acquires video data output from the camera 1. This video data is temporally later than the video data output from the camera 1 to the occupant identification unit 12.
The talker presence/absence determination unit 14 cuts out, from the video of the vehicle interior indicated by the video data, the face images of the users other than the speaker indicated by the identification information, and identifies a user whose mouth is moving by analyzing each user's face image.
If there is a user whose mouth is moving, the talker presence/absence determination unit 14 determines that there is a talker conversing with the speaker.
If there is no user whose mouth is moving, the talker presence/absence determination unit 14 determines that there is no talker conversing with the speaker.
Here, the talker presence/absence determination unit 14 identifies a user whose mouth is moving by analyzing the face images of the users other than the speaker. However, this is only an example; the talker presence/absence determination unit 14 may instead identify, by analyzing the users' face images, a user who is nodding, a user who is shaking his or her head, a user who is looking at the speaker, or the like.
If such a user, for example a nodding user, exists, the talker presence/absence determination unit 14 determines that there is a talker conversing with the speaker; if no such user exists, it determines that there is no talker conversing with the speaker.
In the speech recognition device 5 shown in FIG. 1, the response propriety determination unit 19 corrects the response degree P based on the degree of conversation K indicated by the conversation degree data, as shown in equation (3). However, this is only an example; the response propriety determination unit 19 may instead correct the response degree P by subtracting the degree of conversation K indicated by the conversation degree data from the response degree P, as shown in the following equation (4).

P' = P − K   (4)

When the response degree P is corrected by equation (4), the conversation degree update processing unit 17 sets, for example, 0 as the initial value of the degree of conversation K indicated by the speaker's conversation degree data.
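The subtractive variant can be sketched as follows; the threshold value is again an illustrative assumption.

```python
# A minimal sketch of the subtractive correction of equation (4). With this
# variant K starts at 0 instead of 1, so a speaker with no conversation
# history keeps the raw response degree P unchanged. Threshold is assumed.

def utterance_is_for_device_subtractive(p, k, threshold=0.4):
    return (p - k) >= threshold  # equation (4): P' = P - K
```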
In the speech recognition device 5 shown in FIG. 1, every time one of the occupants speaks, the occupant identification unit 12 performs personal authentication of each occupant before the speaker identification processing unit 13 identifies the speaker. However, this is only an example; if one of the occupants speaks but there has been no change in any occupant's seat, the speaker identification processing unit 13 may identify the speaker without the occupant identification unit 12 performing personal authentication of each occupant again. That is, the occupant identification unit 12 may identify the position of the seat in which each occupant is sitting based on the video data output from the camera 1, and perform personal authentication of each occupant again only when there is a change in one of the occupants' seats. A change in an occupant's seat can occur, for example, when an occupant gets into or out of the vehicle.
In the first embodiment described above, the speech recognition device 5 is configured to include: a speaker identification unit 11 that identifies, based on the video of the space captured by the camera 1 or the sound of the space collected by the microphone 2, the speaker, that is, the user who is speaking among the plurality of users present in the space; and a response unit 18 that acquires conversation degree data indicating the past degree of conversation between the speaker and the users other than the speaker identified by the speaker identification unit 11 among the plurality of users, determines, based on the conversation degree data, whether or not the speaker's utterance is directed to the speech recognition device 5, and generates response data for the speaker's utterance only when it determines that the utterance is directed to the speech recognition device 5. Therefore, when a user's utterance is directed to the speech recognition device 5, the speech recognition device 5 can make the probability of misrecognizing it as a conversation between users lower than that of the speech recognition device disclosed in Patent Document 1.
Embodiment 2.
Embodiment 2 describes a speech recognition device 5 that includes a travel purpose prediction unit 22 that predicts the travel purpose of the vehicle from the destination set in the navigation device 4 or from the travel route of the vehicle.
FIG. 6 is a configuration diagram showing the speech recognition device 5 according to Embodiment 2.
FIG. 7 is a hardware configuration diagram showing the hardware of the speech recognition device 5 according to Embodiment 2.
In FIGS. 6 and 7, the same reference numerals as those in FIGS. 1 and 2 denote the same or corresponding parts, so their description is omitted.
The travel purpose prediction unit 22 is realized by, for example, the travel purpose prediction circuit 35 shown in FIG. 7.
The travel purpose prediction unit 22 acquires destination setting data indicating the destination set in the navigation device 4, or travel route data indicating the travel route of the vehicle recorded in the navigation device 4.
The travel purpose prediction unit 22 predicts the travel purpose of the vehicle from the destination indicated by the destination setting data or from the travel route indicated by the travel route data.
The travel purpose prediction unit 22 outputs travel purpose data indicating the predicted travel purpose of the vehicle to the conversation degree update unit 23.
In the speech recognition device 5 shown in FIG. 6, the travel purpose prediction unit 22 acquires the travel route data from the navigation device 4. However, this is only an example; when the in-vehicle sensor 3 is realized by, for example, a GPS sensor, the travel purpose prediction unit 22 may acquire GPS data output from the GPS sensor and identify the travel route of the vehicle from the GPS data. When the in-vehicle sensor 3 is realized by, for example, a gyro sensor, the travel purpose prediction unit 22 may acquire angular velocity data output from the gyro sensor and identify the travel route of the vehicle from the angular velocity data.
The conversation degree update unit 23 is realized by, for example, the conversation degree update circuit 36 shown in FIG. 7.
The conversation degree update unit 23 includes a conversation degree data storage unit 24 and a conversation degree update processing unit 25.
The conversation degree update unit 23 acquires the speaker's identification information from the speaker identification unit 11, and acquires the travel purpose data from the travel purpose prediction unit 22.
The conversation degree update unit 23 acquires, from among a plurality of pieces of conversation degree data per travel purpose indicating the past degree of conversation of the speaker indicated by the identification information, the conversation degree data for the travel purpose indicated by the travel purpose data.
In the speech recognition device 5 shown in FIG. 6, the conversation degree update unit 23 acquires the conversation degree data from the internal conversation degree data storage unit 24. However, this is only an example; the conversation degree update unit 23 may acquire the conversation degree data from outside the speech recognition device 5.
If the talker presence/absence determination unit 14 determines that a talker exists, the conversation degree update unit 23 updates the acquired conversation degree data so as to increase the degree of conversation.
If the talker presence/absence determination unit 14 determines that no talker exists, the conversation degree update unit 23 updates the acquired conversation degree data so as to lower the degree of conversation.
The conversation degree data storage unit 24 is a storage medium that stores a plurality of pieces of conversation degree data per travel purpose. When a plurality of occupants are present in the vehicle interior, each occupant can become the speaker, so a plurality of pieces of conversation degree data per travel purpose are stored for each occupant. Travel purposes include, for example, leisure, business, and shopping. If there are, for example, three occupants in the vehicle interior and, for example, four travel purposes, the conversation degree data storage unit 24 stores 12 (= 3 × 4) pieces of conversation degree data.
The conversation degree update processing unit 25 acquires the speaker's identification information from the speaker identification processing unit 13, and acquires the travel purpose data from the travel purpose prediction unit 22.
The conversation degree update processing unit 25 acquires, from among the plurality of pieces of conversation degree data per travel purpose stored in the conversation degree data storage unit 24, the conversation degree data that is the conversation degree data of the speaker indicated by the identification information output from the speaker identification processing unit 13 and that is for the travel purpose indicated by the travel purpose data.
If the determination result output from the talker presence/absence determination unit 14 indicates that a talker exists, the conversation degree update processing unit 25 updates the acquired conversation degree data so as to increase the degree of conversation.
If the determination result output from the talker presence/absence determination unit 14 indicates that no talker exists, the conversation degree update processing unit 25 updates the acquired conversation degree data so as to lower the degree of conversation.
The conversation degree update processing unit 25 stores the updated conversation degree data in the conversation degree data storage unit 24.
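Storing conversation degree data per speaker and per travel purpose can be sketched as a keyed map; the layout below is an illustrative assumption.

```python
# A minimal sketch of conversation degree data keyed by speaker and travel
# purpose, as described above. The storage layout is an assumption.

conversation_degree = {}  # (speaker_id, travel_purpose) -> degree of conversation K

def get_degree(speaker_id, purpose, k_init=1.0):
    return conversation_degree.setdefault((speaker_id, purpose), k_init)

# e.g. three occupants x four travel purposes gives at most 12 stored entries,
# matching the example in the text.
```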
In FIG. 6, it is assumed that each of the speaker identification unit 11, the talker presence/absence determination unit 14, the travel purpose prediction unit 22, the conversation degree update unit 23, and the response unit 18, which are the components of the speech recognition device 5, is realized by dedicated hardware as shown in FIG. 7. That is, it is assumed that the speech recognition device 5 is realized by the speaker identification circuit 31, the talker presence/absence determination circuit 32, the travel purpose prediction circuit 35, the conversation degree update circuit 36, and the response circuit 34.
Each of the speaker identification circuit 31, the talker presence/absence determination circuit 32, the travel purpose prediction circuit 35, the conversation degree update circuit 36, and the response circuit 34 corresponds to, for example, a single circuit, a composite circuit, a programmed processor, a parallel-programmed processor, an ASIC, an FPGA, or a combination thereof.
The components of the speech recognition device 5 are not limited to those realized by dedicated hardware; the speech recognition device 5 may be realized by software, firmware, or a combination of software and firmware.
When the speech recognition device 5 is realized by software, firmware, or the like, a program for causing a computer to execute the processing procedures of the speaker identification unit 11, the talker presence/absence determination unit 14, the travel purpose prediction unit 22, the conversation degree update unit 23, and the response unit 18 is stored in the memory 41 shown in FIG. 3. The processor 42 shown in FIG. 3 then executes the program stored in the memory 41.
FIG. 7 shows an example in which each of the components of the speech recognition device 5 is realized by dedicated hardware, and FIG. 3 shows an example in which the speech recognition device 5 is realized by software, firmware, or the like. However, this is only an example; some of the components of the speech recognition device 5 may be realized by dedicated hardware, and the remaining components may be realized by software, firmware, or the like.
Next, the operation of the speech recognition device 5 shown in FIG. 6 is described.
Except for the travel purpose prediction unit 22 and the conversation degree update unit 23, the device is the same as the speech recognition device 5 shown in FIG. 1, so the following mainly describes the operation of the travel purpose prediction unit 22 and the conversation degree update unit 23.
If a destination has been set, the navigation device 4 outputs destination setting data indicating the destination to the travel purpose prediction unit 22.
The travel purpose prediction unit 22 acquires the destination setting data from the navigation device 4.
The travel purpose prediction unit 22 predicts the travel purpose of the vehicle from the destination indicated by the destination setting data.
If the destination is a leisure facility such as an amusement park or a ball game stadium, the travel purpose prediction unit 22 predicts that the travel purpose of the vehicle is leisure.
If the destination is a business facility such as an office building or a factory, the travel purpose prediction unit 22 predicts that the travel purpose of the vehicle is business.
If the destination is a shopping facility such as a department store or a supermarket, the travel purpose prediction unit 22 predicts that the travel purpose of the vehicle is shopping.
The travel purpose prediction unit 22 outputs travel purpose data indicating the travel purpose of the vehicle to the conversation degree update unit 23.
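This destination-based prediction amounts to a category-to-purpose lookup, as the following sketch illustrates; the facility categories and their mapping are hypothetical, and a real navigation device would supply its own POI categories.

```python
# A minimal sketch of the destination-based travel purpose prediction
# described above. Categories and mapping are illustrative assumptions.

PURPOSE_BY_CATEGORY = {
    "amusement_park": "leisure",
    "stadium": "leisure",
    "office_building": "business",
    "factory": "business",
    "department_store": "shopping",
    "supermarket": "shopping",
}

def predict_travel_purpose(destination_category):
    return PURPOSE_BY_CATEGORY.get(destination_category, "unknown")
```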
If no destination is set in the navigation device 4, the travel purpose prediction unit 22 acquires, from the navigation device 4, travel route data indicating the travel route of the vehicle.
When the in-vehicle sensor 3 is realized by, for example, a GPS sensor, the travel purpose prediction unit 22 may acquire GPS data from the GPS sensor and identify the travel route of the vehicle from the GPS data, instead of acquiring the travel route data from the navigation device 4.
When the in-vehicle sensor 3 is realized by, for example, a gyro sensor, the travel purpose prediction unit 22 may acquire angular velocity data from the gyro sensor and identify the travel route of the vehicle from the angular velocity data.
The travel purpose prediction unit 22 gives the travel route data to a learning model (not shown) and acquires, from the learning model, travel purpose data indicating the travel purpose of the vehicle.
The travel purpose prediction unit 22 outputs the travel purpose data to the conversation degree update unit 23.
The learning model has machine-learned the travel purpose of the vehicle using travel route data indicating vehicle travel routes and teacher data indicating vehicle travel purposes. When given travel route data, the trained learning model outputs travel purpose data indicating the travel purpose of the vehicle.
In the speech recognition device 5 shown in FIG. 6, the travel purpose prediction unit 22 predicts the travel purpose of the vehicle using the learning model. However, this is only an example; the travel purpose prediction unit 22 may predict the travel purpose of the vehicle using, for example, a rule base.
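As an illustration of such a learning model, the following sketch trains a small classifier on hypothetical route features. The use of scikit-learn, the feature encoding, and the labels are purely assumptions; the patent does not specify the model or its inputs.

```python
# A minimal sketch of a route-based travel purpose classifier. Library,
# features, and labels are illustrative assumptions only.

from sklearn.tree import DecisionTreeClassifier

# Assumed features per trip: [distance_km, day_of_week, departure_hour]
X_train = [[5.2, 5, 10], [18.0, 1, 8], [3.1, 6, 14]]
y_train = ["shopping", "business", "leisure"]  # teacher data (travel purposes)

model = DecisionTreeClassifier().fit(X_train, y_train)
purpose = model.predict([[17.5, 2, 8]])[0]  # predicted travel purpose data
```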
The conversation degree update processing unit 25 acquires the speaker's identification information from the speaker identification processing unit 13, and acquires the travel purpose data from the travel purpose prediction unit 22.
The conversation degree update processing unit 25 acquires, from among the plurality of pieces of conversation degree data per travel purpose stored in the conversation degree data storage unit 24, the conversation degree data that is the conversation degree data of the speaker indicated by the identification information and that is for the travel purpose indicated by the travel purpose data.
The conversation degree update processing unit 25 acquires, from the talker presence/absence determination unit 14, the determination result indicating whether or not there is a talker conversing with the speaker.
If the determination result indicates that a talker exists, the conversation degree update processing unit 25 updates the acquired conversation degree data so as to increase the degree of conversation K.
If the determination result indicates that no talker exists, the conversation degree update processing unit 25 updates the acquired conversation degree data so as to lower the degree of conversation K.
The update processing of the conversation degree data by the conversation degree update processing unit 25 is the same as the update processing of the conversation degree data by the conversation degree update processing unit 17 shown in FIG. 1, so a detailed description of the update processing is omitted.
The conversation degree update processing unit 25 stores the updated conversation degree data in the conversation degree data storage unit 24.
In the second embodiment described above, the space is the passenger compartment of a vehicle, the plurality of users present in the space are a plurality of occupants riding in the vehicle, and the speech recognition device 5 shown in FIG. 6 is configured to include the travel purpose prediction unit 22 that predicts the travel purpose of the vehicle from the destination set in the navigation device 4 or from the travel route of the vehicle. In addition, the conversation degree update unit 23 of the speech recognition device 5 is configured to acquire, from among a plurality of pieces of conversation degree data per travel purpose indicating the past degree of conversation of the speaker, the conversation degree data for the travel purpose predicted by the travel purpose prediction unit 22, to update the acquired conversation degree data so as to increase the degree of conversation if the talker presence/absence determination unit 14 determines that a talker exists, and to update the acquired conversation degree data so as to lower the degree of conversation if the talker presence/absence determination unit 14 determines that no talker exists. Therefore, when a user's utterance is directed to the speech recognition device 5, the speech recognition device 5 shown in FIG. 6 can make the probability of misrecognizing it as a conversation between users even lower than that of the speech recognition device 5 shown in FIG. 1.
Embodiment 3.
Embodiment 3 describes a speech recognition device 5 that includes a conversation degree update unit 26 that acquires, from among a plurality of pieces of conversation degree data per seat position indicating the past degree of conversation of the speaker, the conversation degree data for the seat position identified by the speaker identification unit 11.
FIG. 8 is a configuration diagram showing the speech recognition device 5 according to Embodiment 3.
FIG. 9 is a hardware configuration diagram showing the hardware of the speech recognition device 5 according to Embodiment 3.
In FIGS. 8 and 9, the same reference numerals as those in FIGS. 1 and 2 denote the same or corresponding parts, so their description is omitted.
The conversation degree update unit 26 is realized by, for example, the conversation degree update circuit 37 shown in FIG. 9.
The conversation degree update unit 26 includes a conversation degree data storage unit 27 and a conversation degree update processing unit 28.
The conversation degree update unit 26 acquires, from the speaker identification unit 11, seat position data indicating the positions of the seats in which the respective occupants are sitting, and the speaker's identification information.
The conversation degree update unit 26 acquires, from among a plurality of pieces of conversation degree data per seat position indicating the past degree of conversation of the speaker indicated by the identification information, the conversation degree data for the seat position indicated by the seat position data.
In the speech recognition device 5 shown in FIG. 8, the conversation degree update unit 26 acquires the conversation degree data from the internal conversation degree data storage unit 27. However, this is only an example; the conversation degree update unit 26 may acquire the conversation degree data from outside the speech recognition device 5.
If the talker presence/absence determination unit 14 determines that a talker exists, the conversation degree update unit 26 updates the acquired conversation degree data so as to increase the degree of conversation.
If the talker presence/absence determination unit 14 determines that no talker exists, the conversation degree update unit 26 updates the acquired conversation degree data so as to lower the degree of conversation.
The conversation degree data storage unit 27 is a storage medium that stores a plurality of pieces of conversation degree data per seat position. When a plurality of occupants are present in the vehicle interior, each occupant can become the speaker, so a plurality of pieces of conversation degree data per seat position are stored for each occupant.
Suppose, for example, that there are three occupants in the vehicle interior, namely C1, C2, and C3. In this case, if there are, for example, a pattern P1 in which occupant C1 sits in the driver's seat, occupant C2 sits in the front passenger seat, and occupant C3 sits in the rear seat, and a pattern P2 in which occupant C1 sits in the driver's seat, occupant C2 sits in the rear seat, and occupant C3 sits in the front passenger seat, the conversation degree data storage unit 27 stores, as the conversation degree data for occupant C1, the conversation degree data for pattern P1 and the conversation degree data for pattern P2.
Further, if there are, for example, a pattern P3 in which occupant C2 sits in the driver's seat, occupant C1 sits in the front passenger seat, and occupant C3 sits in the rear seat, and a pattern P4 in which occupant C2 sits in the driver's seat, occupant C1 sits in the rear seat, and occupant C3 sits in the front passenger seat, the conversation degree data storage unit 27 stores, as the conversation degree data for occupant C2, the conversation degree data for pattern P1, the conversation degree data for pattern P2, the conversation degree data for pattern P3, and the conversation degree data for pattern P4.
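Keying the stored conversation degrees by speaker and seating pattern can be sketched as follows; representing a pattern as a sorted tuple of seat assignments is an illustrative assumption.

```python
# A minimal sketch of conversation degree data keyed by speaker and seating
# pattern, as in the P1-P4 example above. The representation is an assumption.

conversation_degree = {}  # (speaker_id, seating_pattern) -> degree of conversation K

def seating_pattern(seats):
    """seats: dict mapping seat name -> occupant ID, e.g. from the camera."""
    return tuple(sorted(seats.items()))

p1 = seating_pattern({"driver": "C1", "front_passenger": "C2", "rear": "C3"})
conversation_degree[("C1", p1)] = 1.0  # pattern P1 entry for occupant C1
```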
The conversation degree update processing unit 28 acquires, from the occupant identification unit 12, seat position data indicating the position of the seat in which each occupant is seated, and acquires the speaker's identification information from the speaker identification processing unit 13.
From among the plurality of conversation degree data for each seat position stored in the conversation degree data storage unit 27, the conversation degree update processing unit 28 acquires the conversation degree data that belongs to the speaker indicated by the identification information and that corresponds to the seat positions indicated by the seat position data.
If the determination result output from the talker presence/absence determination unit 14 indicates that a talker exists, the conversation degree update processing unit 28 updates the acquired conversation degree data so as to raise the degree of conversation.
If the determination result output from the talker presence/absence determination unit 14 indicates that no talker exists, the conversation degree update processing unit 28 updates the acquired conversation degree data so as to lower the degree of conversation.
The conversation degree update processing unit 28 causes the conversation degree data storage unit 27 to store the updated conversation degree data.
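The disclosure states only that the degree of conversation is raised when a talker exists and lowered when no talker exists; it fixes neither a step size nor bounds. A minimal sketch of the update performed by the conversation degree update processing unit 28, reusing the ConversationDegreeStore and SeatingPattern names assumed in the earlier sketch, might read:

    # Raise or lower the degree of conversation K and store it back into
    # the conversation degree data storage unit 27. The fixed step of 0.1
    # and the clamp to [0.0, 1.0] are assumptions for illustration.
    def update_degree(store: ConversationDegreeStore,
                      speaker_id: str,
                      pattern: SeatingPattern,
                      talker_exists: bool,
                      step: float = 0.1) -> float:
        k = store.get(speaker_id, pattern)
        if talker_exists:
            k = min(1.0, k + step)   # a talker exists: raise the degree
        else:
            k = max(0.0, k - step)   # no talker: lower the degree
        store.put(speaker_id, pattern, k)
        return k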
In the speech recognition device 5 shown in FIG. 8, the conversation degree update unit 26 is applied to the speech recognition device 5 shown in FIG. 1. However, this is only an example, and the conversation degree update unit 26 may instead be applied to the speech recognition device 5 shown in FIG. 6.
When the conversation degree update unit 26 is applied to the speech recognition device 5 shown in FIG. 6, the conversation degree data storage unit 27 stores a plurality of conversation degree data for each combination of driving purpose and seat position.
From among the plurality of conversation degree data stored in the conversation degree data storage unit 27, the conversation degree update processing unit 28 acquires the conversation degree data that belongs to the speaker indicated by the identification information output from the speaker identification processing unit 13, that corresponds to the driving purpose indicated by the driving purpose data, and that corresponds to the seat positions indicated by the seat position data.
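Under the same assumptions, keying the stored data by driving purpose as well as by seat position is a small extension of the earlier sketch; the purpose labels below are hypothetical examples, not terms from the disclosure.

    # A sketch of the lookup when conversation degree data are kept per
    # driving purpose and per seating pattern. "commute" and "leisure"
    # would be hypothetical purpose labels.
    from typing import Dict, Tuple

    SeatingPattern = Tuple[Tuple[str, str], ...]

    class PurposeAwareStore:
        def __init__(self) -> None:
            # (speaker ID, driving purpose, seating pattern) -> degree K
            self._data: Dict[Tuple[str, str, SeatingPattern], float] = {}

        def get(self, speaker_id: str, purpose: str,
                pattern: SeatingPattern) -> float:
            return self._data.get((speaker_id, purpose, pattern), 0.0)

        def put(self, speaker_id: str, purpose: str,
                pattern: SeatingPattern, degree: float) -> None:
            self._data[(speaker_id, purpose, pattern)] = degree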
In FIG. 8, it is assumed that each of the speaker identification unit 11, the talker presence/absence determination unit 14, the conversation degree update unit 26, and the response unit 18, which are the constituent elements of the speech recognition device 5, is implemented by dedicated hardware as shown in FIG. 9. That is, it is assumed that the speech recognition device 5 is implemented by a speaker identification circuit 31, a talker presence/absence determination circuit 32, a conversation degree update circuit 37, and a response circuit 34.
Each of the speaker identification circuit 31, the talker presence/absence determination circuit 32, the conversation degree update circuit 37, and the response circuit 34 corresponds to, for example, a single circuit, a composite circuit, a programmed processor, a parallel-programmed processor, an ASIC, an FPGA, or a combination thereof.
The constituent elements of the speech recognition device 5 are not limited to those implemented by dedicated hardware; the speech recognition device 5 may be implemented by software, firmware, or a combination of software and firmware.
When the speech recognition device 5 is implemented by software, firmware, or the like, a program for causing a computer to execute the respective processing procedures of the speaker identification unit 11, the talker presence/absence determination unit 14, the conversation degree update unit 26, and the response unit 18 is stored in the memory 41 shown in FIG. 3. The processor 42 shown in FIG. 3 then executes the program stored in the memory 41.
FIG. 9 shows an example in which each constituent element of the speech recognition device 5 is implemented by dedicated hardware, and FIG. 3 shows an example in which the speech recognition device 5 is implemented by software, firmware, or the like. However, this is only an example; some constituent elements of the speech recognition device 5 may be implemented by dedicated hardware while the remaining constituent elements are implemented by software, firmware, or the like.
Next, the operation of the speech recognition device 5 shown in FIG. 8 will be described.
Since everything other than the conversation degree update unit 26 is the same as in the speech recognition device 5 shown in FIG. 1, the following description focuses on the operation of the conversation degree update unit 26.
As in the first embodiment, the occupant identification unit 12 identifies the position in which each occupant is seated.
The occupant identification unit 12 outputs seat position data indicating the position of the seat in which each occupant is seated to the conversation degree update processing unit 28.
The speaker identification processing unit 13 outputs the speaker's identification information to the conversation degree update processing unit 28.
The conversation degree update processing unit 28 acquires, from the occupant identification unit 12, seat position data indicating the position of the seat in which each occupant is seated, and acquires the speaker's identification information from the speaker identification processing unit 13.
From among the plurality of conversation degree data for each seat position stored in the conversation degree data storage unit 27, the conversation degree update processing unit 28 acquires the conversation degree data that belongs to the speaker indicated by the identification information and that corresponds to the seat positions indicated by the seat position data.
Assume again that three occupants, C1, C2, and C3, are present in the vehicle cabin, with pattern P1 (occupant C1 in the driver's seat, occupant C2 in the front passenger seat, occupant C3 in the rear seat) and pattern P2 (occupant C1 in the driver's seat, occupant C2 in the rear seat, occupant C3 in the front passenger seat), so that the conversation degree data storage unit 27 stores, as the conversation degree data for occupant C1, the conversation degree data for pattern P1 and the conversation degree data for pattern P2. If, hypothetically, occupant C1 tends to speak less when occupant C2 sits in the front passenger seat than when occupant C3 sits there, the degree of conversation K indicated by the conversation degree data for pattern P1 is smaller than the degree of conversation K indicated by the conversation degree data for pattern P2.
For example, if the speaker indicated by the identification information is occupant C1, and the seat position data indicate that occupant C1 sits in the driver's seat, occupant C2 sits in the front passenger seat, and occupant C3 sits in the rear seat, the conversation degree update processing unit 28 acquires the conversation degree data for pattern P1 as the conversation degree data for occupant C1.
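Continuing the earlier sketch with its assumed names and with illustrative degree values, the walk-through above corresponds to a lookup such as:

    store = ConversationDegreeStore()
    P1 = (("driver", "C1"), ("front_passenger", "C2"), ("rear", "C3"))
    P2 = (("driver", "C1"), ("front_passenger", "C3"), ("rear", "C2"))

    store.put("C1", P1, 0.2)  # C1 speaks less with C2 up front (illustrative values)
    store.put("C1", P2, 0.6)

    # Speaker identified as C1, seat position data matching pattern P1:
    k = store.get("C1", P1)   # -> 0.2, the conversation degree data for P1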
The conversation degree update processing unit 28 acquires, from the talker presence/absence determination unit 14, a determination result indicating whether a talker conversing with the speaker exists.
If the determination result indicates that a talker exists, the conversation degree update processing unit 28 updates the acquired conversation degree data so as to raise the degree of conversation K.
If the determination result indicates that no talker exists, the conversation degree update processing unit 28 updates the acquired conversation degree data so as to lower the degree of conversation K.
Since the update of the conversation degree data by the conversation degree update processing unit 28 is the same as the update of the conversation degree data by the conversation degree update processing unit 17 shown in FIG. 1, a detailed description of the update is omitted here.
The conversation degree update processing unit 28 causes the conversation degree data storage unit 27 to store the updated conversation degree data.
In the third embodiment described above, the space is the cabin of a vehicle, the plurality of users present in the space are the plurality of occupants riding in the vehicle, and the speaker identification unit 11 identifies the respective seat positions of the speaker and the talker on the basis of video of the vehicle cabin or sound in the vehicle cabin. The speech recognition device 5 is configured so that the conversation degree update unit 26 acquires, from among a plurality of conversation degree data for each seat position indicating the speaker's past degree of conversation, the conversation degree data for the seat positions identified by the speaker identification unit 11; updates the acquired conversation degree data so as to raise the degree of conversation if the talker presence/absence determination unit 14 determines that a talker exists; and updates the acquired conversation degree data so as to lower the degree of conversation if the talker presence/absence determination unit 14 determines that no talker exists. Therefore, compared with the speech recognition device 5 shown in FIG. 1, the speech recognition device 5 shown in FIG. 8 can further reduce the probability that a user's utterance directed at the speech recognition device 5 is misidentified as a conversation between users.
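For orientation, the decision that the response unit 18 ultimately makes with these data (described in claim 3 below: a response degree that becomes smaller as the degree of conversation becomes larger, compared against a first threshold) can be sketched as follows. The linear form 1 - K and the threshold value 0.5 are assumptions; the claim specifies neither.

    # Claim 3 as a sketch: respond only when the response degree, which
    # shrinks as the degree of conversation K grows, reaches the first
    # threshold. The form (1 - K) and the value 0.5 are assumptions.
    def utterance_is_for_device(k: float, first_threshold: float = 0.5) -> bool:
        response_degree = 1.0 - k
        return response_degree >= first_threshold

    # With the illustrative values above: K = 0.2 gives a response degree
    # of 0.8 (respond); K = 0.6 gives 0.4 (treated as conversation between
    # occupants, so no response data are generated).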
It should be noted that the present disclosure permits free combination of the embodiments, modification of any constituent element of each embodiment, or omission of any constituent element in each embodiment.
The present disclosure is suitable for a speech recognition device and a speech recognition method.
1 camera, 2 microphone, 3 in-vehicle sensor, 4 navigation device, 5 speech recognition device, 6 in-vehicle equipment, 7 output device, 11 speaker identification unit, 12 occupant identification unit, 13 speaker identification processing unit, 14 talker presence/absence determination unit, 15 conversation degree update unit, 16 conversation degree data storage unit, 17 conversation degree update processing unit, 18 response unit, 19 response suitability determination unit, 20 speech recognition unit, 21 response data generation unit, 22 driving purpose prediction unit, 23 conversation degree update unit, 24 conversation degree data storage unit, 25 conversation degree update processing unit, 26 conversation degree update unit, 27 conversation degree data storage unit, 28 conversation degree update processing unit, 31 speaker identification circuit, 32 talker presence/absence determination circuit, 33 conversation degree update circuit, 34 response circuit, 35 driving purpose prediction circuit, 36 conversation degree update circuit, 37 conversation degree update circuit, 41 memory, 42 processor.

Claims (9)

1.  A speech recognition device comprising:
    a speaker identification unit to identify, on the basis of video of a space captured by a camera or sound in the space collected by a microphone, a speaker who is a user who is speaking among a plurality of users present in the space; and
    a response unit to acquire conversation degree data indicating a degree of past conversation between the speaker and a user, among the plurality of users, other than the speaker identified by the speaker identification unit, to determine, on the basis of the conversation degree data, whether an utterance of the speaker is an utterance directed at the speech recognition device, and to generate response data in response to the utterance of the speaker only when determining that the utterance is directed at the speech recognition device.
2.  The speech recognition device according to claim 1, further comprising:
    a talker presence/absence determination unit to determine, on the basis of the number of speakers identified by the speaker identification unit, whether a talker who is a user conversing with the speaker identified by the speaker identification unit exists among the plurality of users; and
    a conversation degree update unit to update the conversation degree data so as to raise the degree of conversation if the talker presence/absence determination unit determines that a talker exists, and to update the conversation degree data so as to lower the degree of conversation if the talker presence/absence determination unit determines that no talker exists.
3.  The speech recognition device according to claim 1, wherein the response unit calculates, as an index for determining whether the utterance of the speaker is an utterance directed at the speech recognition device, a response degree that becomes smaller as the degree of conversation indicated by the conversation degree data becomes larger, determines that the utterance of the speaker is an utterance directed at the speech recognition device if the response degree is equal to or greater than a first threshold, and determines that the utterance of the speaker is not an utterance directed at the speech recognition device if the response degree is less than the first threshold.
4.  The speech recognition device according to claim 1, wherein the response unit determines that the utterance of the speaker is an utterance directed at the speech recognition device if the degree of conversation indicated by the conversation degree data is equal to or less than a second threshold, and determines that the utterance of the speaker is not an utterance directed at the speech recognition device if the degree of conversation is greater than the second threshold.
5.  The speech recognition device according to claim 2, wherein the space is a cabin of a vehicle, the plurality of users present in the space are a plurality of occupants riding in the vehicle, and the speech recognition device further comprises:
    a driving purpose prediction unit to predict a driving purpose of the vehicle from a destination set in a navigation device or from a travel route of the vehicle,
    wherein the conversation degree update unit acquires, from among a plurality of conversation degree data for each driving purpose indicating the speaker's past degree of conversation, the conversation degree data for the driving purpose predicted by the driving purpose prediction unit, updates the acquired conversation degree data so as to raise the degree of conversation if the talker presence/absence determination unit determines that a talker exists, and updates the acquired conversation degree data so as to lower the degree of conversation if the talker presence/absence determination unit determines that no talker exists.
6.  The speech recognition device according to claim 2, wherein the space is a cabin of a vehicle, the plurality of users present in the space are a plurality of occupants riding in the vehicle,
    the speaker identification unit identifies respective seat positions of the speaker and the talker on the basis of video of the vehicle cabin or sound in the vehicle cabin, and
    the conversation degree update unit acquires, from among a plurality of conversation degree data for each seat position indicating the speaker's past degree of conversation, the conversation degree data for the seat positions identified by the speaker identification unit, updates the acquired conversation degree data so as to raise the degree of conversation if the talker presence/absence determination unit determines that a talker exists, and updates the acquired conversation degree data so as to lower the degree of conversation if the talker presence/absence determination unit determines that no talker exists.
7.  The speech recognition device according to claim 2, wherein the talker presence/absence determination unit analyzes a motion of a user other than the speaker on the basis of the video captured by the camera, and determines, on the basis of a result of analyzing the motion, whether the user other than the speaker is conversing with the speaker.
8.  A speech recognition method comprising:
    identifying, by a speaker identification unit, on the basis of video of a space captured by a camera or sound in the space collected by a microphone, a speaker who is a user who is speaking among a plurality of users present in the space; and
    acquiring, by a response unit, conversation degree data indicating a degree of past conversation between the speaker and a user, among the plurality of users, other than the speaker identified by the speaker identification unit, determining, on the basis of the conversation degree data, whether an utterance of the speaker is an utterance directed at a speech recognition device, and generating response data in response to the utterance of the speaker only when determining that the utterance is directed at the speech recognition device.
9.  The speech recognition method according to claim 8, further comprising:
    determining, by a talker presence/absence determination unit, on the basis of the number of speakers identified by the speaker identification unit, whether a talker who is a user conversing with the speaker identified by the speaker identification unit exists among the plurality of users; and
    updating, by a conversation degree update unit, the conversation degree data so as to raise the degree of conversation if the talker presence/absence determination unit determines that a talker exists, and updating the conversation degree data so as to lower the degree of conversation if the talker presence/absence determination unit determines that no talker exists.
