CN111766303B - Voice acquisition method, device, equipment and medium based on acoustic environment evaluation - Google Patents

Voice acquisition method, device, equipment and medium based on acoustic environment evaluation

Info

Publication number
CN111766303B
Authority
CN
China
Prior art keywords
acquisition
voice
signal
environment
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010913557.XA
Other languages
Chinese (zh)
Other versions
CN111766303A (en)
Inventor
谢单辉
张伟彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Voiceai Technologies Co ltd
Original Assignee
Voiceai Technologies Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Voiceai Technologies Co ltd
Priority to CN202010913557.XA
Publication of CN111766303A
Application granted
Publication of CN111766303B

Classifications

    • G01N29/11: Analysing solids by measuring attenuation of acoustic waves (investigating or analysing materials by the use of ultrasonic, sonic or infrasonic waves)
    • G01B17/00: Measuring arrangements characterised by the use of infrasonic, sonic or ultrasonic vibrations
    • G01B17/02: Measuring arrangements characterised by the use of infrasonic, sonic or ultrasonic vibrations, for measuring thickness
    • G01S5/18: Position-fixing by co-ordinating two or more direction or position line determinations using ultrasonic, sonic or infrasonic waves
    • G01N2291/015: Indexing code associated with the measuring variable: attenuation, scattering
    • G01N2291/023: Indexing code associated with the analysed material: solids


Abstract

The application relates to a voice acquisition method and apparatus based on acoustic environment assessment, a computer device and a storage medium. An environment image corresponding to an acquisition room is acquired and detected to obtain the environment objects and their material types; the sound absorption coefficient matched with each material type and the distance between each environment object and the voice acquisition device are obtained, and the indoor impulse response of each preset voice acquisition position of the acquisition room is calculated according to the sound absorption coefficients and the distances. Each indoor impulse response is divided according to response time to obtain the voice signal and the reverberation signal of each preset voice acquisition position. An environmental noise signal of the acquisition room is acquired, the acoustic environment of the acquisition room is evaluated according to the environmental noise signal and the voice signal and reverberation signal of each preset voice acquisition position, a target acquisition strategy matched with the acoustic environment evaluation result is obtained, and voice acquisition is carried out according to the target acquisition strategy. The method can improve the quality of the collected voice data.

Description

Voice acquisition method, device, equipment and medium based on acoustic environment evaluation
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to a method and an apparatus for acquiring speech based on acoustic environment assessment, a computer device, and a storage medium.
Background
The collection of voice signals is the first step in applications such as voice calls, speech recognition and voiceprint identification. Generally speaking, the higher the quality of the collected voice data, the better the performance of recognition, authentication and the like. Owing to the physical characteristics of sound signals, the acoustic environment is one of the important factors affecting the quality of the collected voice data. Therefore, it is extremely important to evaluate the acoustic environment when voice data is collected.
At present, voice acquisition equipment usually collects voice data in a fixed acquisition mode, or the acquisition mode of the voice acquisition equipment is adjusted after the acoustic environment has been evaluated manually. However, such evaluation requires an evaluator with professional knowledge of acoustic environment assessment and consumes manpower. Moreover, judging the actual acoustic environment by ear is not accurate enough, so the quality of the collected voice data is low.
Disclosure of Invention
In view of the above technical problems, it is necessary to provide a voice acquisition method, apparatus, computer device and storage medium based on acoustic environment assessment that can improve the quality of collected voice data.
A method of speech acquisition based on acoustic environment assessment, the method comprising:
acquiring an environment image corresponding to an acquisition room, and detecting the environment image to obtain an environment object and a material type corresponding to the environment object;
acquiring sound absorption coefficients matched with the material types and the distance between the environment object and a voice acquisition device, determining position coordinates of the voice acquisition device in the environment object according to the distance, and calculating indoor impulse responses of all preset voice acquisition positions of the acquisition room according to the sound absorption coefficients and the position coordinates of the voice acquisition device; the indoor impulse response carries a response time;
dividing the indoor impulse response of each preset voice acquisition position according to the response time to obtain a voice signal of each preset voice acquisition position and a reverberation signal of each preset voice acquisition position;
acquiring an environmental noise signal of the acquisition room, calculating according to the voice signal of each preset voice acquisition position and the environmental noise signal to obtain a signal-to-noise ratio of each preset voice acquisition position, calculating according to the voice signal of each preset voice acquisition position and the reverberation signal of each preset voice acquisition position to obtain a signal-to-mixing ratio (i.e. signal-to-reverberation ratio) of each preset voice acquisition position, and evaluating the acoustic environment of the acquisition room according to the signal-to-noise ratio and the signal-to-mixing ratio of each preset voice acquisition position to obtain an acoustic environment evaluation result;
and acquiring a target acquisition strategy matched with the acoustic environment evaluation result, and acquiring voice according to the target acquisition strategy.
In one embodiment, the detecting the environment image to obtain an environment object and a material type corresponding to the environment object includes:
performing feature extraction on the environment image to obtain shape features of the environment image;
detecting an environment object from the environment image according to the shape feature;
extracting texture features of the environment object from the environment image;
and classifying and identifying the texture features to obtain the material type corresponding to the environment object.
In one embodiment, the calculating the indoor impulse response of each preset voice collecting position of the collecting chamber according to the sound absorption coefficient and the position coordinates of the voice collecting device includes:
calculating a sound reflection coefficient matched with the material type according to the sound absorption coefficient;
calculating to obtain attenuation signals of all preset voice acquisition positions of the acquisition room according to the sound reflection coefficient and the position coordinates of the voice acquisition device;
acquiring sound velocity, and calculating to obtain transmission delay of each preset voice acquisition position of the acquisition chamber according to the position coordinates of the voice acquisition device and the sound velocity;
and calculating the attenuation signals of the preset voice acquisition positions and the transmission delay of the preset voice acquisition positions based on a mirror image sound source method to obtain the indoor impulse response of the preset voice acquisition positions.
In one embodiment, the preset conditions are that, for each preset voice acquisition position within a preset range centered on the voice acquisition device, the signal-to-noise ratio is higher than a first threshold value and the signal-to-mixing ratio is higher than a second threshold value.
In one embodiment, the acquiring a target acquisition policy matched with the acoustic environment evaluation result and performing voice acquisition according to the target acquisition policy includes:
when the acoustic environment evaluation result meets a preset condition, determining a target acquisition position corresponding to the acquisition room according to the acoustic environment evaluation result;
generating a first prompt message according to the target acquisition position; the first prompt message is used for prompting a user to move to the target acquisition position;
and acquiring a target acquisition mode corresponding to the target acquisition position, and acquiring voice of the user at the target acquisition position based on the target acquisition mode.
In one embodiment, the method further comprises:
when the acoustic environment evaluation result does not meet the preset condition, generating a second prompt message according to the acoustic environment evaluation result; the second prompt message is used for prompting a user to adjust the environment of the collection room;
after the user adjusts the environment of the acquisition room, returning to execute the acquisition of the environment image corresponding to the acquisition room; and detecting the environment image to obtain an environment object and a material type corresponding to the environment object.
In one embodiment, the acquiring the environmental noise signal of the collection chamber includes:
collecting the sound of the collection chamber through a voice collection device to obtain a sound electric signal corresponding to the sound;
extracting an environmental noise electrical signal from the sound electrical signal by performing voice activation detection on the sound electrical signal;
and acquiring the sensitivity of the voice acquisition device, and calculating the sensitivity and the environmental noise electric signal to obtain an environmental noise signal of the acquisition room.
A speech acquisition apparatus based on acoustic environment assessment, the apparatus comprising:
the image detection module is used for acquiring an environment image corresponding to the acquisition room, and detecting the environment image to obtain an environment object and a material type corresponding to the environment object;
the indoor impulse response calculation module is used for acquiring sound absorption coefficients matched with the material types and the distance between the environment object and the voice acquisition device, determining position coordinates of the voice acquisition device in the environment object according to the distance, and calculating indoor impulse responses of all preset voice acquisition positions of the acquisition room according to the sound absorption coefficients and the position coordinates of the voice acquisition device; the indoor impulse response carries a response time; dividing the indoor impulse response of each preset voice acquisition position according to the response time to obtain a voice signal of each preset voice acquisition position and a reverberation signal of each preset voice acquisition position;
the acoustic environment evaluation module is used for acquiring an environmental noise signal of the acquisition room, calculating according to the voice signal of each preset voice acquisition position and the environmental noise signal to obtain a signal-to-noise ratio of each preset voice acquisition position, calculating according to the voice signal of each preset voice acquisition position and a reverberation signal of each preset voice acquisition position to obtain a signal-to-mixing ratio of each preset voice acquisition position, and evaluating the acoustic environment of the acquisition room according to the signal-to-noise ratio and the signal-to-mixing ratio of each preset voice acquisition position to obtain an acoustic environment evaluation result;
and the voice acquisition module is used for acquiring a target acquisition strategy matched with the acoustic environment evaluation result and acquiring voice according to the target acquisition strategy.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the method described above when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method.
According to the voice acquisition method and apparatus based on acoustic environment assessment, the computer device and the storage medium, an environment image corresponding to the acquisition room is acquired and detected to obtain the environment objects and the material type corresponding to each environment object; the sound absorption coefficient matched with each material type and the distance of each environment object relative to the voice acquisition device are acquired, the position coordinates of the voice acquisition device are determined according to the distances, and the indoor impulse response of each preset voice acquisition position of the acquisition room is calculated according to the sound absorption coefficients and the position coordinates of the voice acquisition device, the indoor impulse response carrying a response time; the indoor impulse response of each preset voice acquisition position is divided according to the response time to obtain the voice signal and the reverberation signal of each preset voice acquisition position; an environmental noise signal of the acquisition room is acquired, the signal-to-noise ratio of each preset voice acquisition position is calculated from its voice signal and the environmental noise signal, the signal-to-mixing ratio of each preset voice acquisition position is calculated from its voice signal and its reverberation signal, and the acoustic environment of the acquisition room is evaluated according to the signal-to-noise ratio and the signal-to-mixing ratio of each preset voice acquisition position to obtain an acoustic environment evaluation result; a target acquisition strategy matched with the acoustic environment evaluation result is then acquired, and voice acquisition is carried out according to the target acquisition strategy. Compared with the traditional scheme, in which the acoustic environment of the acquisition room is assessed manually and the acquisition mode of the voice acquisition equipment is adjusted manually, this method evaluates the acoustic environment automatically and adopts a matching acquisition strategy, which saves manpower and improves the quality of the collected voice data.
Drawings
FIG. 1 is a diagram of an embodiment of an application environment of a speech acquisition method based on acoustic environment assessment;
FIG. 2 is a schematic flow chart of a speech acquisition method based on acoustic environment assessment according to an embodiment;
FIG. 3 is a flow diagram illustrating an exemplary method for detecting an environmental image;
FIG. 4 is a schematic flow chart illustrating voice collection via a target collection policy according to an embodiment;
FIG. 5 is a schematic flow chart illustrating voice capture via a target capture strategy according to another embodiment;
FIG. 6 is a flow diagram illustrating a method for acquiring an ambient noise signal according to one embodiment;
FIG. 7 is a flow chart illustrating a speech acquisition method based on acoustic environment assessment in another embodiment;
FIG. 8 is a block diagram of a speech acquisition device based on acoustic environment assessment in one embodiment;
FIG. 9 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The voice acquisition method based on acoustic environment assessment provided by the application can be applied to the application environment shown in fig. 1. The voice acquisition device 102 is placed in the acquisition room 104, the voice acquisition device 102 can evaluate the acoustic environment of the acquisition room 104 to obtain an acoustic environment evaluation result of the acquisition room 104, and voice acquisition is performed in the acquisition room 104 through a voice acquisition strategy corresponding to the acoustic environment evaluation result of the acquisition room 104. The voice capture device 102 may be, but is not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices configured with voice capture devices and sensors.
Specifically, the voice capturing device 102 acquires an environmental image corresponding to the capturing room 104 through a sensor. The voice collecting device 102 detects the environment image to obtain the environment object and the material type corresponding to the environment object. The voice acquisition equipment 102 acquires the sound absorption coefficient matched with the material type and the distance between the environment object and the voice acquisition device, determines the position coordinate of the voice acquisition device in the environment object according to the distance, and calculates the indoor impulse response of each preset voice acquisition position of the acquisition room 104 according to the sound absorption coefficient and the position coordinate of the voice acquisition device, wherein the indoor impulse response carries the response time. The voice collecting device 102 divides the indoor impulse response of each preset voice collecting position according to the response time to obtain the voice signal of each preset voice collecting position and the reverberation signal of each preset voice collecting position.
Further, the voice collecting device 102 obtains the ambient noise signal of the collecting chamber 104 through the voice collecting device, calculates according to the voice signal and the ambient noise signal of each preset voice collecting position to obtain the signal-to-noise ratio of each preset voice collecting position, calculates according to the voice signal of each preset voice collecting position and the reverberation signal of each preset voice collecting position to obtain the signal-to-mixing ratio of each preset voice collecting position, and evaluates the acoustic environment of the collecting chamber 104 according to the signal-to-noise ratio and the signal-to-mixing ratio of each preset voice collecting position to obtain the acoustic environment evaluation result. The voice collecting device 102 obtains a target collecting strategy matched with the acoustic environment evaluation result, and performs voice collection in the collecting chamber 104 according to the target collecting strategy.
In one embodiment, as shown in fig. 2, a speech acquisition method based on acoustic environment evaluation is provided, which is illustrated by applying the method to the speech acquisition device in fig. 1, and includes the following steps:
step 202, obtaining an environment image corresponding to the collection room, and detecting the environment image to obtain an environment object and a material type corresponding to the environment object.
The environment image is image data of the environment of the acquisition room. The environment image can be captured of the acquisition room when the voice acquisition equipment is started for the first time, after a scene is switched, or in response to an environment-image acquisition instruction issued by the user of the voice acquisition equipment. Switching the scene may specifically mean changing the acquisition room, changing the environment of the acquisition room, or adjusting the position of the voice acquisition device.
The collection room is a room used for collecting voice, and may specifically be a sample voice collection room, an interrogation room, a conference room and the like. The environment objects refer to the wall surfaces, the ground and the ceiling of the collection room. The material types of the wall surface include mud, paint, wallpaper, wall cloth, artificial decorative boards, stone, ceramics, glass, metal and the like. The material types of the ground include concrete, stone, wood, tile, carpet and the like. The material types of the ceiling include gypsum board, sound-absorbing mineral wool board, plywood, strip aluminium gusset plates and the like.
Specifically, the voice acquisition device can acquire an environmental image corresponding to the acquisition room by arranging a sensor. When the sensor types are different, the data types of the obtained environment images can also be different, such as a depth image, an infrared image, a point cloud, a gray scale image, a color image and the like.
Further, after the environment image of the collection room is obtained, the voice collection device detects the environment image. When the environment image is detected, the voice acquisition equipment can call the pre-trained image detection model and input the environment image into the pre-trained image detection model. And detecting the environment image through a pre-trained image detection model, and outputting an environment object in the environment image and a material type corresponding to the environment object.
Specifically, the pre-trained image detection model may include target detection and material identification. The model performs target detection on the environment image to obtain the environment objects in the environment image, namely the wall surfaces, the ground and the ceiling of the collection room. It then extracts the features of each environment object from the environment image and identifies the material of the environment object from those features to obtain its material type.
In one embodiment, the pre-trained image detection model can also detect size information of the environmental object through target detection. Such as the width and height of the wall surface, etc. The voice acquisition equipment can calculate the size information of the acquisition room, namely the length, width and height of the acquisition room, through the size information of the environment object.
In one embodiment, the pre-trained image detection model can also detect the distance between the environmental object and the voice acquisition device through target detection. The voice acquisition equipment can also calculate the size information of the acquisition room through the distance between the environment object and the voice acquisition equipment, and can also calculate the position of the voice acquisition equipment in the acquisition room.
In one embodiment, the pre-trained image detection model can also identify the surface smoothness of the environment object through material identification.
Step 204, obtaining sound absorption coefficients matched with material types and distances of the environment objects relative to the voice acquisition devices, determining position coordinates of the voice acquisition devices in the environment objects according to the distances, and calculating indoor impulse responses of all preset voice acquisition positions of the acquisition room according to the sound absorption coefficients and the position coordinates of the voice acquisition devices; the indoor impulse response carries a response time.
The sound absorption coefficient is a coefficient reflecting the sound absorption capability of a material or a structure. The sound absorption coefficients of different materials are different; for example, for a sound signal with a frequency of 500 Hz, the sound absorption coefficient of stone is 0.01 and that of concrete or mud is 0.02. The indoor impulse response refers to the signal sequence received at a receiving position in a sound field when an impulse sound source radiates. In this embodiment, the receiving position refers to the position of the voice capturing apparatus.
Specifically, after the voice acquisition equipment identifies the material types (or the material types and the surface smoothness) of the wall surfaces, the ground and the ceiling of the acquisition room, it can read a material sound absorption coefficient table from local storage and look up the sound absorption coefficient matched with each material type, or it can access a server and look up the matching sound absorption coefficient over the network, thereby obtaining the sound absorption coefficients corresponding to the wall surfaces, the ground and the ceiling of the acquisition room.
The voice acquisition equipment can detect the acquisition room through an infrared sensor, and the distances between the surrounding wall surfaces, the ground and the ceiling in the acquisition room and the voice acquisition equipment are obtained. The voice acquisition equipment can calculate the length, width and height of the acquisition room and the position of the voice acquisition equipment in the acquisition room according to the distances between the surrounding wall surfaces, the ground and the ceiling in the acquisition room and the voice acquisition equipment.
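As a simple illustration of this step, the room dimensions and the device position can be recovered directly from the measured distances. The sketch below (Python) assumes that the corner behind, to the left of and below the device is used as the coordinate origin; this convention and the function name are illustrative, not part of the patent.

```python
def room_geometry_from_distances(d_left, d_right, d_front, d_back, d_floor, d_ceiling):
    """Room dimensions and device position from the distances measured (for
    example with the infrared sensor mentioned above) to the surrounding wall
    surfaces, the ground and the ceiling."""
    length = d_front + d_back
    width = d_left + d_right
    height = d_floor + d_ceiling
    device_position = (d_back, d_left, d_floor)  # coordinates w.r.t. one corner
    return (length, width, height), device_position
```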
Furthermore, the voice collection equipment calculates the sound reflection coefficients corresponding to the wall surface, the ground surface and the ceiling of the collection room through the sound absorption coefficients corresponding to the wall surface, the ground surface and the ceiling of the collection room. The specific calculation process is that the sound absorption coefficient is subtracted from 1 to obtain the corresponding sound reflection coefficient.
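A minimal sketch of the coefficient lookup and the reflection-coefficient calculation described above follows. Only the stone and concrete/mud values at 500 Hz come from the description; the remaining table entries are illustrative placeholders, not measured data.

```python
# Sound absorption coefficients for a 500 Hz band. Stone and concrete/mud are
# the values quoted in the description; the other entries are placeholders.
ABSORPTION_500HZ = {
    "stone": 0.01,
    "concrete": 0.02,
    "mud": 0.02,
    "carpet": 0.30,        # placeholder
    "gypsum_board": 0.05,  # placeholder
}

def absorption_to_reflection(alpha: float) -> float:
    """Sound reflection coefficient: the absorption coefficient subtracted from 1."""
    return 1.0 - alpha

print(absorption_to_reflection(ABSORPTION_500HZ["stone"]))  # 0.99
```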
After obtaining the length, width and height of the collection room and the position of the voice collection equipment in the collection room, the voice collection equipment may select virtual sound source points in the collection room. For example, if the length, width and height of the collection room are all 3 meters and the voice collection equipment is located at the center of the room, points at every 30 degrees of 3D angle at distances of 0.5, 0.8, 1.0, 1.2 and 1.5 meters from the voice collection equipment may be selected as virtual sound source points. The voice collection equipment obtains the position information of each virtual sound source point, namely the position coordinates of each virtual sound source point relative to the voice acquisition device (namely, the voice collection equipment). For example, if the voice collection equipment is arranged at a corner of the collection room, that corner can be used as the coordinate origin to establish a coordinate system, and the position coordinates of each virtual sound source point in this coordinate system can be obtained.
Based on the mirror image sound source method, the voice collection equipment calculates the sound reflection coefficients corresponding to the wall surface, the ground and the ceiling of the collection room and the position coordinates of each virtual sound source point to obtain the attenuation signals of each virtual sound source point. The voice acquisition equipment acquires the sound velocity, calculates the sound velocity and the position coordinates of each virtual sound source point to obtain the transmission delay of each virtual sound source point, calculates the attenuation signal and the transmission delay of each virtual sound source point to obtain the indoor impulse response of each virtual sound source point, namely the indoor impulse response of each preset voice acquisition position of the acquisition room.
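The image (mirror) sound source calculation can be sketched as follows. This is a deliberately simplified first-order version for illustration only; a practical implementation would also generate higher-order images. The reflection-coefficient weighting, the 1/r attenuation and the distance-over-sound-speed delay follow the description above, while the sampling rate and response length are assumed values.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def first_order_rir(room, src, mic, reflection, fs=16000, length=4096):
    """Simplified first-order image-source sketch.

    room       -- (Lx, Ly, Lz) room dimensions in meters
    src        -- (x, y, z) virtual sound source point (a preset position)
    mic        -- (x, y, z) position of the voice acquisition device
    reflection -- reflection coefficients for "walls", "ground", "ceiling"
    """
    h = np.zeros(length)
    src = np.asarray(src, float)
    mic = np.asarray(mic, float)

    def add_path(image_pos, gain):
        dist = float(np.linalg.norm(image_pos - mic))
        n = int(round(dist / SPEED_OF_SOUND * fs))   # transmission delay in samples
        if n < length:
            h[n] += gain / max(dist, 1e-3)           # 1/r attenuation times reflection loss

    add_path(src, 1.0)                               # direct sound

    Lx, Ly, Lz = room
    images = [                                       # one image source per boundary
        (np.array([-src[0], src[1], src[2]]),         reflection["walls"]),
        (np.array([2 * Lx - src[0], src[1], src[2]]), reflection["walls"]),
        (np.array([src[0], -src[1], src[2]]),         reflection["walls"]),
        (np.array([src[0], 2 * Ly - src[1], src[2]]), reflection["walls"]),
        (np.array([src[0], src[1], -src[2]]),         reflection["ground"]),
        (np.array([src[0], src[1], 2 * Lz - src[2]]), reflection["ceiling"]),
    ]
    for pos, r in images:
        add_path(pos, r)
    return h
```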
And step 206, dividing the indoor impulse response of each preset voice acquisition position according to the response time to obtain the voice signal of each preset voice acquisition position and the reverberation signal of each preset voice acquisition position.
Wherein the indoor impulse response comprises direct sound, early reflected sound and late reverberant sound. Direct sound refers to unreflected sound. Early reflections refer to sounds produced by primary or secondary reflections. Late reverberation is sound that is produced through multiple reflections. The speech signal includes direct sound and early reflected sound. The reverberant signal comprises late reverberant sound.
In particular, the indoor impulse response carries a response time. Because the direct sound, the early reflected sound and the late reverberant sound undergo different numbers of reflections, their response times are different. Therefore, the room impulse response can be divided by response time into the direct sound, the early reflected sound and the late reverberant sound. In one embodiment, the main peak of the room impulse response may be taken as the direct sound, the signal within 50 milliseconds after the direct sound treated as early reflections, and the signal more than 50 milliseconds after the direct sound treated as late reverberant sound.
Further, the voice collecting device correspondingly combines the direct sound of each preset voice collecting position and the early reflected sound of each preset voice collecting position to obtain the voice signal of each preset voice collecting position, and the late reverberation sound of each preset voice collecting position is used as the reverberation signal of each preset voice collecting position.
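A minimal sketch of the 50 ms split described above, operating on one simulated impulse response; the sampling rate is an assumed value.

```python
import numpy as np

def split_impulse_response(h, fs=16000, early_ms=50.0):
    """Split a room impulse response into the voice part (direct sound plus
    early reflections) and the reverberation part (late reverberation)."""
    direct_idx = int(np.argmax(np.abs(h)))               # main peak = direct sound
    boundary = direct_idx + int(early_ms * fs / 1000.0)  # 50 ms after the direct sound
    voice_part = h[:boundary]                            # direct + early reflections
    reverb_part = h[boundary:]                           # late reverberation
    return voice_part, reverb_part
```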
And step 208, acquiring an environmental noise signal of the acquisition room, calculating according to the voice signal and the environmental noise signal of each preset voice acquisition position to obtain a signal-to-noise ratio of each preset voice acquisition position, calculating according to the voice signal of each preset voice acquisition position and the reverberation signal of each preset voice acquisition position to obtain a signal-to-mixing ratio of each preset voice acquisition position, and evaluating the acoustic environment of the acquisition room according to the signal-to-noise ratio and the signal-to-mixing ratio of each preset voice acquisition position to obtain an acoustic environment evaluation result.
Specifically, the voice collecting device can obtain an environmental noise signal of the collecting chamber through the voice collecting device, then calculate the environmental noise signal, the voice signal of each preset voice collecting position and the reverberation signal of each preset voice collecting position, and evaluate the acoustic environment of the collecting chamber through the calculation result.
The voice acquisition equipment calculates the voice signal and the environmental noise signal of each preset voice acquisition position to obtain the signal-to-noise ratio of each preset voice acquisition position, and calculates the voice signal and the reverberation signal of each preset voice acquisition position to obtain the signal-to-mixing ratio of each preset voice acquisition position. Specifically, the voice acquisition equipment calculates the effective power of the voice signal of each preset voice acquisition position, the effective power of the reverberation signal of each preset voice acquisition position and the effective power of the environmental noise signal. The ratio of the effective power of the voice signal of each preset voice acquisition position to the effective power of the environmental noise signal gives the signal-to-noise ratio of that position, and the ratio of the effective power of the voice signal of each preset voice acquisition position to the effective power of the reverberation signal of the same position gives the signal-to-mixing ratio of that position.
Further, the voice acquisition equipment evaluates the acoustic environment of the acquisition room according to the signal-to-noise ratio and the signal-to-mixing ratio of each preset voice acquisition position. In one embodiment, when the signal-to-noise ratio of at least one preset voice acquisition position within a preset range centered on the voice acquisition device is lower than a first threshold, or its signal-to-mixing ratio is lower than a second threshold, the acoustic environment of the acquisition room is poor and not suitable for voice acquisition. For example, when there is a virtual sound source point within 1.5 m of the voice acquisition device whose signal-to-noise ratio is lower than the first preset threshold or whose signal-to-mixing ratio is lower than the second preset threshold, the current environment is not suitable for voice acquisition and needs to be modified to some extent before voice acquisition. The preset range, the first threshold and the second threshold can be set as required, in particular according to the use of the acquisition room; for example, different preset ranges, first thresholds and second thresholds may be set for an acquisition room used for conferences and one used for interrogation.
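The power-ratio calculation and the threshold check described above can be sketched as follows; the thresholds and the preset range are configuration inputs, and the function names are illustrative.

```python
import numpy as np

def effective_power(x):
    """Mean-square (effective) power of a signal."""
    return float(np.mean(np.square(x)))

def position_metrics(voice_part, reverb_part, noise_signal, eps=1e-12):
    """Signal-to-noise ratio and signal-to-mixing ratio of one preset voice
    acquisition position, as ratios of effective powers."""
    p_voice = effective_power(voice_part)
    snr = p_voice / (effective_power(noise_signal) + eps)
    smr = p_voice / (effective_power(reverb_part) + eps)
    return snr, smr

def environment_suitable(metrics_in_range, snr_threshold, smr_threshold):
    """metrics_in_range: (snr, smr) pairs of the preset positions inside the
    preset range around the device; the room passes only if every such
    position exceeds both thresholds."""
    return all(snr > snr_threshold and smr > smr_threshold
               for snr, smr in metrics_in_range)
```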
And step 210, acquiring a target acquisition strategy matched with the acoustic environment evaluation result, and acquiring voice according to the target acquisition strategy.
The target acquisition strategy refers to a voice acquisition strategy matched with an acoustic environment evaluation result of an acquisition room. The voice acquisition strategy refers to a voice acquisition rule which is preset for the voice acquisition equipment, and different acoustic environment evaluation results correspond to different voice acquisition strategies. The voice acquisition strategy may include, but is not limited to, a target acquisition location, a target acquisition distance, a target acquisition mode, and the like.
Specifically, after obtaining the acoustic environment evaluation result of the collection chamber, the voice collection device may obtain a target collection policy that matches the acoustic environment evaluation result, and perform voice collection in the collection chamber according to the target collection policy.
In the voice acquisition method based on acoustic environment assessment, an environment image corresponding to the acquisition room is acquired and detected to obtain the environment objects and the material type corresponding to each environment object; the sound absorption coefficient matched with each material type and the distance between each environment object and the voice acquisition device are acquired, the position coordinates of the voice acquisition device are determined according to the distances, and the indoor impulse response of each preset voice acquisition position of the acquisition room is calculated according to the sound absorption coefficients and the position coordinates of the voice acquisition device, the indoor impulse response carrying a response time; the indoor impulse response of each preset voice acquisition position is divided according to the response time to obtain the voice signal and the reverberation signal of each preset voice acquisition position; an environmental noise signal of the acquisition room is acquired, the signal-to-noise ratio of each preset voice acquisition position is calculated from its voice signal and the environmental noise signal, the signal-to-mixing ratio of each preset voice acquisition position is calculated from its voice signal and its reverberation signal, and the acoustic environment of the acquisition room is evaluated according to the signal-to-noise ratio and the signal-to-mixing ratio of each preset voice acquisition position to obtain an acoustic environment evaluation result; a target acquisition strategy matched with the acoustic environment evaluation result is then acquired, and voice acquisition is carried out according to the target acquisition strategy. Compared with the traditional scheme, in which the acoustic environment of the acquisition room is assessed manually and the acquisition mode of the voice acquisition equipment is adjusted manually, this method evaluates the acoustic environment automatically and adopts a matching acquisition strategy, which saves manpower and improves the quality of the collected voice data.
In one embodiment, as shown in FIG. 3, step 202 comprises:
step 302, extracting the characteristics of the environment image to obtain the shape characteristics of the environment image;
step 304, detecting an environment object from the environment image according to the shape feature;
step 306, extracting texture features of the environment object from the environment image;
and 308, classifying and identifying the texture features to obtain the material type corresponding to the environment object.
The image features mainly comprise color features, texture features, shape features and spatial relationship features of the image. Texture features describe the surface properties of the scene to which the image or image region corresponds. The shape features include contour features and region features.
Specifically, the voice acquisition device performs preprocessing, such as graying processing and filtering processing, on the environment image, so that the voice acquisition device can better extract the shape characteristics of the environment image from the preprocessed environment image. After the shape features of the environment image are extracted and obtained, the voice acquisition equipment calls a pre-trained classifier, inputs the shape features of the environment image into the pre-trained classifier to classify and recognize the shape features of the environment image, and obtains environment objects in the environment image, namely the wall surface, the ground surface and the ceiling of the acquisition room.
In one embodiment, the voice acquisition device can also obtain the size information of the environment object through shape feature recognition of the environment image.
In one embodiment, the voice collecting device may further extract a spatial relationship feature of the environment image, and the distance between the environment object and the voice collecting device is obtained through spatial relationship feature recognition.
Further, after the environment object in the environment image is identified, the voice acquisition device extracts the texture feature of the environment object from the environment image, inputs the texture feature of the environment object into another pre-trained classifier, and performs classification and identification on the texture feature of the environment object through the pre-trained classifier, so as to identify and obtain the material type of the environment object.
In one embodiment, the voice acquisition device can also obtain the surface smoothness of the environment object through the texture feature recognition of the environment object.
It should be understood that the trained classifiers can be implemented based on an SVM (Support Vector Machine), a DT (Decision Tree), an NBM (Naive Bayes Model), a GMM (Gaussian Mixture Model), or a classification neural network.
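As one concrete instance of the classifier choices listed above, the sketch below trains an SVM on texture feature vectors. The feature dimensionality, sample count and label set are illustrative placeholders; in practice the features would come from labelled texture patches of known materials.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
train_features = rng.random((60, 32))   # placeholder texture feature vectors
train_labels = rng.choice(["stone", "concrete", "carpet", "gypsum_board"], size=60)

material_classifier = SVC(kernel="rbf")
material_classifier.fit(train_features, train_labels)

def classify_material(texture_features):
    """Predict the material type of one environment object from its texture
    feature vector (32-dimensional here, purely for illustration)."""
    return material_classifier.predict(np.asarray(texture_features).reshape(1, -1))[0]
```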
In this embodiment, by detecting the environment image, the environment objects in the environment image and the material types corresponding to the environment objects can be obtained, which provides a basis for the subsequent intelligent assessment of the acoustic environment of the collection room.
In one embodiment, as shown in FIG. 4, step 210 comprises:
step 402, when the acoustic environment evaluation result meets a preset condition, determining a target acquisition position corresponding to the acquisition room according to the acoustic environment evaluation result;
step 404, generating a first prompt message according to the target acquisition position; the first prompt message is used for prompting the user to move to a target acquisition position;
and 406, acquiring a target acquisition mode corresponding to the target acquisition position, and acquiring voice of the user at the target acquisition position based on the target acquisition mode.
The preset conditions are that the signal-to-noise ratio of each preset voice acquisition position within a preset range centered on the voice acquisition equipment (namely, the voice acquisition device) is higher than a first threshold value and the signal-to-mixing ratio of each such position is higher than a second threshold value. For example, the preset conditions may be that the signal-to-noise ratio of each preset voice acquisition position within 1.5 m of the voice acquisition device is higher than the first threshold value and its signal-to-mixing ratio is higher than the second threshold value.
Specifically, when the acoustic environment evaluation result meets the preset conditions, that is, when the signal-to-noise ratio of each preset voice acquisition position within the preset range centered on the voice acquisition device is higher than the first threshold and its signal-to-mixing ratio is higher than the second threshold, the voice acquisition equipment counts all preset voice acquisition positions in the acquisition room whose signal-to-noise ratio is higher than the first threshold and whose signal-to-mixing ratio is higher than the second threshold, and selects one preset voice acquisition position from this statistical result as the target acquisition position. The target acquisition position can be the preset voice acquisition position with the highest signal-to-noise ratio, or the one with the highest signal-to-mixing ratio, or the signal-to-noise ratio and the signal-to-mixing ratio can be weighted and the preset voice acquisition position corresponding to the maximum weighted result taken as the target acquisition position.
After the target acquisition position corresponding to the acquisition room is determined, the voice acquisition device generates a first prompt message based on the target acquisition position, specifically, the voice acquisition device may display the first prompt message through a display screen, or may emit a voice corresponding to the first prompt message through a speaker. After receiving the first prompt message, the user can move to the target acquisition position.
Further, the voice acquisition equipment calculates the target distance between the voice acquisition equipment and the target acquisition position through the target acquisition position, and determines the target acquisition mode according to the target distance. The target acquisition mode comprises a short-distance acquisition mode and a long-distance acquisition mode. In one embodiment, when the target distance is smaller than a preset threshold, a short-distance acquisition mode is adopted; and when the target distance is greater than a preset threshold value, adopting a remote acquisition mode. For example, when the target distance is less than 1 meter, a close-range acquisition mode is adopted; and when the target distance is more than 1 m, adopting a long-distance acquisition mode. After the target acquisition mode is determined, the voice acquisition equipment acquires voice of the user at the target acquisition position based on the target acquisition mode.
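The target-position selection and the distance-based choice between the short-distance and long-distance acquisition modes can be sketched as follows; the weighting coefficients are assumed values, and the 1 meter limit is the example threshold given above.

```python
import numpy as np

def choose_target_position(candidates, snr_weight=1.0, smr_weight=1.0):
    """candidates: {position: (snr, smr)} for the preset voice acquisition
    positions that already satisfy both thresholds. Returns the position with
    the maximum weighted result (one of the selection rules described above)."""
    return max(candidates,
               key=lambda p: snr_weight * candidates[p][0] + smr_weight * candidates[p][1])

def choose_acquisition_mode(target_position, device_position, near_limit_m=1.0):
    """Short-distance vs. long-distance acquisition mode from the target distance."""
    distance = float(np.linalg.norm(np.asarray(target_position, float)
                                    - np.asarray(device_position, float)))
    return "short_distance" if distance < near_limit_m else "long_distance"
```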
In one embodiment, the first prompting message may also be generated based on the target distance.
In one embodiment, when the voice acquisition device determines the target acquisition mode, the voice acquisition device may further obtain a target signal processing algorithm and a target parameter corresponding to the target acquisition mode, and perform data processing on the acquired voice through the target signal processing algorithm and the target parameter.
In one embodiment, as shown in fig. 5, step 210 further comprises:
step 502, when the acoustic environment evaluation result does not meet the preset condition, generating a second prompt message according to the acoustic environment evaluation result; the second prompt message is used for prompting the user to adjust the environment of the acquisition room;
step 504, after the user adjusts the environment of the collection room, returning to execute the acquisition of the environment image corresponding to the collection room; and detecting the environment image to obtain an environment object and a material type corresponding to the environment object.
Specifically, when the acoustic environment evaluation result does not meet the preset conditions, that is, when the signal-to-noise ratio of at least one preset voice acquisition position within the preset range centered on the voice acquisition device is lower than the first threshold or its signal-to-mixing ratio is lower than the second threshold, the acoustic environment of the acquisition room is poor and not suitable for voice acquisition. At this time, the voice acquisition equipment generates a second prompt message. As with the first prompt message, the voice acquisition equipment may display the second prompt message on the display screen or play a voice corresponding to the second prompt message through the speaker. After receiving the second prompt message, the user may adjust the environment of the acquisition room, for example by arranging sound-absorbing soft bags on the surrounding walls and the ceiling of the acquisition room and laying a carpet on the ground.
Further, after the user adjusts the environment of the acquisition room, returning to execute the acquisition of the environment image corresponding to the acquisition room; and detecting the environment image to obtain an environment object and a material type corresponding to the environment object, evaluating the acoustic environment of the acquisition room after the environment is adjusted, and acquiring voice by adopting a corresponding target acquisition strategy to improve the quality of acquired voice data.
In this embodiment, the acoustic environment evaluation result is detected and judged to obtain a corresponding target acquisition strategy, and voice acquisition is performed through the target acquisition strategy, so that the quality of acquired voice data is improved.
In one embodiment, as shown in fig. 6, acquiring an ambient noise signal of a collection chamber comprises:
step 602, collecting the sound of a collection room through a voice collection device to obtain a sound electric signal corresponding to the sound;
step 604, extracting an environmental noise electrical signal from the sound electrical signal by performing voice activation detection on the sound electrical signal;
and 606, acquiring the sensitivity of the voice acquisition device, and calculating the sensitivity and the environmental noise electric signal to obtain an environmental noise signal of the acquisition room.
Voice activity detection is a sound-signal processing method that separates speech from noise.
Specifically, a voice acquisition device is configured in the voice acquisition equipment, and the environmental noise signal of the acquisition room can be acquired through this voice acquisition device. However, when collecting the environmental noise, the voice acquisition equipment cannot know whether anyone is speaking, so it extracts the environmental noise from the collected sound through voice activity detection. In addition, owing to the characteristics of the voice acquisition device, what it produces when it picks up sound is the sound electrical signal corresponding to that sound, and devices with different sensitivities produce different sound electrical signals. Therefore, the voice acquisition equipment extracts the environmental noise electrical signal from the sound electrical signal through voice activity detection, acquires the sensitivity of the voice acquisition device, and performs electro-acoustic conversion on the environmental noise electrical signal using this sensitivity, converting it into the environmental noise signal and thereby obtaining the environmental noise signal of the acquisition room.
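The noise extraction and the electro-acoustic conversion can be sketched as follows. A simple energy threshold stands in for a proper voice activity detector here, and the frame length and energy ratio are assumed values; the sensitivity conversion divides the electrical signal by the microphone sensitivity as described above.

```python
import numpy as np

def extract_noise_electrical_signal(mic_signal_v, fs, frame_ms=20, ratio=0.5):
    """Keep the low-energy frames of the recorded electrical signal as the
    environmental-noise electrical signal (energy-based stand-in for VAD)."""
    frame_len = int(fs * frame_ms / 1000)
    n = len(mic_signal_v) // frame_len
    frames = np.asarray(mic_signal_v[: n * frame_len]).reshape(n, frame_len)
    energy = np.mean(np.square(frames), axis=1)
    keep = energy < ratio * np.median(energy)     # frames unlikely to contain speech
    return frames[keep].reshape(-1)

def electrical_to_acoustic(noise_v, sensitivity_v_per_pa):
    """Electro-acoustic conversion: electrical signal (volts) to sound
    pressure (pascals) using the sensitivity of the voice acquisition device."""
    return np.asarray(noise_v) / sensitivity_v_per_pa
```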
In the embodiment, the environmental noise electrical signal can be separated from the sound electrical signal by a voice activation detection technology, so that the accuracy of the environmental noise electrical signal is improved; and the environmental noise electrical signal is subjected to electro-acoustic conversion through the sensitivity of the voice acquisition device, so that the accuracy of the environmental noise signal can be further improved, the accuracy of acoustic environment assessment is improved, and the quality of acquired voice data is improved.
In one embodiment, as shown in fig. 7, another speech acquisition method based on acoustic environment evaluation is provided, which is illustrated by applying the method to the speech acquisition device in fig. 1, and includes the following steps:
step 702, acquiring an environment image corresponding to an acquisition room;
step 704, extracting the features of the environment image to obtain the shape features of the environment image; detecting an environment object from the environment image according to the shape feature;
step 706, extracting texture features of the environment object from the environment image; classifying and identifying the texture features to obtain a material type corresponding to the environment object;
step 708, obtaining sound absorption coefficients matched with the material types and distances of the environment objects relative to the voice acquisition device, and determining position coordinates of the voice acquisition device in the environment objects according to the distances;
step 710, calculating a sound reflection coefficient matched with the material type according to the sound absorption coefficient;
step 712, calculating attenuation signals of each preset voice acquisition position of the acquisition room according to the sound reflection coefficient and the position coordinates of the voice acquisition device;
step 714, obtaining the sound velocity, and calculating the transmission delay of each preset voice acquisition position of the acquisition room according to the position coordinates of the voice acquisition device and the sound velocity;
step 716, calculating the attenuation signals of the preset voice acquisition positions and the transmission delays of the preset voice acquisition positions based on a mirror image sound source method to obtain indoor impulse responses of the preset voice acquisition positions; the indoor impulse response carries a response time;
step 718, dividing the indoor impulse response of each preset voice acquisition position according to the response time to obtain a voice signal of each preset voice acquisition position and a reverberation signal of each preset voice acquisition position;
step 720, collecting the sound of the collection chamber through a voice collection device to obtain a sound electric signal corresponding to the sound;
step 722, extracting an environmental noise electrical signal from the sound electrical signal by performing voice activation detection on the sound electrical signal;
step 724, acquiring the sensitivity of the voice acquisition device, and calculating the sensitivity and the environmental noise electric signal to obtain an environmental noise signal of an acquisition room;
step 726, calculating the voice signal and the environmental noise signal of each preset voice collecting position to obtain the signal-to-noise ratio of each preset voice collecting position;
step 728, calculating the voice signal of each preset voice collecting position and the reverberation signal of each preset voice collecting position to obtain the signal-to-mixing ratio of each preset voice collecting position;
step 730, evaluating the acoustic environment of the acquisition room according to the signal-to-noise ratio of each preset voice acquisition position and the signal-to-mixing ratio of each preset voice acquisition position to obtain an acoustic environment evaluation result;
step 732, when the acoustic environment evaluation result does not meet the preset condition, entering step 734; when the acoustic environment evaluation result meets the preset condition, entering step 736;
step 734, generating a second prompt message according to the acoustic environment evaluation result; the second prompt message is used for prompting the user to adjust the environment of the acquisition room; after the user adjusts the environment of the collection room, return to step 702;
step 736, determining a target acquisition position corresponding to the acquisition room according to the acoustic environment evaluation result;
step 738, generating a first prompt message according to the target acquisition position; the first prompt message is used for prompting the user to move to a target acquisition position;
step 740, acquiring a target acquisition mode corresponding to the target acquisition position, and acquiring voice of the user at the target acquisition position based on the target acquisition mode.
In this embodiment, the acoustic environment of the acquisition room is evaluated automatically, the evaluation result is checked against the preset condition, a matching target acquisition strategy is obtained, and voice is acquired according to that strategy. This both reduces manual effort and improves the quality of the acquired voice data.
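Steps 710 through 718 can be sketched with the mirror image sound source method for a rectangular acquisition room, as below. The maximum reflection order, the 1 s response length, the 50 ms split between the early part (treated as the voice signal) and the tail (treated as the reverberation signal), and all function names are illustrative assumptions, not values taken from the patent.

```python
import numpy as np
from itertools import product

def room_impulse_response(room_dims, src, mic, absorption, fs=16000, c=343.0, max_order=10):
    """Simplified image-source sketch for a rectangular room.

    room_dims: (Lx, Ly, Lz) in metres; src, mic: 3-D coordinates in metres;
    absorption: average sound absorption coefficient of the detected materials."""
    beta = np.sqrt(max(1.0 - absorption, 0.0))       # sound reflection coefficient from absorption
    rir_len = fs                                      # keep 1 s of response for this sketch
    rir = np.zeros(rir_len)
    for nx, ny, nz in product(range(-max_order, max_order + 1), repeat=3):
        # Image source position for reflection indices (nx, ny, nz); even/odd folding per axis.
        img = np.empty(3)
        for axis, n in zip(range(3), (nx, ny, nz)):
            if n % 2 == 0:
                img[axis] = n * room_dims[axis] + src[axis]
            else:
                img[axis] = (n + 1) * room_dims[axis] - src[axis]
        dist = np.linalg.norm(img - np.asarray(mic, dtype=float))
        delay = int(round(fs * dist / c))             # transmission delay in samples
        if delay >= rir_len or dist < 1e-3:
            continue
        order = abs(nx) + abs(ny) + abs(nz)
        rir[delay] += beta ** order / (4.0 * np.pi * dist)   # attenuated image contribution
    return rir

def split_rir(rir, fs=16000, split_ms=50.0):
    """Divide the impulse response by response time: the early part is used as the
    voice (direct) component, the tail as the reverberation component."""
    k = int(fs * split_ms / 1000.0)
    return rir[:k], rir[k:]
```

Splitting the response at roughly 50 ms keeps the direct sound and early reflections in the voice-signal part and leaves late reverberation in the reverberation part, which is one common way of realising the division of the indoor impulse response by response time described above.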
It should be understood that although the steps in the flow charts of figs. 2-7 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly ordered and may be performed in other orders. Moreover, at least some of the steps in figs. 2-7 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with sub-steps or stages of other steps.
In one embodiment, as shown in fig. 8, there is provided a speech acquisition apparatus 800 based on acoustic environment assessment, comprising: an image detection module 801, an indoor impulse response calculation module 802, an acoustic environment assessment module 803, and a voice acquisition module 804, wherein:
the image detection module 801 is configured to acquire an environment image corresponding to the acquisition room, and detect the environment image to obtain an environment object and a material type corresponding to the environment object;
the indoor impulse response calculation module 802 is configured to obtain an acoustic absorption coefficient matched with the material type and a distance between the environment object and the voice acquisition device, determine a position coordinate of the voice acquisition device in the environment object according to the distance, and calculate an indoor impulse response of each preset voice acquisition position of the acquisition room according to the acoustic absorption coefficient and the position coordinate of the voice acquisition device; the indoor impulse response carries a response time; dividing the indoor impulse response of each preset voice acquisition position according to the response time to obtain a voice signal of each preset voice acquisition position and a reverberation signal of each preset voice acquisition position;
the acoustic environment evaluation module 803 is configured to acquire an environmental noise signal of the collection room, perform calculation according to the voice signal and the environmental noise signal of each preset voice collection position to obtain a signal-to-noise ratio of each preset voice collection position, perform calculation according to the voice signal of each preset voice collection position and the reverberation signal of each preset voice collection position to obtain a signal-to-mixing ratio of each preset voice collection position, and evaluate the acoustic environment of the collection room according to the signal-to-noise ratio and the signal-to-mixing ratio of each preset voice collection position to obtain an acoustic environment evaluation result;
and the voice acquisition module 804 is used for acquiring a target acquisition strategy matched with the acoustic environment evaluation result and acquiring voice according to the target acquisition strategy.
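As a rough illustration of the computations performed by the acoustic environment evaluation module 803 and the position selection carried out by the voice acquisition module 804, the sketch below derives the per-position signal-to-noise ratio and signal-to-mixing ratio from the split signal components and checks them against the preset thresholds. The threshold values, data layout, and function names are assumptions made for the example, not details taken from the patent.

```python
import numpy as np

def level_db(x):
    """Mean power of a signal segment, in dB (arbitrary reference)."""
    x = np.asarray(x, dtype=float)
    return 10.0 * np.log10(np.mean(x ** 2) + 1e-12)

def evaluate_positions(position_signals, noise_signal,
                       snr_threshold_db=20.0, srr_threshold_db=10.0):
    """position_signals maps each preset voice acquisition position to a
    (voice_signal, reverberation_signal) pair obtained by splitting its indoor
    impulse response; noise_signal is the environmental noise estimate.

    Returns the per-position (SNR, signal-to-mixing ratio) pairs and a target
    position if any position satisfies the preset condition."""
    noise_db = level_db(noise_signal)
    ratios = {}
    for pos, (voice, reverb) in position_signals.items():
        snr = level_db(voice) - noise_db              # signal-to-noise ratio
        srr = level_db(voice) - level_db(reverb)      # signal-to-mixing ratio
        ratios[pos] = (snr, srr)

    acceptable = [p for p, (snr, srr) in ratios.items()
                  if snr > snr_threshold_db and srr > srr_threshold_db]
    # If nothing passes, the device would prompt the user to adjust the room
    # (second prompt message); otherwise the best-scoring position becomes the
    # target acquisition position (first prompt message).
    target = max(acceptable, key=lambda p: sum(ratios[p])) if acceptable else None
    return ratios, target
```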
In one embodiment, the image detection module 801 is further configured to perform feature extraction on the environment image to obtain shape features of the environment image; detecting an environment object from the environment image according to the shape feature; extracting texture features of the environment object from the environment image; and classifying and identifying the texture features to obtain the material type corresponding to the environment object.
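To make the two-stage detection concrete, the following sketch is a deliberately simplified stand-in: a tiny hand-rolled texture descriptor and a nearest-centroid lookup in place of the trained classifiers the text describes. The feature set, the material table, and all names here are illustrative assumptions rather than the patent's actual models.

```python
import numpy as np

def texture_features(gray_patch):
    """Very small texture descriptor: mean, standard deviation, and mean
    absolute horizontal/vertical gradients of a grayscale object patch."""
    g = np.asarray(gray_patch, dtype=float)
    gx = np.abs(np.diff(g, axis=1)).mean()
    gy = np.abs(np.diff(g, axis=0)).mean()
    return np.array([g.mean(), g.std(), gx, gy])

# Hypothetical per-material centroids in the same feature space, e.g. learned offline.
MATERIAL_CENTROIDS = {
    "curtain":  np.array([120.0, 18.0, 2.0, 2.5]),
    "glass":    np.array([200.0,  8.0, 1.0, 1.0]),
    "concrete": np.array([140.0, 30.0, 6.0, 6.0]),
}

def classify_material(gray_patch):
    """Nearest-centroid stand-in for the texture classifier."""
    f = texture_features(gray_patch)
    return min(MATERIAL_CENTROIDS, key=lambda m: np.linalg.norm(f - MATERIAL_CENTROIDS[m]))
```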
In one embodiment, the indoor impulse response calculation module 802 is further configured to calculate a sound reflection coefficient matching the material type according to the sound absorption coefficient; calculate attenuation signals of each preset voice acquisition position of the acquisition room according to the sound reflection coefficient and the position coordinates of the voice acquisition device; acquire the sound velocity, and calculate the transmission delay of each preset voice acquisition position of the acquisition room according to the position coordinates of the voice acquisition device and the sound velocity; and calculate the attenuation signals and the transmission delays of the preset voice acquisition positions based on a mirror image sound source method to obtain the indoor impulse response of each preset voice acquisition position.
In one embodiment, the voice collecting module 804 is further configured to determine, when the acoustic environment evaluation result satisfies a preset condition, a target collecting position corresponding to the collecting chamber according to the acoustic environment evaluation result; generating a first prompt message according to the target acquisition position; the first prompt message is used for prompting the user to move to a target acquisition position; and acquiring a target acquisition mode corresponding to the target acquisition position, and acquiring voice of the user at the target acquisition position based on the target acquisition mode.
In one embodiment, the voice collecting module 804 is further configured to generate a second prompt message according to the acoustic environment evaluation result when the acoustic environment evaluation result does not satisfy the preset condition; the second prompt message is used for prompting the user to adjust the environment of the acquisition room; after the user adjusts the environment of the acquisition room, returning to execute the acquisition of the environment image corresponding to the acquisition room; and detecting the environment image to obtain an environment object and a material type corresponding to the environment object.
In one embodiment, the preset condition is that the signal-to-noise ratio of each preset voice collection position in a preset range with the voice collection device as the center is higher than a first threshold value and the signal-to-mixing ratio is higher than a second threshold value.
In one embodiment, the acoustic environment evaluation module 803 is further configured to collect the sound of the collection room by a voice collection device, and obtain a sound electrical signal corresponding to the sound; carrying out voice activation detection on the sound electric signal, and extracting an environmental noise electric signal from the sound electric signal; and acquiring the sensitivity of the voice acquisition device, and calculating the sensitivity and the environmental noise electric signal to obtain the environmental noise signal of the acquisition room.
For specific limitations of the speech acquisition apparatus based on acoustic environment assessment, reference may be made to the above limitations of the speech acquisition method based on acoustic environment assessment, which are not repeated here. Each module in the above speech acquisition apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a voice acquisition device whose internal structure may be as shown in fig. 9. The computer device comprises a processor, a memory, a communication interface, a display screen, an input device, a sensor, a voice acquisition device and a loudspeaker connected through a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The communication interface of the computer device is used for wired or wireless communication with an external terminal, and the wireless communication may be implemented through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program, when executed by the processor, implements a speech acquisition method based on acoustic environment evaluation. The display screen of the computer device may be a liquid crystal display or an electronic ink display. The input device of the computer device may be a touch layer covering the display screen, a key, trackball or touchpad arranged on the housing of the computer device, or an external keyboard, touchpad or mouse. The sensor of the computer device may be a visual sensor, an infrared sensor, a sound wave sensor, a video camera, a depth camera, or the like. The voice acquisition device of the computer device may be composed of one microphone or a plurality of microphones.
Those skilled in the art will appreciate that the architecture shown in fig. 9 is merely a block diagram of a portion of the structure relevant to the present disclosure and does not limit the computer devices to which the present disclosure applies; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program: acquiring an environment image corresponding to an acquisition room, and detecting the environment image to obtain an environment object and a material type corresponding to the environment object; acquiring sound absorption coefficients matched with the material types and the distance between the environment object and the voice acquisition device, and determining the position coordinate of the voice acquisition device in the environment object according to the distance; calculating according to the sound absorption coefficient and the position coordinates of the voice acquisition device to obtain indoor impulse responses of all preset voice acquisition positions of the acquisition room; the indoor impulse response carries a response time; dividing the indoor impulse response of each preset voice acquisition position according to the response time to obtain a voice signal of each preset voice acquisition position and a reverberation signal of each preset voice acquisition position; acquiring an environmental noise signal of an acquisition room, calculating according to the voice signal and the environmental noise signal of each preset voice acquisition position to obtain the signal-to-noise ratio of each preset voice acquisition position, calculating according to the voice signal of each preset voice acquisition position and the reverberation signal of each preset voice acquisition position to obtain the signal-to-mixing ratio of each preset voice acquisition position, and evaluating the acoustic environment of the acquisition room according to the signal-to-noise ratio and the signal-to-mixing ratio of each preset voice acquisition position to obtain an acoustic environment evaluation result; and acquiring a target acquisition strategy matched with the acoustic environment evaluation result, and acquiring voice according to the target acquisition strategy.
In one embodiment, the processor, when executing the computer program, further performs the steps of: extracting the characteristics of the environment image to obtain the shape characteristics of the environment image; detecting an environment object from the environment image according to the shape feature; extracting texture features of the environment object from the environment image; and classifying and identifying the texture features to obtain the material type corresponding to the environment object.
In one embodiment, the processor, when executing the computer program, further performs the steps of: calculating a sound reflection coefficient matched with the material type according to the sound absorption coefficient; calculating attenuation signals of each preset voice acquisition position of the acquisition room according to the sound reflection coefficient and the position coordinates of the voice acquisition device; acquiring the sound velocity, and calculating the transmission delay of each preset voice acquisition position of the acquisition room according to the position coordinates of the voice acquisition device and the sound velocity; and calculating the attenuation signals and the transmission delays of the preset voice acquisition positions based on a mirror image sound source method to obtain the indoor impulse response of each preset voice acquisition position.
In one embodiment, the processor, when executing the computer program, further performs the steps of: when the acoustic environment evaluation result meets a preset condition, determining a target acquisition position corresponding to the acquisition room according to the acoustic environment evaluation result; generating a first prompt message according to the target acquisition position; the first prompt message is used for prompting the user to move to a target acquisition position; and acquiring a target acquisition mode corresponding to the target acquisition position, and acquiring voice of the user at the target acquisition position based on the target acquisition mode.
In one embodiment, the processor, when executing the computer program, further performs the steps of: when the acoustic environment evaluation result does not meet the preset condition, generating a second prompt message according to the acoustic environment evaluation result; the second prompt message is used for prompting the user to adjust the environment of the acquisition room; after the user adjusts the environment of the acquisition room, returning to execute the acquisition of the environment image corresponding to the acquisition room; and detecting the environment image to obtain an environment object and a material type corresponding to the environment object.
In one embodiment, the preset condition is that the signal-to-noise ratio of each preset voice collection position in a preset range with the voice collection device as the center is higher than a first threshold value and the signal-to-mixing ratio is higher than a second threshold value.
In one embodiment, the processor, when executing the computer program, further performs the steps of: collecting the sound of a collection chamber through a voice collection device to obtain a sound electric signal corresponding to the sound; carrying out voice activation detection on the sound electric signal, and extracting an environmental noise electric signal from the sound electric signal; and acquiring the sensitivity of the voice acquisition device, and calculating the sensitivity and the environmental noise electric signal to obtain the environmental noise signal of the acquisition room.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of: acquiring an environment image corresponding to an acquisition room; detecting the environment image to obtain an environment object and a material type corresponding to the environment object; acquiring sound absorption coefficients matched with material types and the distance between an environment object and a voice acquisition device, determining position coordinates of the voice acquisition device in the environment object according to the distance, and calculating indoor impulse responses of all preset voice acquisition positions of an acquisition room according to the sound absorption coefficients and the position coordinates of the voice acquisition device; the indoor impulse response carries a response time; dividing the indoor impulse response of each preset voice acquisition position according to the response time to obtain a voice signal of each preset voice acquisition position and a reverberation signal of each preset voice acquisition position; acquiring an environmental noise signal of an acquisition room, calculating according to the voice signal and the environmental noise signal of each preset voice acquisition position to obtain the signal-to-noise ratio of each preset voice acquisition position, calculating according to the voice signal of each preset voice acquisition position and the reverberation signal of each preset voice acquisition position to obtain the signal-to-mixing ratio of each preset voice acquisition position, and evaluating the acoustic environment of the acquisition room according to the signal-to-noise ratio and the signal-to-mixing ratio of each preset voice acquisition position to obtain an acoustic environment evaluation result; and acquiring a target acquisition strategy matched with the acoustic environment evaluation result, and acquiring voice according to the target acquisition strategy.
In one embodiment, the computer program when executed by the processor further performs the steps of: extracting the characteristics of the environment image to obtain the shape characteristics of the environment image; detecting an environment object from the environment image according to the shape feature; extracting texture features of the environment object from the environment image; and classifying and identifying the texture features to obtain the material type corresponding to the environment object.
In one embodiment, the computer program when executed by the processor further performs the steps of: calculating a sound reflection coefficient matched with the material type according to the sound absorption coefficient; calculating attenuation signals of each preset voice acquisition position of the acquisition room according to the sound reflection coefficient and the position coordinates of the voice acquisition device; acquiring the sound velocity, and calculating the transmission delay of each preset voice acquisition position of the acquisition room according to the position coordinates of the voice acquisition device and the sound velocity; and calculating the attenuation signals and the transmission delays of the preset voice acquisition positions based on a mirror image sound source method to obtain the indoor impulse response of each preset voice acquisition position.
In one embodiment, the computer program when executed by the processor further performs the steps of: when the acoustic environment evaluation result meets a preset condition, determining a target acquisition position corresponding to the acquisition room according to the acoustic environment evaluation result; generating a first prompt message according to the target acquisition position; the first prompt message is used for prompting the user to move to a target acquisition position; and acquiring a target acquisition mode corresponding to the target acquisition position, and acquiring voice of the user at the target acquisition position based on the target acquisition mode.
In one embodiment, the computer program when executed by the processor further performs the steps of: when the acoustic environment evaluation result does not meet the preset condition, generating a second prompt message according to the acoustic environment evaluation result; the second prompt message is used for prompting the user to adjust the environment of the acquisition room; after the user adjusts the environment of the acquisition room, returning to execute the acquisition of the environment image corresponding to the acquisition room; and detecting the environment image to obtain an environment object and a material type corresponding to the environment object.
In one embodiment, the preset condition is that the signal-to-noise ratio of each preset voice collection position in a preset range with the voice collection device as the center is higher than a first threshold value and the signal-to-mixing ratio is higher than a second threshold value.
In one embodiment, the computer program when executed by the processor further performs the steps of: collecting the sound of a collection chamber through a voice collection device to obtain a sound electric signal corresponding to the sound; carrying out voice activation detection on the sound electric signal, and extracting an environmental noise electric signal from the sound electric signal; and acquiring the sensitivity of the voice acquisition device, and calculating the sensitivity and the environmental noise electric signal to obtain the environmental noise signal of the acquisition room.
It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination of these technical features should be considered within the scope of this specification as long as it contains no contradiction.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A speech acquisition method based on acoustic environment assessment, the method comprising:
acquiring an environment image corresponding to an acquisition room, and detecting the environment image to obtain an environment object and a material type corresponding to the environment object;
determining the distance of the environment object relative to a voice acquisition device according to the spatial relationship characteristics of the environment image, determining the position coordinate of the voice acquisition device in the environment object according to the distance, acquiring the sound absorption coefficient matched with the material type, and calculating the indoor impulse response of each preset voice acquisition position of the acquisition room according to the sound absorption coefficient and the position coordinate of the voice acquisition device; the indoor impulse response carries a response time;
dividing the indoor impulse response of each preset voice acquisition position according to the response time to obtain a voice signal of each preset voice acquisition position and a reverberation signal of each preset voice acquisition position;
acquiring an environmental noise signal of the acquisition room, calculating according to the voice signal of each preset voice acquisition position and the environmental noise signal to obtain a signal-to-noise ratio of each preset voice acquisition position, calculating according to the voice signal of each preset voice acquisition position and a reverberation signal of each preset voice acquisition position to obtain a signal-to-mixing ratio of each preset voice acquisition position, and evaluating the acoustic environment of the acquisition room according to the signal-to-noise ratio and the signal-to-mixing ratio of each preset voice acquisition position to obtain an acoustic environment evaluation result;
acquiring a target acquisition strategy matched with the acoustic environment evaluation result, and acquiring voice according to the target acquisition strategy, wherein the method comprises the following steps: when the acoustic environment evaluation result meets a preset condition, determining a target acquisition position corresponding to the acquisition room according to the acoustic environment evaluation result; generating a first prompt message according to the target acquisition position; the first prompt message is used for prompting a user to move to the target acquisition position; acquiring a target acquisition mode corresponding to the target acquisition position, and acquiring voice of a user at the target acquisition position based on the target acquisition mode; the preset condition is that the signal-to-noise ratio of each preset voice acquisition position in a preset range with the voice acquisition device as the center is higher than a first threshold value and the signal-to-mixing ratio is higher than a second threshold value.
2. The method according to claim 1, wherein the detecting the environment image to obtain an environment object and a material type corresponding to the environment object includes:
performing feature extraction on the environment image to obtain shape features of the environment image;
detecting an environment object from the environment image according to the shape feature;
extracting texture features of the environment object from the environment image;
and classifying and identifying the texture features to obtain the material type corresponding to the environment object.
3. The method of claim 2, wherein detecting an environmental object from the environmental image according to the shape feature comprises:
inputting the shape features into a pre-trained first classifier;
and classifying and identifying the shape features through the first classifier to obtain the environment object.
4. The method according to claim 2, wherein the classifying and identifying the texture features to obtain a material type corresponding to the environment object comprises:
inputting the texture features into a pre-trained second classifier;
and classifying and identifying the texture features through the second classifier to obtain the material type corresponding to the environment object.
5. The method of claim 1, wherein the calculating indoor impulse responses of the preset voice capture positions of the capture room from the sound absorption coefficients and the position coordinates of the voice capture devices comprises:
calculating a sound reflection coefficient matched with the material type according to the sound absorption coefficient;
calculating to obtain attenuation signals of all preset voice acquisition positions of the acquisition room according to the sound reflection coefficient and the position coordinates of the voice acquisition device;
acquiring sound velocity, and calculating to obtain transmission delay of each preset voice acquisition position of the acquisition chamber according to the position coordinates of the voice acquisition device and the sound velocity;
and calculating the attenuation signals of the preset voice acquisition positions and the transmission delay of the preset voice acquisition positions based on a mirror image sound source method to obtain the indoor impulse response of the preset voice acquisition positions.
6. The method of claim 1, further comprising:
when the acoustic environment evaluation result does not meet the preset condition, generating a second prompt message according to the acoustic environment evaluation result; the second prompt message is used for prompting a user to adjust the environment of the collection room;
after the user adjusts the environment of the acquisition room, returning to execute the acquisition of the environment image corresponding to the acquisition room; and detecting the environment image to obtain an environment object and a material type corresponding to the environment object.
7. The method of claim 1, wherein said acquiring an ambient noise signal of said collection chamber comprises:
collecting the sound of the collection chamber through a voice collection device to obtain a sound electric signal corresponding to the sound;
extracting an environmental noise electrical signal from the sound electrical signal by performing voice activation detection on the sound electrical signal;
and acquiring the sensitivity of the voice acquisition device, and calculating the sensitivity and the environmental noise electric signal to obtain an environmental noise signal of the acquisition room.
8. A speech acquisition device based on acoustic environment assessment, the device comprising:
the image detection module is used for acquiring an environment image corresponding to the acquisition room, and detecting the environment image to obtain an environment object and a material type corresponding to the environment object;
the indoor impulse response calculation module is used for determining the distance between the environment object and the voice acquisition device according to the spatial relation characteristics of the environment image, determining the position coordinates of the voice acquisition device in the environment object according to the distance, acquiring the sound absorption coefficient matched with the material type, and calculating the indoor impulse response of each preset voice acquisition position of the acquisition room according to the sound absorption coefficient and the position coordinates of the voice acquisition device; the indoor impulse response carries a response time; dividing the indoor impulse response of each preset voice acquisition position according to the response time to obtain a voice signal of each preset voice acquisition position and a reverberation signal of each preset voice acquisition position;
the acoustic environment evaluation module is used for acquiring an environmental noise signal of the acquisition room, calculating according to the voice signal of each preset voice acquisition position and the environmental noise signal to obtain a signal-to-noise ratio of each preset voice acquisition position, calculating according to the voice signal of each preset voice acquisition position and a reverberation signal of each preset voice acquisition position to obtain a signal-to-mixing ratio of each preset voice acquisition position, and evaluating the acoustic environment of the acquisition room according to the signal-to-noise ratio and the signal-to-mixing ratio of each preset voice acquisition position to obtain an acoustic environment evaluation result;
the voice acquisition module is used for acquiring a target acquisition strategy matched with the acoustic environment evaluation result and acquiring voice according to the target acquisition strategy, and comprises: when the acoustic environment evaluation result meets a preset condition, determining a target acquisition position corresponding to the acquisition room according to the acoustic environment evaluation result; generating a first prompt message according to the target acquisition position; the first prompt message is used for prompting a user to move to the target acquisition position; acquiring a target acquisition mode corresponding to the target acquisition position, and acquiring voice of a user at the target acquisition position based on the target acquisition mode; the preset condition is that the signal-to-noise ratio of each preset voice acquisition position in a preset range with the voice acquisition device as the center is higher than a first threshold value and the signal-to-mixing ratio is higher than a second threshold value.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202010913557.XA 2020-09-03 2020-09-03 Voice acquisition method, device, equipment and medium based on acoustic environment evaluation Active CN111766303B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010913557.XA CN111766303B (en) 2020-09-03 2020-09-03 Voice acquisition method, device, equipment and medium based on acoustic environment evaluation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010913557.XA CN111766303B (en) 2020-09-03 2020-09-03 Voice acquisition method, device, equipment and medium based on acoustic environment evaluation

Publications (2)

Publication Number Publication Date
CN111766303A CN111766303A (en) 2020-10-13
CN111766303B true CN111766303B (en) 2020-12-11

Family

ID=72729246

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010913557.XA Active CN111766303B (en) 2020-09-03 2020-09-03 Voice acquisition method, device, equipment and medium based on acoustic environment evaluation

Country Status (1)

Country Link
CN (1) CN111766303B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116489572A * 2022-01-14 2023-07-25 Huawei Technologies Co., Ltd. Electronic equipment control method and device and electronic equipment
CN115273795A * 2022-06-22 2022-11-01 Tencent Technology (Shenzhen) Co., Ltd. Method and device for generating analog impulse response and computer equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104568749A * 2013-10-25 2015-04-29 *** Communications Group Co. Objective surface material identification method, device and identification equipment and system
CN106537501A * 2014-10-22 2017-03-22 Google Inc. Reverberation estimator

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Lidia Alvarez-Morales et al.; "A methodology for the study of the acoustic environment of Catholic cathedrals: Application to the Cathedral of Malaga"; Building and Environment; 2014-02-28; Vol. 72; pp. 102-115 *
Arianna Astolfi et al.; "Subjective and objective assessment of acoustical and overall environmental quality in secondary school classrooms"; J. Acoust. Soc. Am.; 2008-01-31; Vol. 123, No. 1; pp. 163-164 *

Also Published As

Publication number Publication date
CN111766303A (en) 2020-10-13

Similar Documents

Publication Publication Date Title
US11398235B2 (en) Methods, apparatuses, systems, devices, and computer-readable storage media for processing speech signals based on horizontal and pitch angles and distance of a sound source relative to a microphone array
JP6030184B2 (en) Touchless sensing and gesture recognition using continuous wave ultrasound signals
Shih et al. Occupancy estimation using ultrasonic chirps
JP6042858B2 (en) Multi-sensor sound source localization
CN110933558B (en) Directional sounding method and device, ultrasonic transducer and electronic equipment
US20200245089A1 (en) An audio communication system and method
CN111766303B (en) Voice acquisition method, device, equipment and medium based on acoustic environment evaluation
US20240087587A1 (en) Wearable system speech processing
WO2013157254A1 (en) Sound detecting apparatus, sound detecting method, sound feature value detecting apparatus, sound feature value detecting method, sound section detecting apparatus, sound section detecting method, and program
US10582117B1 (en) Automatic camera control in a video conference system
US10753906B2 (en) System and method using sound signal for material and texture identification for augmented reality
CN109284081B (en) Audio output method and device and audio equipment
CN111124108B (en) Model training method, gesture control method, device, medium and electronic equipment
CN111930336A (en) Volume adjusting method and device of audio device and storage medium
WO2019239667A1 (en) Sound-collecting device, sound-collecting method, and program
US20230333205A1 (en) Sound source positioning method and apparatus
Zhou et al. Multi-modal face authentication using deep visual and acoustic features
US20160073208A1 (en) Acoustic Characterization Based on Sensor Profiling
CN110572600A (en) video processing method and electronic equipment
CN107543569B (en) Space disturbance detection method and device based on frequency modulation sound waves
CN110610706A (en) Sound signal acquisition method and device, electrical equipment control method and electrical equipment
CN105208283A (en) Soundsnap method and device
McKerrow et al. Classifying still faces with ultrasonic sensing
Wang et al. Real-time automated video and audio capture with multiple cameras and microphones
Tanigawa et al. Invisible-to-Visible: Privacy-Aware Human Segmentation using Airborne Ultrasound via Collaborative Learning Probabilistic U-Net

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant