CN111489740A - Voice processing method and device and elevator control method and device - Google Patents

Voice processing method and device and elevator control method and device Download PDF

Info

Publication number
CN111489740A
CN111489740A
Authority
CN
China
Prior art keywords
voice
feature
speech
characteristic
elevator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010325555.9A
Other languages
Chinese (zh)
Inventor
许孝先
冯大航
陈孝良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Shengzhi Wulian Technology Co., Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd filed Critical Beijing SoundAI Technology Co Ltd
Priority to CN202010325555.9A priority Critical patent/CN111489740A/en
Publication of CN111489740A publication Critical patent/CN111489740A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention provides a voice processing method and device and an elevator control method and device, wherein the voice processing method comprises the following steps: extracting a first voice feature of the voice to be processed; carrying out voice amplitude characteristic separation processing on the first voice characteristic to obtain a second voice characteristic; and acquiring a processing result of the voice to be processed based on the second voice characteristic. The embodiment of the invention can improve the performance of the network model in the voice processing process.

Description

Voice processing method and device and elevator control method and device
Technical Field
The invention relates to the technical field of natural language processing, in particular to a voice processing method and device and an elevator control method and device.
Background
Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.
During speech processing, the same speech content spoken at different volumes yields different speech amplitudes, so the extracted speech features also differ considerably.
Disclosure of Invention
The embodiment of the invention provides a voice processing method and device and an elevator control method and device, aiming to solve the prior-art problem that differences in volume produce different voice amplitudes, which in turn produce large differences in voice features and degrade the performance of a network model.
In order to solve the technical problem, the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides a speech processing method, where the method includes:
extracting a first voice feature of the voice to be processed;
carrying out voice amplitude characteristic separation processing on the first voice characteristic to obtain a second voice characteristic;
and acquiring a processing result of the voice to be processed based on the second voice characteristic.
In a second aspect, an embodiment of the present invention provides an elevator control method, including:
receiving target voice input by a user in a scene of using an elevator;
the voice processing method of the embodiment of the invention is adopted to carry out off-line intention recognition on the target voice to obtain first control information;
and controlling the elevator to execute a first operation corresponding to the first control information.
In a third aspect, an embodiment of the present invention provides a speech processing apparatus, where the speech processing apparatus includes:
the extraction module is used for extracting a first voice feature of the voice to be processed;
the separation module is used for carrying out voice amplitude characteristic separation processing on the first voice characteristic to obtain a second voice characteristic;
and the acquisition module is used for acquiring the processing result of the voice to be processed based on the second voice characteristic.
In a fourth aspect, an embodiment of the present invention provides an elevator control apparatus, including:
the first receiving module is used for receiving target voice input by a user in an elevator using scene;
the recognition module is used for performing offline intention recognition on the target voice by adopting the voice processing method of the embodiment of the invention to obtain first control information;
and the first control module is used for controlling the elevator to execute the first operation corresponding to the first control information.
In a fifth aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor and a program stored on the memory and executable on the processor, which program when executed by the processor performs the steps in the speech processing method according to the first aspect or which program when executed by the processor performs the steps in the elevator control method according to the second aspect.
In a sixth aspect, the embodiments of the present invention provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the voice processing method according to the first aspect, or the computer program, when executed by a processor, implements the steps in the elevator control method according to the second aspect.
In the embodiment of the invention, a first voice feature of a voice to be processed is extracted; voice amplitude feature separation processing is carried out on the first voice feature to obtain a second voice feature; and a processing result of the voice to be processed is acquired based on the second voice feature. For voice signals with the same voice content but different volumes, the difference manifests only as an amplification factor on the voice amplitude; performing voice amplitude feature separation processing on the first voice feature therefore reduces the differences in voice features caused by different volumes, so the performance of a network model can be improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive exercise.
FIG. 1 is a flow chart of a method for processing speech according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating network model learning according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present invention;
fig. 4 is one of schematic structural diagrams of an elevator control apparatus according to an embodiment of the present invention;
fig. 5 is a second schematic structural diagram of an elevator control apparatus according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart of a speech processing method according to an embodiment of the present invention, as shown in fig. 1, including the following steps:
step 101, extracting a first voice feature of a voice to be processed.
Wherein the first speech feature may comprise a plurality of first feature values. The plurality of first feature values may be obtained based on a logarithm operation, and may be obtained by taking a logarithm based on a constant e, or may also be obtained by taking a logarithm based on another number, which is not limited in the embodiment of the present invention. The first speech feature may be a filter banks feature.
And 102, carrying out voice amplitude characteristic separation processing on the first voice characteristic to obtain a second voice characteristic.
Wherein the second speech feature may include a plurality of second feature values. The performing speech amplitude feature separation processing on the first speech feature to obtain a second speech feature may include: and performing feature averaging processing on the plurality of first feature values to obtain a voice amplitude feature value, wherein the voice amplitude feature value is used for representing the voice amplitude feature of the voice to be processed, and performing voice amplitude feature separation processing on each first feature value respectively on the basis of the voice amplitude feature value to obtain second feature values corresponding to each first feature value in the second voice feature respectively.
And 103, acquiring a processing result of the voice to be processed based on the second voice characteristic.
The processing result of the to-be-processed speech is obtained based on the second speech feature, which may be that speech recognition is performed based on the second speech feature to obtain a speech recognition result; or, performing voice translation based on the second voice feature to obtain a voice translation result; or, the second speech feature may be used in other usage scenarios to obtain a processing result of the speech to be processed, which is not limited in the embodiment of the present invention. The second voice characteristic can be input into the network model for training in the process of training the network model; alternatively, the second speech feature may be input to the network model for prediction during prediction using the network model.
In practical applications, suppose for example that a first voice and a second voice are each subjected to voice processing, where the second voice has the same voice content as the first voice but is obtained by amplifying the volume of the first voice by n times, and the first voice features of both voices are filter banks features. The plurality of first feature values of the first voice may be (a_1, a_2, a_3, …, a_i). Because the second voice is the first voice with its volume amplified n times, and the filter banks feature is obtained through a logarithm operation, the plurality of first feature values of the second voice are (a_1 + ln(n), a_2 + ln(n), a_3 + ln(n), …, a_i + ln(n)). The speech amplitude feature value may be the average of the plurality of first feature values: for the first voice this average is a_avg = (a_1 + a_2 + a_3 + … + a_i) / i, and for the second voice it is a_avg + ln(n). The difference between each first feature value and the corresponding average may then be calculated to obtain the second feature values: for the first voice they are (a_1 - a_avg, a_2 - a_avg, a_3 - a_avg, …, a_i - a_avg), and for the second voice they are the identical vector (a_1 - a_avg, a_2 - a_avg, a_3 - a_avg, …, a_i - a_avg).
In practical application, a one-dimensional feature value can be appended to the improved filter banks feature to characterize the speech amplitude feature of the voice to be processed; since the speech amplitude feature represents the volume, the feature values of the other dimensions become independent of the volume. For example, (a_1 - a_avg, a_2 - a_avg, a_3 - a_avg, …, a_i - a_avg, a_avg) may be taken as the improved filter banks feature of the first voice and (a_1 - a_avg, a_2 - a_avg, a_3 - a_avg, …, a_i - a_avg, a_avg + ln(n)) as the improved filter banks feature of the second voice. Alternatively, (a_1 - a_avg, a_2 - a_avg, a_3 - a_avg, …, a_i - a_avg) may be used as the improved filter banks feature of both the first voice and the second voice. For the same voice content, when the voice amplitude changes, the first i dimensions of the improved filter banks feature are fixed values that do not change; only the last dimension, which represents the average, changes. The first i dimensions are thus pure content features with no relation to the amplitude, which facilitates the learning of a network model. Using the same network model, the improved filter banks feature may improve performance during speech processing by 3% to 10% over the plain filter banks feature.
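As a concrete illustration, the separation can be sketched in a few lines of Python; the function name and the keep_amplitude switch below are illustrative, not taken from the patent. The mean-subtracted dimensions come out identical for the two voices, and only the appended amplitude value differs by ln(n):

    import numpy as np

    def separate_amplitude_feature(fbank, keep_amplitude=True):
        # Subtract the per-vector mean (the speech amplitude feature value)
        # from every dimension; optionally append the mean itself as an
        # extra amplitude dimension, as in the improved feature above.
        fbank = np.asarray(fbank, dtype=np.float64)
        a_avg = fbank.mean(axis=-1, keepdims=True)  # speech amplitude feature value
        separated = fbank - a_avg                   # volume-independent dimensions
        if keep_amplitude:
            return np.concatenate([separated, a_avg], axis=-1)
        return separated

    # Two utterances with identical content, the second amplified n times:
    n = 4.0
    first = np.array([1.2, 0.7, 2.1, 1.5])          # toy Fbank values
    second = first + np.log(n)                      # log-domain effect of the gain
    print(separate_amplitude_feature(first))        # content dims, then a_avg
    print(separate_amplitude_feature(second))       # same content dims, a_avg + ln(n)

Running the snippet shows the first i values matching for both voices, while only the last, appended entries differ by ln(4).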
It should be noted that, as long as the volume of the speech is within a certain range (for example, above a preset value), the same speech content should produce the same speech recognition result. Because the volume shifts every dimension of the feature vector of the speech feature in the same way, it reduces the efficiency of network model learning during speech processing. Taking the filter banks feature as an example, it can be obtained by sequentially performing a Fourier transform, Mel filtering and a logarithm operation on the voice to be processed. If the volume of the voice is amplified by n times, n being a positive integer, the amplitude of the voice is amplified by n times, and the spectrum obtained by the Fourier transform is correspondingly amplified by n times; Mel filtering preserves the n-fold amplification; and after the logarithm operation, each dimension of the filter banks feature of the amplified voice is larger by ln(n) than that of the unamplified voice. As shown in fig. 2, when the network model is given the filter banks features of several voices that all carry the same content a but differ in volume, the representation of a is translated along a straight line according to the amplitude, yet speech processing such as speech recognition should still produce the result a. Because the volumes differ, every dimension of the filter banks feature is translated in the same way; the voice amplitude affects each dimension identically, which increases the burden on the network model.
In the embodiment of the invention, a first voice feature of a voice to be processed is extracted; voice amplitude feature separation processing is carried out on the first voice feature to obtain a second voice feature; and a processing result of the voice to be processed is acquired based on the second voice feature. For voice signals with the same voice content but different volumes, the difference manifests only as an amplification factor on the voice amplitude; performing voice amplitude feature separation processing on the first voice feature therefore reduces the differences in voice features caused by different volumes, so the performance of the network model can be improved.
Optionally, the first speech feature includes a plurality of first feature values, and the second speech feature includes a plurality of second feature values;
the performing speech amplitude feature separation processing on the first speech feature to obtain a second speech feature includes:
carrying out feature average processing on the plurality of first feature values to obtain a voice amplitude feature value, wherein the voice amplitude feature value is used for representing the voice amplitude feature of the voice to be processed;
and respectively carrying out voice amplitude characteristic separation processing on each first characteristic value based on the voice amplitude characteristic value to obtain second characteristic values corresponding to each first characteristic value in the second voice characteristic.
The plurality of first feature values may be obtained based on a logarithm operation, and may be obtained by taking a logarithm based on a constant e, or may also be obtained by taking a logarithm based on other numbers, which is not limited in the embodiment of the present invention.
In addition, the feature averaging processing on the plurality of first feature values may be performed by calculating an average value of the plurality of first feature values based on the plurality of first feature values, and the speech amplitude feature value may be an average value of the plurality of first feature values. The performing, based on the speech amplitude feature value, speech amplitude feature separation processing on each first feature value respectively to obtain the second feature value in the second speech feature, where the second feature value corresponds to each first feature value respectively, may be to obtain, based on an average value of each first feature value in the plurality of first feature values and the plurality of first feature values, the second feature value in the second speech feature, which corresponds to each first feature value respectively.
In addition, the obtaining the second feature value corresponding to each of the first feature values in the second speech feature based on each of the first feature values and an average value of the first feature values may include calculating a difference between each of the first feature values and the average value to obtain a second feature value corresponding to each of the first feature values; or, the method may include calculating a difference between each of the plurality of first feature values and the average value, and multiplying the difference by a first preset value to obtain a second feature value corresponding to each of the first feature values; alternatively, the method may include calculating a difference between each of the plurality of first feature values and the average value, and subtracting a second preset value from the difference to obtain a second feature value corresponding to each of the first feature values, and so on, which is not limited in the embodiment of the present invention.
Preferably, the obtaining the second feature value corresponding to each of the first feature values in the second speech feature based on each of the first feature values and an average value of the first feature values may include: and respectively calculating the difference value between each first characteristic value in the plurality of first characteristic values and the average value to obtain a second characteristic value corresponding to each first characteristic value. The second characteristic value of each voice signal is the same value in at least two voice signals with the same voice content and different volumes, so that the difference of voice characteristics caused by different volumes can be further reduced, and the performance of a network model can be improved.
Taking the first voice and the second voice as an example, the second voice may have the same voice content as the first voice, and the second voice may be obtained by amplifying the volume of the first voice by n times, the first feature value in the first voice feature of the second voice may be obtained based on a logarithm operation, and the first feature value in the first voice feature of the first voice may be obtained based on a logarithm operation, so that each first feature value in the first voice feature of the second voice is increased by ln (n) or log (n) compared with the corresponding first feature value in the first voice feature of the first voice.
Taking the first speech feature of the first speech and the first speech feature of the second speech as filter banks features, each first feature value in the first speech feature of the second speech is increased by ln(n) compared with the corresponding first feature value in the first speech feature of the first speech. The plurality of first feature values of the first speech may be (a_1, a_2, a_3, …, a_i), and the plurality of first feature values of the second speech may be (a_1 + ln(n), a_2 + ln(n), a_3 + ln(n), …, a_i + ln(n)), where i is a positive integer. The average of the plurality of first feature values of the first speech may be a_avg = (a_1 + a_2 + a_3 + … + a_i) / i, and that of the second speech may be a_avg + ln(n). The difference between each first feature value and the corresponding average may be calculated to obtain the second feature values: the plurality of second feature values of the first speech may be (a_1 - a_avg, a_2 - a_avg, a_3 - a_avg, …, a_i - a_avg), and the plurality of second feature values of the second speech may be the identical vector (a_1 - a_avg, a_2 - a_avg, a_3 - a_avg, …, a_i - a_avg).
In this embodiment, by performing feature averaging processing on the plurality of first feature values, it is possible to quickly and accurately perform speech amplitude feature separation processing on the first speech feature.
Optionally, the second speech feature further includes the speech amplitude feature value.
Wherein the speech amplitude feature value may be the average of the plurality of first feature values. Taking the plurality of first feature values as (x_1, x_2, x_3, …, x_k) with average x_avg as an example, the second feature values corresponding to the first feature values may be (x_1 - x_avg, x_2 - x_avg, x_3 - x_avg, …, x_k - x_avg, x_avg). This vector (x_1 - x_avg, x_2 - x_avg, x_3 - x_avg, …, x_k - x_avg, x_avg) can be input into the network model for further voice processing.
In this embodiment, the second speech feature further includes the speech amplitude feature value, and features related to volume in the speech signal can be separately extracted as a part of the speech feature, so that differences of the speech features due to different volumes can be reduced, and performance of the network model can be improved; and the voice amplitude characteristic value can also be used for distinguishing noise, noise and voice can be distinguished through the voice amplitude characteristic value, and under the condition that the noise needs to be used in the voice processing process, the effect of voice processing by the second voice characteristic comprising the voice amplitude characteristic value is better.
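The patent does not specify how the amplitude value would be used to distinguish noise from speech; as one possibility, a simple threshold on the appended amplitude dimension could flag low-energy segments. Both the threshold value and the low-energy-means-noise assumption in this sketch are purely illustrative:

    def looks_like_low_energy_noise(improved_feature, threshold=-5.0):
        # Assumption (not from the patent): treat segments whose appended
        # amplitude value (the last dimension, x_avg above) falls below a
        # tuned threshold as candidate noise. The threshold is a placeholder.
        amplitude_value = improved_feature[-1]
        return amplitude_value < threshold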
Optionally, the dimension of the second speech feature is greater than or equal to the dimension of the first speech feature.
Wherein the first speech feature may comprise a plurality of first feature values and the second speech feature may comprise a plurality of second feature values. The first feature values may correspond one-to-one to the second feature values such that the dimensionality of the second speech feature may be equal to the dimensionality of the first speech feature. The second voice feature can also comprise a voice amplitude feature value, the voice amplitude feature value is used for characterizing the voice amplitude feature of the voice to be processed, and therefore the dimension of the second voice feature is larger than that of the first voice feature. Further, the second speech feature may further include feature values for characterizing other features of the speech to be processed, which is not limited in this embodiment of the present invention.
In this embodiment, the dimension of the second speech feature is greater than or equal to the dimension of the first speech feature, so that more features of the speech to be processed can be obtained, and the speech processing effect can be further improved.
Optionally, the first speech feature includes a filter banks filter bank feature.
The filter banks feature, i.e. the Fbank feature, is a commonly used speech feature. The response of the human ear to the sound spectrum is nonlinear, and the Fbank feature simulates this behavior of the human ear when processing speech; adopting the Fbank feature during speech recognition improves recognition performance. The Fbank feature can be obtained by performing a Fourier transform and Mel filtering on the speech frame by frame and then taking the logarithm. In practical application, a Fourier transform may be performed on the voice to be processed to obtain its frequency-domain representation, Mel filtering may be applied to this representation to obtain a filtering result, and taking the logarithm of the filtering result yields the Fbank feature of the voice to be processed.
Taking the first voice and the second voice as an example, where the second voice has the same voice content as the first voice and is obtained by amplifying the volume of the first voice by n times, the first voice features of the first voice and the second voice may both be filter banks features. For example, after Fourier transform and Mel filtering, the first voice may yield (b_1, b_2, b_3, …, b_i); taking the logarithm then gives the first voice feature of the first voice, (ln b_1, ln b_2, ln b_3, …, ln b_i). After Fourier transform and Mel filtering, the second voice yields (n·b_1, n·b_2, n·b_3, …, n·b_i); taking the logarithm gives the first voice feature of the second voice, (ln b_1 + ln(n), ln b_2 + ln(n), ln b_3 + ln(n), …, ln b_i + ln(n)).
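The ln(n) shift is easy to verify numerically. The sketch below uses a crude band-summing stand-in for a real Mel filterbank (the function name and bin layout are assumptions, not the patent's pipeline); since Mel filtering is linear, the same shift holds for a true Fbank implementation:

    import numpy as np

    def toy_fbank(signal, n_fft=512, n_mels=26):
        # Fourier transform -> band filtering -> logarithm. Rectangular
        # band-summing filters stand in for Mel-scaled triangular filters;
        # the ln(n) property holds for any fixed linear filterbank
        # followed by a logarithm.
        spectrum = np.abs(np.fft.rfft(signal, n_fft))      # Fourier transform
        edges = np.linspace(0, len(spectrum), n_mels + 1).astype(int)
        filtered = np.array([spectrum[edges[m]:edges[m + 1]].sum()
                             for m in range(n_mels)])      # "Mel" filtering
        return np.log(filtered)                            # logarithm -> Fbank

    rng = np.random.default_rng(0)
    speech = rng.standard_normal(512)
    n = 3.0
    shift = toy_fbank(n * speech) - toy_fbank(speech)
    print(np.allclose(shift, np.log(n)))                   # True: each dim shifts by ln(n)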
In this embodiment, the first speech features include filter banks features, which are sensitive to the volume of sound, and the volume has the same influence on each dimension of the feature vector of the filter banks features, and the speech amplitude features of the speech to be processed are separated, so that features related to the volume in the speech signal can be extracted, and the second speech features are obtained by calculation, thereby improving the filter banks features, reducing the difference of the speech features caused by different volumes, and improving the performance of the network model by performing speech processing on the improved filter banks features.
The embodiment of the invention also provides an elevator control method, which comprises the following steps:
receiving target voice input by a user in a scene of using an elevator;
the voice processing method of the embodiment of the invention is adopted to carry out off-line intention recognition on the target voice to obtain first control information;
and controlling the elevator to execute a first operation corresponding to the first control information.
The elevator control method can be applied to an elevator control device in an elevator, and is used for controlling the elevator to go to a certain floor or cancel the going to the certain floor, or controlling the elevator to open or close a door, and the elevator control device can also control the elevator to perform other operations, which is not limited in the embodiment of the invention. The elevator control device can receive the target voice input by the user in the following way: the elevator control device receives the input voice as the target voice after receiving the awakening word input by the user. Wherein, the awakening word can be set according to the requirement, for example, the awakening word can be 'hello, elevator'. Alternatively, the elevator control device may directly receive the input voice as the target voice.
In addition, the voice processing method according to the embodiment of the present invention performs offline intention recognition on the target voice to obtain first control information, which may be extracting a first voice feature of the target voice; carrying out voice amplitude characteristic separation processing on the first voice characteristic to obtain a second voice characteristic; and acquiring the first control information based on the second voice characteristic.
It should be noted that the first control information is obtained based on the second voice feature by performing offline intention recognition, specifically by inputting the second voice feature into a network model for offline intention recognition. There are two modes of performing offline intention recognition based on the second voice feature to obtain the first control information: the first mode is to obtain the first control information directly from the second voice feature, and the second mode is to convert the target voice into text based on the second voice feature and obtain the first control information from that text.
For example, in a scenario of performing elevator control on an elevator, the first control information may include a control instruction for controlling the elevator and a floor corresponding to the control instruction, where the control instruction may include a confirmation instruction for controlling the elevator to go to a certain floor, and may further include a cancellation instruction for cancelling an operation of the elevator to go to a certain floor. It should be noted that the first control information is an exemplary illustration, and the first control information may change according to an application scenario, which is not limited in this embodiment of the present invention.
In the first mode, the elevator control apparatus may store a voice command word bank in advance, the voice command word bank being used to store a plurality of voice command words, the voice characteristics of each voice command word may be stored, and one voice command word corresponds to one intention information. The offline intention recognition based on the second voice feature may be implemented in the following manner: and selecting a voice command word corresponding to the voice feature with the highest similarity with the second voice feature from the voice command word bank, and taking intention information corresponding to the voice command word as first control information.
In a second mode, the elevator control apparatus may store a text command word bank for storing a plurality of text command words, and one text command word corresponds to one intention information. The offline intention recognition based on the second voice feature may be implemented in the following manner: and acquiring a first text corresponding to the second voice characteristic, selecting a text command word with the highest similarity with the first text from a text command word library, and taking intention information corresponding to the text command word as first control information.
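As an illustration of the first mode, the similarity lookup against a stored voice command word bank can be sketched as follows. Everything concrete here (the bank entries, the 40-dimensional placeholder features, and the cosine similarity measure) is an assumption for the sketch; the patent only requires selecting the command word whose stored voice feature is most similar to the second voice feature:

    import numpy as np

    # Hypothetical offline command bank: stored feature vectors for voice
    # command words and their corresponding intention information.
    rng = np.random.default_rng(0)
    COMMAND_BANK = {
        "go to the third floor": "confirm instruction - 3rd floor",
        "cancel the third floor": "cancel instruction - 3rd floor",
        "open the door": "open-door instruction",
    }
    COMMAND_FEATURES = {cmd: rng.standard_normal(40) for cmd in COMMAND_BANK}

    def offline_intent(second_feature):
        # Pick the command word whose stored feature is most similar to
        # the second voice feature; its intention information becomes the
        # first control information (the first mode described above).
        def cosine(u, v):
            return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
        best = max(COMMAND_FEATURES,
                   key=lambda cmd: cosine(second_feature, COMMAND_FEATURES[cmd]))
        return COMMAND_BANK[best]

    print(offline_intent(rng.standard_normal(40)))  # e.g. "open-door instruction"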
In the prior art, online voice recognition must transmit audio over a network, so its response is slow and easily affected by network quality; when network quality is poor, the response delay is large and elevator control efficiency is therefore low.
In the embodiment of the invention, target voice input by a user in an elevator-use scene is received; the voice processing method of the embodiment of the invention is adopted to perform offline intention recognition on the target voice to obtain first control information; and the elevator is controlled to execute a first operation corresponding to the first control information. Performing offline intention recognition with the voice processing method of the embodiment of the invention improves recognition efficiency: because offline recognition generally responds faster than online recognition, obtaining the first control information offline and executing the corresponding first operation guarantees the response speed of elevator control and improves elevator control efficiency. Applied to the elevator control scenario, the method can greatly improve the start-up and running efficiency of the elevator and increase user stickiness.
Optionally, the method further includes:
sending the target voice to a server so that the server performs online intention recognition on the target voice;
receiving second control information sent by the server;
and if the second control information is inconsistent with the first control information, controlling the elevator to cancel the execution of the first operation and executing a second operation corresponding to the second control information.
The second control information is of the same type as the first control information and is not described again here. The server performs online intention recognition on the target voice to obtain the second control information; the way the server obtains the second control information is the same as the way the elevator control device obtains the first control information through offline intention recognition, and is likewise not repeated here. It should be noted that, because the voice command word bank and the text command word bank used for online intention recognition are stored in the cloud, their sample data is richer and the success rate and accuracy of voice recognition are very high. For example, the voice command word bank for online intention recognition used to control an elevator may contain the voice command word 'go to a restaurant', whose corresponding intention information may be 'confirm instruction - 3rd floor', making elevator control more intelligent.
If the elevator control device determines that the second control information is consistent with the first control information, the elevator control device can ignore the second control information and continue to control the elevator to execute the first operation. If the elevator control device determines that the second control information is inconsistent with the first control information, the elevator control device can directly control the elevator to cancel the execution of the first operation and execute the second operation corresponding to the second control information, and the method is simple and high in efficiency.
Taking the first control information 'confirm instruction - 3rd floor' as an example, the corresponding first operation is going to the 3rd floor; if the second control information is 'confirm instruction - 5th floor', the corresponding second operation is going to the 5th floor, and the elevator can directly cancel the operation of going to the 3rd floor and go to the 5th floor instead.
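The offline-first, online-correction flow just described can be sketched as follows; the class and method names are illustrative placeholders, not the patent's implementation, and real code would drive actual elevator hardware and a network client:

    class ElevatorController:
        def on_target_speech(self, speech):
            self.first_info = self.offline_intent(speech)  # fast local recognition
            self.execute(self.first_info)                  # respond immediately
            self.send_to_server(speech)                    # request online recognition

        def on_server_reply(self, second_info):
            if second_info != self.first_info:
                # Online recognition is generally more accurate, so on a
                # mismatch cancel the first operation and run the second.
                self.cancel(self.first_info)
                self.execute(second_info)
            # On a match the reply is ignored and the first operation stands.

        # Placeholders for this sketch:
        def offline_intent(self, speech): return "confirm instruction - 3rd floor"
        def execute(self, info): print("execute:", info)
        def cancel(self, info): print("cancel:", info)
        def send_to_server(self, speech): pass

    ctrl = ElevatorController()
    ctrl.on_target_speech("go to the third floor")           # goes to the 3rd floor
    ctrl.on_server_reply("confirm instruction - 5th floor")  # mismatch: cancel, go to 5th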
In the embodiment, online intention recognition is carried out through the server, because the accuracy of online recognition is generally higher than that of offline recognition, if the second control information is inconsistent with the first control information, the elevator is controlled to cancel execution of the first operation and execute the second operation corresponding to the second control information, and the accuracy of elevator control is ensured.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present invention, and as shown in fig. 3, the speech processing apparatus 200 includes:
an extraction module 201, configured to extract a first voice feature of a voice to be processed;
a separation module 202, configured to perform speech amplitude feature separation processing on the first speech feature to obtain a second speech feature;
an obtaining module 203, configured to obtain a processing result of the to-be-processed speech based on the second speech feature.
Optionally, the first speech feature includes a plurality of first feature values, and the second speech feature includes a plurality of second feature values;
the separation module 202 is specifically configured to:
carrying out feature average processing on the plurality of first feature values to obtain a voice amplitude feature value, wherein the voice amplitude feature value is used for representing the voice amplitude feature of the voice to be processed;
and respectively carrying out voice amplitude characteristic separation processing on each first characteristic value based on the voice amplitude characteristic value to obtain second characteristic values corresponding to each first characteristic value in the second voice characteristic.
Optionally, the second speech feature further includes the speech amplitude feature value.
Optionally, the dimension of the second speech feature is greater than or equal to the dimension of the first speech feature.
Optionally, the first speech feature includes a filter banks filter bank feature.
The speech processing apparatus can implement each process implemented in the method embodiment of fig. 1, and is not described here again to avoid repetition.
Referring to fig. 4, fig. 4 is a schematic structural diagram of an elevator control apparatus according to an embodiment of the present invention, and as shown in fig. 4, an elevator control apparatus 300 includes:
the first receiving module 301 is used for receiving target voice input by a user in an elevator using scene;
the recognition module 302 is configured to perform offline intention recognition on the target speech by using the speech processing method according to the embodiment of the present invention, so as to obtain first control information;
and the first control module 303 is used for controlling the elevator to execute a first operation corresponding to the first control information.
Optionally, as shown in fig. 5, the elevator control apparatus 300 further includes:
a sending module 304, configured to send the target voice to a server, so that the server performs online intent recognition on the target voice;
a second receiving module 305, configured to receive second control information sent by the server;
and a second control module 306, configured to control the elevator to cancel execution of the first operation and execute a second operation corresponding to the second control information if the second control information is inconsistent with the first control information.
The elevator control device can realize each process realized in the elevator control method in the embodiment of the invention, and is not described again in order to avoid repetition.
In the embodiment of the present invention, the electronic device includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted mobile terminal, a wearable device, an elevator, and the like.
Referring to fig. 6, fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 6, the electronic device 400 includes: a memory 402, a processor 401, and a program stored on the memory 402 and executable on the processor 401, wherein:
as an embodiment, the processor 401 reads a program in the memory 402 for executing:
extracting a first voice feature of the voice to be processed;
carrying out voice amplitude characteristic separation processing on the first voice characteristic to obtain a second voice characteristic;
and acquiring a processing result of the voice to be processed based on the second voice characteristic.
Optionally, the first speech feature includes a plurality of first feature values, and the second speech feature includes a plurality of second feature values;
the processor 401 is configured to perform speech amplitude feature separation processing on the first speech feature to obtain a second speech feature, and includes:
carrying out feature average processing on the plurality of first feature values to obtain a voice amplitude feature value, wherein the voice amplitude feature value is used for representing the voice amplitude feature of the voice to be processed;
and respectively carrying out voice amplitude characteristic separation processing on each first characteristic value based on the voice amplitude characteristic value to obtain second characteristic values corresponding to each first characteristic value in the second voice characteristic.
Optionally, the second speech feature further includes the speech amplitude feature value.
Optionally, the dimension of the second speech feature is greater than or equal to the dimension of the first speech feature.
Optionally, the first speech feature includes a filter banks filter bank feature.
As another embodiment, the processor 401 reads the program in the memory 402 to execute:
receiving target voice input by a user in a scene of using an elevator;
the voice processing method of the embodiment of the invention is adopted to carry out off-line intention recognition on the target voice to obtain first control information;
and controlling the elevator to execute a first operation corresponding to the first control information.
Optionally, the processor 401 is further configured to perform:
sending the target voice to a server so that the server performs online intention recognition on the target voice;
receiving second control information sent by the server;
and if the second control information is inconsistent with the first control information, controlling the elevator to cancel the execution of the first operation and executing a second operation corresponding to the second control information.
In FIG. 6, the bus architecture may include any number of interconnected buses and bridges, with one or more processors, represented by processor 401, and various circuits, represented by memory 402, being linked together. The bus architecture may also link together various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. The bus interface provides an interface.
The processor 401 is responsible for managing the bus architecture and general processing, and the memory 402 may store data used by the processor 401 in performing operations.
It should be noted that any implementation manner in the method embodiment of the present invention may be implemented by the electronic device in this embodiment, and achieve the same beneficial effects, and details are not described here.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements each process of the foregoing voice processing method embodiment, or the computer program, when executed by the processor, implements each process of the foregoing elevator control method embodiment, and can achieve the same technical effect, and is not described herein again to avoid repetition. The computer-readable storage medium may be a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A method of speech processing, the method comprising:
extracting a first voice feature of the voice to be processed;
carrying out voice amplitude characteristic separation processing on the first voice characteristic to obtain a second voice characteristic;
and acquiring a processing result of the voice to be processed based on the second voice characteristic.
2. The method of claim 1, wherein the first speech feature comprises a plurality of first feature values and the second speech feature comprises a plurality of second feature values;
the performing speech amplitude feature separation processing on the first speech feature to obtain a second speech feature includes:
carrying out feature average processing on the plurality of first feature values to obtain a voice amplitude feature value, wherein the voice amplitude feature value is used for representing the voice amplitude feature of the voice to be processed;
and respectively carrying out voice amplitude characteristic separation processing on each first characteristic value based on the voice amplitude characteristic value to obtain second characteristic values corresponding to each first characteristic value in the second voice characteristic.
3. The method of claim 2, wherein the second speech feature further comprises the speech magnitude feature value.
4. The method of claim 1, wherein the dimension of the second speech feature is greater than or equal to the dimension of the first speech feature.
5. The method of claim 1, wherein the first speech features comprise filter banks features.
6. An elevator control method, characterized in that the method comprises:
receiving target voice input by a user in a scene of using an elevator;
performing offline intention recognition on the target voice by adopting the voice processing method of any one of claims 1 to 5 to obtain first control information;
and controlling the elevator to execute a first operation corresponding to the first control information.
7. The method of claim 6, further comprising:
sending the target voice to a server so that the server performs online intention recognition on the target voice;
receiving second control information sent by the server;
and if the second control information is inconsistent with the first control information, controlling the elevator to cancel the execution of the first operation and executing a second operation corresponding to the second control information.
8. A speech processing apparatus, characterized in that the speech processing apparatus comprises:
the extraction module is used for extracting a first voice feature of the voice to be processed;
the separation module is used for carrying out voice amplitude characteristic separation processing on the first voice characteristic to obtain a second voice characteristic;
and the acquisition module is used for acquiring the processing result of the voice to be processed based on the second voice characteristic.
9. An elevator control device, characterized by comprising:
the first receiving module is used for receiving target voice input by a user in an elevator using scene;
the recognition module is used for performing offline intention recognition on the target voice by adopting the voice processing method of any one of claims 1 to 5 to obtain first control information;
and the first control module is used for controlling the elevator to execute the first operation corresponding to the first control information.
10. An electronic device, comprising: a memory, a processor and a program stored on the memory and executable on the processor, the program, when executed by the processor, implementing the steps in the speech processing method according to any of claims 1 to 5; alternatively, the program realizes the steps in the elevator control method according to any one of claims 6 to 7 when executed by the processor.
CN202010325555.9A 2020-04-23 2020-04-23 Voice processing method and device and elevator control method and device Pending CN111489740A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010325555.9A CN111489740A (en) 2020-04-23 2020-04-23 Voice processing method and device and elevator control method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010325555.9A CN111489740A (en) 2020-04-23 2020-04-23 Voice processing method and device and elevator control method and device

Publications (1)

Publication Number Publication Date
CN111489740A true CN111489740A (en) 2020-08-04

Family

ID=71813137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010325555.9A Pending CN111489740A (en) 2020-04-23 2020-04-23 Voice processing method and device and elevator control method and device

Country Status (1)

Country Link
CN (1) CN111489740A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113571069A (en) * 2021-08-03 2021-10-29 北京房江湖科技有限公司 Information processing method, device and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4829578A (en) * 1986-10-02 1989-05-09 Dragon Systems, Inc. Speech detection and recognition apparatus for use with background noise of varying levels
JP2005079781A (en) * 2003-08-29 2005-03-24 Nippon Telegr & Teleph Corp <Ntt> Method for separating blind signal, blind signal separation program and recording medium
WO2017081977A1 (en) * 2015-11-12 2017-05-18 三菱電機株式会社 Motor control device and elevator in which same is used
CN106935248A (en) * 2017-02-14 2017-07-07 广州孩教圈信息科技股份有限公司 A kind of voice similarity detection method and device
US20170294195A1 (en) * 2016-04-07 2017-10-12 Canon Kabushiki Kaisha Sound discriminating device, sound discriminating method, and computer program
CN107464567A (en) * 2017-07-24 2017-12-12 深圳云知声信息技术有限公司 Audio recognition method and device
CN110097884A (en) * 2019-06-11 2019-08-06 大众问问(北京)信息科技有限公司 A kind of voice interactive method and device
CN110890087A (en) * 2018-09-10 2020-03-17 北京嘉楠捷思信息技术有限公司 Voice recognition method and device based on cosine similarity

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4829578A (en) * 1986-10-02 1989-05-09 Dragon Systems, Inc. Speech detection and recognition apparatus for use with background noise of varying levels
JP2005079781A (en) * 2003-08-29 2005-03-24 Nippon Telegr & Teleph Corp <Ntt> Method for separating blind signal, blind signal separation program and recording medium
WO2017081977A1 (en) * 2015-11-12 2017-05-18 三菱電機株式会社 Motor control device and elevator in which same is used
US20170294195A1 (en) * 2016-04-07 2017-10-12 Canon Kabushiki Kaisha Sound discriminating device, sound discriminating method, and computer program
CN106935248A (en) * 2017-02-14 2017-07-07 广州孩教圈信息科技股份有限公司 A kind of voice similarity detection method and device
CN107464567A (en) * 2017-07-24 2017-12-12 深圳云知声信息技术有限公司 Audio recognition method and device
CN110890087A (en) * 2018-09-10 2020-03-17 北京嘉楠捷思信息技术有限公司 Voice recognition method and device based on cosine similarity
CN110097884A (en) * 2019-06-11 2019-08-06 大众问问(北京)信息科技有限公司 A kind of voice interactive method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Lin Qilin et al.: "Design of an elevator auxiliary control system based on speech recognition" *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113571069A (en) * 2021-08-03 2021-10-29 北京房江湖科技有限公司 Information processing method, device and storage medium

Similar Documents

Publication Publication Date Title
WO2021093449A1 (en) Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
US11132518B2 (en) Method and apparatus for translating speech
CN111312245B (en) Voice response method, device and storage medium
US10810993B2 (en) Sample-efficient adaptive text-to-speech
CN112233698B (en) Character emotion recognition method, device, terminal equipment and storage medium
CN112562691A (en) Voiceprint recognition method and device, computer equipment and storage medium
CN111402861A (en) Voice recognition method, device, equipment and storage medium
CN105654955B (en) Audio recognition method and device
CN113096647A (en) Voice model training method and device and electronic equipment
CN116797695A (en) Interaction method, system and storage medium of digital person and virtual whiteboard
CN114579718A (en) Text feature generation method, device, equipment and storage medium combining RPA and AI
CN113782030B (en) Error correction method based on multi-mode voice recognition result and related equipment
CN113889091A (en) Voice recognition method and device, computer readable storage medium and electronic equipment
CN111489740A (en) Voice processing method and device and elevator control method and device
CN111400463A (en) Dialog response method, apparatus, device and medium
CN111554270B (en) Training sample screening method and electronic equipment
KR20230020508A (en) Remove text echo
CN113327594A (en) Speech recognition model training method, device, equipment and storage medium
CN114913859B (en) Voiceprint recognition method, voiceprint recognition device, electronic equipment and storage medium
CN111785302A (en) Speaker separation method and device and electronic equipment
US20230015112A1 (en) Method and apparatus for processing speech, electronic device and storage medium
JP7372402B2 (en) Speech synthesis method, device, electronic device and storage medium
CN115019787B (en) Interactive homonym disambiguation method, system, electronic equipment and storage medium
CN114171043B (en) Echo determination method, device, equipment and storage medium
CN115985320A (en) Intelligent device control method and device, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20201116

Address after: 266100 Room 2002, 20th Floor, Building 2, Darong Century Complex (Darong Center), 180 Haier Road, Laoshan District, Qingdao City, Shandong Province

Applicant after: Shandong Shengzhi Wulian Technology Co., Ltd

Address before: Room 306, floor 3, NO.67, Beisihuan West Road, Haidian District, Beijing 100098

Applicant before: BEIJING SOUNDAI TECHNOLOGY Co., Ltd.

TA01 Transfer of patent application right