CN113270111A

CN113270111A - Height prediction method, device, equipment and medium based on audio data

Info

Publication number: CN113270111A
Application number: CN202110536777.XA
Authority: CN
Inventors: 吴建花; 李南南; 张乔石; 余魏
Original assignee: Guangzhou Speakin Intelligent Technology Co ltd
Current assignee: Guangzhou Speakin Intelligent Technology Co ltd
Priority date: 2021-05-17
Filing date: 2021-05-17
Publication date: 2021-08-17

Abstract

The application discloses a height prediction method, apparatus, device, and medium based on audio data, which provide a way to predict a speaker's height from sound data and give public security officers an efficient means of screening leads. The method comprises: acquiring audio data to be predicted; extracting Mel-frequency cepstral coefficient (MFCC) features of the audio data to be predicted; inputting the MFCC features into a preset Gaussian mixture general background model (that is, a Gaussian mixture model-universal background model, GMM-UBM) for feature extraction to obtain a feature vector; and inputting the feature vector into a preset support vector machine regression model for height prediction to obtain a height prediction result corresponding to the audio data to be predicted.

Description

Height prediction method, device, equipment and medium based on audio data

Technical Field

The present application relates to the field of speech processing technologies, and in particular, to a height prediction method, apparatus, device, and medium based on audio data.

Background

Biometric recognition technology is an important field of the new generation of artificial intelligence and a major research direction in which identity recognition is carried out by means of human physiological or behavioral characteristics. In recent years, owing to the rapid development of information technologies such as cloud computing, big data, the Internet of Things, and deep learning, biometric recognition technology has made continuous breakthroughs in basic theory, algorithm models, and innovative applications.

The voiceprint is a common biometric feature. It is widely applied in the field of speech processing and can be used for gender recognition, age prediction, identity recognition, and the like. In a speech monitoring application scenario, estimating speaker attributes from audio data is a key step in generating biometric evidence. However, the prior art does not provide a method for predicting height from sound data. Providing such a method is therefore a technical problem that those skilled in the art urgently need to solve.

Disclosure of Invention

The application provides a height prediction method, apparatus, device, and medium based on audio data, which are used for providing a method for predicting height from sound data.

In view of the above, a first aspect of the present application provides a height prediction method based on audio data, including:

acquiring audio data to be predicted;

extracting the Mel frequency cepstrum coefficient characteristics of the audio data to be predicted;

and inputting the Mel frequency cepstrum coefficient characteristics into a preset Gaussian mixture general background model for characteristic extraction to obtain characteristic vectors, and inputting the characteristic vectors into a preset support vector machine regression model for height prediction to obtain height prediction results corresponding to the audio data to be predicted.

Optionally, the extracting the mel-frequency cepstrum coefficient feature of the audio data to be predicted further includes:

and carrying out normalization processing on the Mel frequency cepstrum coefficient characteristics.

Optionally, the preset gaussian mixture general background model includes a first preset gaussian mixture general background model and a second preset gaussian mixture general background model, and the preset support vector machine regression model includes a first preset support vector machine regression model and a second preset support vector machine regression model;

before the inputting of the Mel frequency cepstrum coefficient features into the preset Gaussian mixture general background model for feature extraction to obtain feature vectors and the inputting of the feature vectors into the preset support vector machine regression model for height prediction to obtain the height prediction result corresponding to the audio data to be predicted, the method further includes:

judging the gender of the voice in the audio data to be predicted;

the inputting of the Mel frequency cepstrum coefficient features into the preset Gaussian mixture general background model for feature extraction to obtain feature vectors and the inputting of the feature vectors into the preset support vector machine regression model for height prediction to obtain the height prediction result corresponding to the audio data to be predicted includes:

when the gender of the voice in the audio data to be predicted is female, inputting the Mel frequency cepstrum coefficient characteristics into the first preset Gaussian mixture general background model for characteristic extraction to obtain a first characteristic vector, and inputting the first characteristic vector into the first preset support vector machine regression model for height prediction to obtain a height prediction result corresponding to the audio data to be predicted;

and when the gender of the voice in the audio data to be predicted is male, inputting the Mel frequency cepstrum coefficient characteristics into the second preset Gaussian mixture general background model for characteristic extraction to obtain a second characteristic vector, and inputting the second characteristic vector into the second preset support vector machine regression model for height prediction to obtain a height prediction result corresponding to the audio data to be predicted.

Optionally, the configuration process of the preset gaussian mixture general background model is as follows:

acquiring a plurality of female background audio data and male background audio data, and respectively training a Gaussian mixture general background model through the female audio data and the male audio data to obtain a first Gaussian mixture general background model and a second Gaussian mixture general background model;

acquiring audio data to be trained, and dividing the audio data according to the voice and the gender in the audio data to be trained to obtain female audio data to be trained and male audio data to be trained, wherein the audio data to be trained is provided with a height label;

respectively extracting Mel frequency cepstrum coefficient characteristics of the female audio data to be trained and the male audio data to be trained;

inputting the Mel frequency cepstrum coefficient characteristics of the female audio data to be trained into the first Gaussian mixture general background model for training to obtain the first preset Gaussian mixture general background model;

and inputting the Mel frequency cepstrum coefficient characteristics of the male audio data to be trained into the second Gaussian mixture general background model for training to obtain the second preset Gaussian mixture general background model.

Optionally, the configuration process of the preset support vector machine regression model is as follows:

inputting the Mel frequency cepstrum coefficient characteristics of the female audio data to be trained into the first Gaussian mixture general background model for characteristic extraction to obtain the characteristic vector of the female audio data to be trained;

inputting the Mel frequency cepstrum coefficient characteristics of the male audio data to be trained into the second Gaussian mixture general background model for characteristic extraction to obtain the characteristic vector of the male audio data to be trained;

inputting the feature vector of the female audio data to be trained into a first support vector machine regression model for supervised training to obtain a first preset support vector machine regression model;

and inputting the feature vector of the male audio data to be trained into a second support vector machine regression model for supervised training to obtain the second preset support vector machine regression model.

A second aspect of the present application provides a height prediction apparatus based on audio data, comprising:

an acquisition unit configured to acquire audio data to be predicted;

the characteristic extraction unit is used for extracting the Mel frequency cepstrum coefficient characteristic of the audio data to be predicted;

and the prediction unit is used for inputting the Mel frequency cepstrum coefficient characteristics into a preset Gaussian mixture general background model for characteristic extraction to obtain characteristic vectors, inputting the characteristic vectors into a preset support vector machine regression model for height prediction to obtain height prediction results corresponding to the audio data to be predicted.

Optionally, the apparatus further includes:

and the processing unit is used for carrying out normalization processing on the Mel frequency cepstrum coefficient characteristics.

Optionally, the preset gaussian mixture general background model includes a first preset gaussian mixture general background model and a second preset gaussian mixture general background model, the preset support vector machine regression model includes a first preset support vector machine regression model and a second preset support vector machine regression model, and the apparatus further includes:

the judging unit is used for judging the gender of the human voice in the audio data to be predicted;

the prediction unit is specifically configured to:

A third aspect of the application provides a height prediction device based on audio data, the device comprising a processor and a memory;

the memory is used for storing program codes and transmitting the program codes to the processor;

the processor is configured to execute the method for height prediction based on audio data according to any of the first aspect according to instructions in the program code.

A fourth aspect of the present application provides a computer-readable storage medium for storing program code for executing the method for audio data based height prediction according to any of the first aspects.

According to the technical scheme, the method has the following advantages:

the application provides a height prediction method based on audio data, which comprises the following steps: acquiring audio data to be predicted; extracting the Mel frequency cepstrum coefficient characteristics of the audio data to be predicted; inputting the Mel frequency cepstrum coefficient characteristics into a preset Gaussian mixture general background model for characteristic extraction to obtain characteristic vectors, inputting the characteristic vectors into a preset support vector machine regression model for height prediction to obtain height prediction results corresponding to the audio data to be predicted.

According to the method, after the audio data to be predicted are obtained, the Mel frequency cepstrum coefficient characteristics are extracted and input into a preset Gaussian mixture general background model for characteristic extraction, characteristic vectors are obtained, then the characteristic vectors are input into a preset support vector machine regression model for height prediction, height prediction results corresponding to the audio data to be predicted are obtained, and the method for height prediction through sound data is achieved.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flow chart of a height prediction method based on audio data according to an embodiment of the present disclosure;

FIG. 2 is another schematic flow chart of a height prediction method based on audio data according to an embodiment of the present disclosure;

FIG. 3 is a schematic structural diagram of a height prediction apparatus based on audio data according to an embodiment of the present disclosure.

Detailed Description

In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

For ease of understanding, referring to FIG. 1, an embodiment of a height prediction method based on audio data provided by the present application includes:

Step 101, acquiring audio data to be predicted.

The voice characteristics, vocal tract characteristics, and pronunciation habits of each person are almost unique. By building an acoustic model of the voice features that characterize and identify a speaker, biometric attributes such as speaker identity, height, and age can be estimated. It has been found that taller people usually have a larger lower respiratory tract, and that the additional space, including the lungs, produces a deeper, more muffled sound; as height increases, the frequency of the sound produced by the airways in the lungs decreases significantly, so taller people tend to have a lower pitch. Speech therefore carries not only the linguistic content and the identity of the speaker but also paralinguistic information such as height, age, gender, and emotion, so the height of a speaker can be predicted from the voice.

In this embodiment of the application, the audio data to be predicted may be obtained through an audio acquisition device, a recording device, or the like.

Step 102, extracting the Mel frequency cepstrum coefficient features of the audio data to be predicted.

After the audio data to be predicted is obtained, its Mel-frequency cepstral coefficient (MFCC) features are extracted. Once extracted, the MFCC features may be normalized. The extraction of MFCC features belongs to the prior art and is not described in detail here.
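The source treats MFCC extraction as prior art and does not fix its parameters or the kind of normalization used. The following is a minimal sketch only, assuming 16 kHz audio, 25 ms frames with a 10 ms hop, 26 mel filters, 13 coefficients, and cepstral mean and variance normalization (CMVN). All function names and parameter values are illustrative assumptions, not the applicant's implementation.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale (assumed design)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):          # rising edge of the triangle
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling edge of the triangle
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_ceps=13):
    """Return an array of shape (n_frames, n_ceps)."""
    # Pre-emphasis boosts high frequencies before framing.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    log_energy = np.log(power @ mel_filterbank(n_filters, n_fft, sample_rate).T + 1e-10)
    # DCT-II of the log mel energies gives the cepstral coefficients.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.arange(n_ceps)[:, None] * (2 * n + 1) / (2 * n_filters))
    return log_energy @ basis.T

def cmvn(features):
    """Per-coefficient mean and variance normalization over one utterance."""
    return (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-8)
```

For a one-second signal at the assumed settings, `cmvn(mfcc(signal))` yields a matrix with one row per frame and 13 columns, each column having approximately zero mean and unit variance.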

Step 103, inputting the Mel frequency cepstrum coefficient features into a preset Gaussian mixture general background model for feature extraction to obtain feature vectors, and inputting the feature vectors into a preset support vector machine regression model for height prediction to obtain a height prediction result corresponding to the audio data to be predicted.

In this embodiment of the application, the normalized Mel frequency cepstrum coefficient features are input into the preset Gaussian mixture general background model for feature extraction to obtain feature vectors, and the feature vectors are input into the preset support vector machine regression model for height prediction to obtain the height prediction result corresponding to the audio data to be predicted.
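The source does not state how the feature vector is derived from the Gaussian mixture general background model. One common choice, assumed here purely for illustration, is mean-only MAP adaptation of a diagonal-covariance model followed by stacking the adapted means into a single supervector. The function names and the relevance factor of 16 are hypothetical.

```python
import numpy as np

def log_gauss_diag(x, means, variances):
    """Log density of each frame under each diagonal Gaussian, shape (T, K)."""
    diff = x[:, None, :] - means[None, :, :]
    return -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                   + np.sum(np.log(2.0 * np.pi * variances), axis=1))

def posteriors(x, weights, means, variances):
    """Component responsibilities for every frame, shape (T, K)."""
    log_p = np.log(weights) + log_gauss_diag(x, means, variances)
    log_p -= log_p.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

def map_supervector(x, weights, means, variances, relevance=16.0):
    """Adapt the background-model means to one utterance and stack them."""
    gamma = posteriors(x, weights, means, variances)   # (T, K)
    n_k = gamma.sum(axis=0)                             # soft frame counts
    f_k = gamma.T @ x                                   # first-order statistics
    e_k = f_k / np.maximum(n_k, 1e-10)[:, None]         # per-component data mean
    alpha = (n_k / (n_k + relevance))[:, None]          # adaptation weight
    adapted = alpha * e_k + (1.0 - alpha) * means
    return adapted.reshape(-1)                          # length K * D
```

Under this assumption, an utterance of any duration maps to a fixed-length vector of K times D values, which is the kind of input a support vector machine regression model requires.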

The above is an embodiment of a height prediction method based on audio data provided by the present application, and the following is another embodiment of a height prediction method based on audio data provided by the present application.

Step 201, audio data to be predicted is obtained.

Step 202, extracting mel frequency cepstrum coefficient characteristics of the audio data to be predicted.

The specific processes of steps 201 to 202 are the same as those of steps 101 to 102, and are not described herein again.

Step 203, judging the gender of the voice in the audio data to be predicted.

The gender of the voice in the audio data to be predicted can be judged by a gender recognition model. A network model is trained in advance on audio data containing female and male voices to obtain the gender recognition model, which is then used to detect the gender of the voice in the audio data to be predicted.
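The source says only that a network model is trained on female and male audio; it does not name the model. As a stand-in, the sketch below trains a logistic regression classifier on one fixed-length vector per utterance (for example, the mean of its MFCC frames). The label encoding (1 for male, 0 for female), the function names, and the hyperparameters are assumptions.

```python
import numpy as np

def train_gender_classifier(x, y, lr=0.1, n_epochs=500):
    """Fit logistic regression by gradient descent; y holds 0 (female) or 1 (male)."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))   # predicted probability of "male"
        w -= lr * (x.T @ (p - y)) / len(y)       # gradient of the mean log loss
        b -= lr * np.mean(p - y)
    return w, b

def predict_gender(x, w, b):
    """Return an array of 'male' / 'female' strings, one per row of x."""
    return np.where(x @ w + b > 0.0, "male", "female")
```

The predicted label can then be used to select which pair of preset models processes the utterance.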

Step 204, when the gender of the voice in the audio data to be predicted is female, inputting the Mel frequency cepstrum coefficient features into a first preset Gaussian mixture general background model for feature extraction to obtain a first feature vector, and inputting the first feature vector into a first preset support vector machine regression model for height prediction to obtain a height prediction result corresponding to the audio data to be predicted.

Step 205, when the gender of the voice in the audio data to be predicted is male, inputting the Mel frequency cepstrum coefficient features into a second preset Gaussian mixture general background model for feature extraction to obtain a second feature vector, and inputting the second feature vector into a second preset support vector machine regression model for height prediction to obtain a height prediction result corresponding to the audio data to be predicted.

The preset Gaussian mixture general background model in the embodiment of the application comprises a first preset Gaussian mixture general background model and a second preset Gaussian mixture general background model, and the preset support vector machine regression model comprises a first preset support vector machine regression model and a second preset support vector machine regression model.

When the gender of the voice in the audio data to be predicted is female, inputting the Mel frequency cepstrum coefficient characteristics into a first preset Gaussian mixture general background model for characteristic extraction to obtain a first characteristic vector, and inputting the first characteristic vector into a first preset support vector machine regression model for height prediction to obtain a height prediction result corresponding to the audio data to be predicted.

When the gender of the voice in the audio data to be predicted is male, inputting the Mel frequency cepstrum coefficient characteristics into a second preset Gaussian mixture general background model for characteristic extraction to obtain a second characteristic vector, and inputting the second characteristic vector into a second preset support vector machine regression model for height prediction to obtain a height prediction result corresponding to the audio data to be predicted.
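The gender-dependent branching of steps 204 and 205 can be sketched as a simple dispatch over two model pairs. The type aliases, the dictionary keys `"female"` and `"male"`, and the function name are hypothetical; the feature extractor and regressor are passed in as callables so that the sketch does not presume any particular model implementation.

```python
from typing import Callable, Dict, Sequence, Tuple

Frames = Sequence[Sequence[float]]
FeatureExtractor = Callable[[Frames], Sequence[float]]   # background-model stage
Regressor = Callable[[Sequence[float]], float]           # regression stage

def predict_height(mfcc_frames: Frames,
                   gender: str,
                   models: Dict[str, Tuple[FeatureExtractor, Regressor]]) -> float:
    """Route the MFCC features to the model pair that matches the detected gender."""
    if gender not in models:
        raise ValueError(f"no model pair configured for gender {gender!r}")
    extract, regress = models[gender]
    feature_vector = extract(mfcc_frames)   # first or second feature vector
    return regress(feature_vector)          # height prediction result
```

Keeping the two model pairs in one mapping makes it explicit that the first pair handles female voices and the second pair handles male voices, and that exactly one pair is used per utterance.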

Further, the configuration process of the preset gaussian mixture general background model in the embodiment of the present application is as follows:

A1, acquiring a plurality of female background audio data and male background audio data, and respectively training a Gaussian mixture general background model through the female background audio data and the male background audio data to obtain a first Gaussian mixture general background model and a second Gaussian mixture general background model;

In order to improve the accuracy of the height prediction result, separate models are trained for male and female audio data, so that height prediction is performed separately for male and female speakers. A plurality of female background audio data and male background audio data are acquired, and a Gaussian mixture general background model is trained on each, yielding a first Gaussian mixture general background model (female) and a second Gaussian mixture general background model (male). The two models share the same structure and differ only in their trained parameters.

According to the embodiment of the application, a general background model is trained in advance through the female background audio data and the male background audio data, and then the audio data to be trained is subjected to targeted training, so that the problem that the data volume of the audio data to be trained is insufficient can be solved, and the generalization capability of the model is improved.
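The source does not specify how the general background models are trained. A minimal sketch, assuming a diagonal-covariance Gaussian mixture fitted by expectation-maximization (EM) on pooled background MFCC frames, is given below; the same routine would be called once with female background frames and once with male background frames. The function name, component count, and variance floor are illustrative assumptions.

```python
import numpy as np

def train_ubm(x, n_components=4, n_iter=20, seed=0, var_floor=1e-3):
    """Fit a diagonal GMM to frames x of shape (T, D); returns weights, means, variances."""
    rng = np.random.default_rng(seed)
    t, _ = x.shape
    # Initialise means from random frames and variances from the global variance.
    means = x[rng.choice(t, size=n_components, replace=False)].copy()
    variances = np.tile(x.var(axis=0) + var_floor, (n_components, 1))
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each frame.
        diff = x[:, None, :] - means[None, :, :]
        log_p = np.log(weights) - 0.5 * (
            np.sum(diff ** 2 / variances, axis=2)
            + np.sum(np.log(2.0 * np.pi * variances), axis=1))
        log_p -= log_p.max(axis=1, keepdims=True)
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances.
        n_k = gamma.sum(axis=0) + 1e-10
        weights = n_k / n_k.sum()
        means = (gamma.T @ x) / n_k[:, None]
        variances = (gamma.T @ x ** 2) / n_k[:, None] - means ** 2
        variances = np.maximum(variances, var_floor)
    return weights, means, variances
```

Because both calls use the same routine and the same number of components, the two resulting models have an identical structure and differ only in their parameters, as described above.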

A2, acquiring audio data to be trained, and dividing the audio data to be trained according to the gender of the voice it contains to obtain female audio data to be trained and male audio data to be trained, wherein the audio data to be trained carries height labels;

The audio data of a large number of known subjects (including males and females) can be collected, and each piece of audio data is then labeled with the subject's height to obtain the audio data to be trained. The audio data to be trained is then divided according to the gender of the voice it contains to obtain the female audio data to be trained and the male audio data to be trained.
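The labeling and division described in step A2 can be sketched as follows. The record layout, the field names, and the use of centimetres as the height unit are assumptions made for illustration; the source does not specify how labels are stored.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class LabeledUtterance:
    path: str          # location of the audio file (hypothetical field)
    gender: str        # "female" or "male"
    height_cm: float   # height label; the unit is an assumption

def split_by_gender(utterances: List[LabeledUtterance]) -> Dict[str, List[LabeledUtterance]]:
    """Divide height-labeled training audio into a female set and a male set."""
    groups: Dict[str, List[LabeledUtterance]] = {"female": [], "male": []}
    for utt in utterances:
        if utt.gender not in groups:
            raise ValueError(f"unexpected gender label {utt.gender!r}")
        groups[utt.gender].append(utt)
    return groups
```

Each resulting group keeps its height labels, so it can be used directly for the supervised training of the corresponding regression model.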

A3, respectively extracting Mel frequency cepstrum coefficient characteristics of the audio data to be trained of the female and the audio data to be trained of the male;

A4, inputting Mel frequency cepstrum coefficient characteristics of female audio data to be trained into a first Gaussian mixture general background model for training to obtain a first preset Gaussian mixture general background model;

A5, inputting the Mel frequency cepstrum coefficient characteristics of the male audio data to be trained into a second Gaussian mixture general background model for training to obtain a second preset Gaussian mixture general background model.

Further, the configuration process of the preset support vector machine regression model in the embodiment of the present application is as follows:

B1, inputting Mel frequency cepstrum coefficient characteristics of the female audio data to be trained into a first Gaussian mixture general background model for characteristic extraction to obtain characteristic vectors of the female audio data to be trained;

B2, inputting the Mel frequency cepstrum coefficient characteristics of the male audio data to be trained into a second Gaussian mixture general background model for characteristic extraction to obtain the characteristic vector of the male audio data to be trained;

B3, inputting the feature vector of the audio data to be trained of the female into a first support vector machine regression model for supervised training to obtain a first preset support vector machine regression model;

B4, inputting the feature vector of the male audio data to be trained into a second support vector machine regression model for supervised training to obtain a second preset support vector machine regression model.

When a support vector machine regression model is trained, a loss value is calculated from the height prediction result for the audio data to be trained and the true height, and the parameters of the model are updated according to the loss value until the model converges. The trained support vector machine regression model is then used as the preset support vector machine regression model.
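The loss-driven update described above can be illustrated with a linear support vector regression trained by subgradient descent on the epsilon-insensitive loss. The source specifies neither the kernel nor the optimizer, so the linear form, the L2 penalty `l2`, the tube width `epsilon`, and the learning rate are all assumptions of this sketch.

```python
import numpy as np

def train_linear_svr(x, y, epsilon=1.0, l2=1e-3, lr=0.05, n_epochs=2000):
    """Minimise l2/2 * ||w||^2 + mean(max(0, |x @ w + b - y| - epsilon))."""
    n, d = x.shape
    w = np.zeros(d)
    b = float(y.mean())                 # start from the mean height
    for _ in range(n_epochs):
        residual = x @ w + b - y        # predicted height minus true height
        # Subgradient of the epsilon-insensitive loss: zero inside the tube,
        # the sign of the residual outside it.
        g = np.where(np.abs(residual) > epsilon, np.sign(residual), 0.0)
        w -= lr * (l2 * w + (g @ x) / n)
        b -= lr * g.mean()
    return w, b

def predict_svr(x, w, b):
    return x @ w + b
```

In the gender-specific scheme, this routine would be run twice, once on the female feature vectors with their height labels and once on the male ones, yielding the first and second regression models respectively.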

In the embodiment of the application, after the audio data to be predicted is obtained, the Mel frequency cepstrum coefficient characteristics are extracted and input into a preset Gaussian mixture general background model for characteristic extraction, so that the characteristic vector is obtained, then the characteristic vector is input into a preset support vector machine regression model for height prediction, so that a height prediction result corresponding to the audio data to be predicted is obtained, and the method for height prediction through the sound data is realized.

Furthermore, in the embodiment of the application, a general background model is trained in advance on the female background audio data and the male background audio data and is then trained in a targeted manner on the audio data to be trained, which mitigates the problem of an insufficient amount of audio data to be trained and improves the generalization capability of the model. Support vector machines are trained separately on the female audio data to be trained and the male audio data to be trained, so that height is predicted separately for males and females. Each support vector machine can thus learn, in a targeted manner, the mapping between female voiceprint features and height or the mapping between male voiceprint features and height, which improves height prediction accuracy.

The above is another embodiment of the method for height prediction based on audio data provided by the present application, and the following is an embodiment of the apparatus for height prediction based on audio data provided by the present application.

Referring to fig. 3, an embodiment of the present invention provides a height prediction apparatus based on audio data, including:

an acquisition unit configured to acquire audio data to be predicted;

and the prediction unit is used for inputting the Mel frequency cepstrum coefficient characteristics into a preset Gaussian mixture general background model for characteristic extraction to obtain characteristic vectors, inputting the characteristic vectors into a preset support vector machine regression model for height prediction to obtain a height prediction result corresponding to the audio data to be predicted.
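The unit division of the apparatus (acquisition, feature extraction, and prediction) can be sketched as a thin composition of three callables. The class name, field names, and type signatures are hypothetical and serve only to show how the units hand data to one another.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Audio = Sequence[float]
Frames = Sequence[Sequence[float]]

@dataclass
class HeightPredictionApparatus:
    acquire: Callable[[], Audio]                 # acquisition unit
    extract_features: Callable[[Audio], Frames]  # feature extraction unit (MFCC)
    predict: Callable[[Frames], float]           # prediction unit (background model + regression)

    def run(self) -> float:
        """Pass audio through the three units in order and return the predicted height."""
        audio = self.acquire()
        frames = self.extract_features(audio)
        return self.predict(frames)
```

Because each unit is an independent callable, units can be combined, replaced, or deployed separately without changing the overall flow.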

As a further improvement, the method further comprises the following steps:

As a further improvement, the preset gaussian mixture general background model includes a first preset gaussian mixture general background model and a second preset gaussian mixture general background model, the preset support vector machine regression model includes a first preset support vector machine regression model and a second preset support vector machine regression model, and the apparatus further includes:

the prediction unit is specifically configured to:

when the gender of the voice in the audio data to be predicted is female, inputting Mel frequency cepstrum coefficient characteristics into a first preset Gaussian mixture general background model for characteristic extraction to obtain a first characteristic vector, and inputting the first characteristic vector into a first preset support vector machine regression model for height prediction to obtain a height prediction result corresponding to the audio data to be predicted;

As a further improvement, the configuration process of the preset gaussian mixture general background model is as follows:

respectively extracting Mel frequency cepstrum coefficient characteristics of the audio data to be trained of the female and the audio data to be trained of the male;

inputting the Mel frequency cepstrum coefficient characteristics of female audio data to be trained into a first Gaussian mixture general background model for training to obtain a first preset Gaussian mixture general background model;

and inputting the Mel frequency cepstrum coefficient characteristics of the male audio data to be trained into a second Gaussian mixture general background model for training to obtain a second preset Gaussian mixture general background model.

As a further improvement, the configuration process of the preset support vector machine regression model is as follows:

inputting the Mel frequency cepstrum coefficient characteristics of the female audio data to be trained into a first Gaussian mixture general background model for characteristic extraction to obtain the characteristic vector of the female audio data to be trained;

inputting the Mel frequency cepstrum coefficient characteristics of the male audio data to be trained into a second Gaussian mixture general background model for characteristic extraction to obtain characteristic vectors of the male audio data to be trained;

inputting the feature vector of the audio data to be trained of the female into a first support vector machine regression model for supervised training to obtain a first preset support vector machine regression model;

and inputting the feature vector of the male audio data to be trained into a second support vector machine regression model for supervised training to obtain a second preset support vector machine regression model.

The embodiment of the application also provides height prediction equipment based on the audio data, which comprises a processor and a memory;

the memory is used for storing the program codes and transmitting the program codes to the processor;

the processor is configured to execute the method for height prediction based on audio data in the aforementioned method embodiment according to instructions in the program code.

The embodiment of the application also provides a computer-readable storage medium for storing program codes, wherein the program codes are used for executing the height prediction method based on the audio data in the embodiment of the method.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims

1. A method for height prediction based on audio data, comprising:

acquiring audio data to be predicted;

2. The method for height prediction based on audio data according to claim 1, wherein said extracting the Mel frequency cepstrum coefficient features of the audio data to be predicted further comprises:

3. The audio-data-based height prediction method according to claim 1, wherein the preset Gaussian mixture general background model comprises a first preset Gaussian mixture general background model and a second preset Gaussian mixture general background model, and the preset support vector machine regression model comprises a first preset support vector machine regression model and a second preset support vector machine regression model;

judging the gender of the voice in the audio data to be predicted;

4. The method of claim 3, wherein the preset Gaussian mixture general background model is configured by the following steps:

5. The method of claim 4, wherein the preset support vector machine regression model is configured by the following steps:

6. A height prediction apparatus based on audio data, comprising:

an acquisition unit configured to acquire audio data to be predicted;

7. The audio-data-based height prediction apparatus according to claim 6, further comprising:

8. The apparatus for height prediction based on audio data according to claim 6, wherein the preset Gaussian mixture general background model comprises a first preset Gaussian mixture general background model and a second preset Gaussian mixture general background model, and the preset support vector machine regression model comprises a first preset support vector machine regression model and a second preset support vector machine regression model, the apparatus further comprising:

the prediction unit is specifically configured to:

9. A height prediction device based on audio data, characterized in that the device comprises a processor and a memory;

the processor is configured to execute the audio data based height prediction method of any of claims 1-5 according to instructions in the program code.

10. A computer-readable storage medium for storing program code for executing the audio data based height prediction method of any one of claims 1-5.