CN111415680B - Voice-based anxiety prediction model generation method and anxiety prediction system - Google Patents

Voice-based anxiety prediction model generation method and anxiety prediction system

Info

Publication number
CN111415680B
CN111415680B
Authority
CN
China
Prior art keywords
anxiety
voice
window
prediction model
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010220713.4A
Other languages
Chinese (zh)
Other versions
CN111415680A (en)
Inventor
冯甄陶 (Feng Zhentao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xintu Entropy Technology Suzhou Co ltd
Original Assignee
Xintu Entropy Technology Suzhou Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xintu Entropy Technology Suzhou Co ltd filed Critical Xintu Entropy Technology Suzhou Co ltd
Priority to CN202010220713.4A priority Critical patent/CN111415680B/en
Publication of CN111415680A publication Critical patent/CN111415680A/en
Application granted granted Critical
Publication of CN111415680B publication Critical patent/CN111415680B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: characterised by the type of extracted parameters
    • G10L25/27: characterised by the analysis technique
    • G10L25/30: characterised by the analysis technique using neural networks
    • G10L25/45: characterised by the type of analysis window
    • G10L25/48: specially adapted for particular use
    • G10L25/51: specially adapted for particular use for comparison or discrimination
    • G10L25/63: specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

The application provides a voice-based anxiety prediction model generation method and an anxiety prediction system, comprising the following steps: step 1: collecting the voice of a user reading a text together with the user's SAS scale score, and labeling the voice with that score; step 2: extracting voice features from the voice and constructing an anxiety prediction model using a neural network. The invention achieves automatic recognition of a subject's anxiety state from read-aloud speech through machine learning; the system is easy to operate and deploy, improving the convenience of anxiety recognition and prediction.

Description

Voice-based anxiety prediction model generation method and anxiety prediction system
Technical Field
The present invention relates to the field of psychology and artificial intelligence, and more particularly to a method and system for generating a speech-based anxiety prediction model.
Background
Anxiety disorder, also known as anxiety neurosis, is a chronic disorder characterized by uncontrollable, excessive, pervasive, and persistent anxiety, with anxious emotional experience as its primary feature. Its main manifestations are worry without a clear object, restlessness, and autonomic symptoms such as palpitations, hand tremors, sweating, and frequent urination. The anxiety is not triggered by a real threat, or its intensity is grossly disproportionate to the actual situation. Medication (e.g., anxiolytics) and psychotherapy are the main treatments for anxiety disorder.
Anxiety disorder is arguably the most common mood disorder in the general population. An epidemiological study of the prevalence of mental disorders in China published in The Lancet Psychiatry reported that among psychological and mental diseases, anxiety disorder has the highest prevalence, with a lifetime prevalence of 7.57%; by this estimate there are more than 50 million anxiety patients nationwide. The World Health Organization notes that 90% of anxiety patients develop the disorder before age 35, with women affected more often than men, and in recent years the number of anxiety patients has risen continuously. According to the World Health Organization, about 41 million people in China suffer from anxiety disorder, so the recognition and treatment of anxiety disorders is of great concern. Studies have found that although anxiety disorder is treatable, only 36.9% of patients receive treatment; the biggest obstacle is recognizing anxiety in the first place.
To date, there is no specific clinical test for anxiety disorders. Current approaches to diagnosing anxiety include: (1) screening by self-report scales and self-assessment, such as the Self-Rating Anxiety Scale (SAS); (2) diagnosis by a specialist based on medical history, family history, clinical symptoms, course of disease, and physical examination. Assessment of anxiety symptoms currently relies mainly on self-report scales. However, self-report assessment takes a long time and depends on the subjective cooperation of the subject, while a physician's diagnosis, which synthesizes many kinds of patient information, costs considerable effort and time and is prone to misdiagnosis. Moreover, where long-term monitoring of the anxiety state is required, asking the user to answer the same questions repeatedly and frequently is not feasible. A more convenient, objective, and real-time way to assess users' anxiety states is therefore urgently needed.
The Self-Rating Anxiety Scale (SAS) was developed by the Chinese-American professor W. W. K. Zung (1971). In both the construction of the scale and the specific assessment procedure it closely resembles the Self-Rating Depression Scale (SDS), and it is a relatively simple clinical tool for analyzing a patient's subjective symptoms. Because anxiety is a common mood disorder in psychological counseling clinics, the SAS has in recent years become a standard scale for assessing anxiety symptoms in such clinics.
The SAS uses a 4-level score that mainly evaluates how frequently symptoms occur, with the following criteria: "1" means none or a little of the time; "2" means some of the time; "3" means a good part of the time; "4" means most or all of the time. Of the 20 items, 15 are negatively worded and scored in the order 1-4; the remaining 5 items (items 5, 9, 13, 17, 19) are positively worded and scored in reverse order, 4-1.
The main statistical index of the SAS is the total score. The scores of the 20 items are added to obtain a raw score; the standard score is obtained by multiplying the raw score by 1.25 and taking the integer part, or the same conversion can be done by looking up a table.
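Purely as an illustration of the scoring rule just described (not part of the patent), the SAS standard score could be computed as follows:

```python
def sas_standard_score(responses):
    """Compute the SAS standard score from 20 item responses.

    responses: dict mapping item number (1-20) to the selected level (1-4).
    Items 5, 9, 13, 17, 19 are positively worded and scored in reverse
    (4-1); the rest are scored as answered (1-4). Per the description
    above, raw score * 1.25, integer part, gives the standard score.
    """
    reverse_items = {5, 9, 13, 17, 19}
    raw = sum(5 - v if i in reverse_items else v
              for i, v in responses.items())
    return int(raw * 1.25)

# Example: a respondent answering "2" to every item.
print(sas_standard_score({i: 2 for i in range(1, 21)}))  # raw 45 -> 56
```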
Disclosure of Invention
To overcome the above shortcomings of the prior art, the invention collects recordings of users reading the same text, labels the audio data with each user's SAS scale score, and then builds an anxiety prediction model based on a neural network. An automatic voice-based anxiety state prediction system is then constructed from the anxiety prediction model.
According to one aspect of the present invention, a method for generating a voice-based anxiety prediction model is provided, comprising: step 1: collecting the voice of a user reading a text together with the user's SAS scale score, and labeling the voice with that score; step 2: extracting voice features from the voice and constructing an anxiety prediction model using a neural network.
Preferably, the step 2 includes the following steps:
S21: setting a sub-voice length N and a window x, and cutting the voice into sub-voices of length N;
S22: performing windowed segmentation on the sub-voices under window x to generate the voice features of the sub-voices under window x;
S23: dividing the sub-voices into training sub-voices and test sub-voices;
S24: taking the voice features of the training sub-voices under window x as input and the SAS scale scores of the training sub-voices as output, and constructing the anxiety prediction model under window x using a neural network algorithm.
Preferably, the voice features include basic features, derivative (delta) features of the basic features, and statistics of the basic and derivative features over the length of a time window; the basic features include intensity, loudness, zero-crossing rate, voicing probability, fundamental frequency envelope, 8 line spectral pairs, and 12 Mel-frequency cepstral coefficients, and the statistics include mean, standard deviation, kurtosis, skewness, and slope.
Preferably, the method further comprises:
S25: computing the voice features of the test sub-voices under window x, inputting them into the anxiety prediction model under window x, and obtaining the difference between the model output and the SAS score of each test sub-voice;
S26: computing the average difference of the anxiety prediction model under window x by the formula: average difference = (sum of the differences) / (number of test sub-voices under window x);
S27: traversing x from 1 to N-1 and repeating steps S22 to S26, then taking the anxiety prediction model under the window x with the smallest average difference as the final anxiety prediction model; the window of that model is the optimal window length.
Alternatively, x may take several manually set values smaller than N.
According to another aspect of the present invention, a voice-based anxiety prediction system is provided, comprising: a data acquisition module, a voice feature extraction module, a training sample construction module, a neural network training module, an anxiety prediction model generation module, and a prediction module, wherein:
the data acquisition module is used to acquire the subject's voice;
the voice feature extraction module is used to receive a voice, the sub-voice length N, and the window length x, and to extract and return the voice features under window x;
the training sample construction module is used to collect users' voices and SAS scale scores and to label each voice with its SAS scale score; it also passes the users' voices, the sub-voice length, and the window length to the voice feature extraction module, and divides the returned sub-voices into training sub-voices and test sub-voices in a set proportion;
the neural network training module is used to construct the anxiety prediction model under window x from the training sub-voices using a neural network algorithm;
the anxiety prediction model generation module is used to generate the final anxiety prediction model and the optimal window length;
the anxiety prediction module is used to receive the subject's voice, input it together with the optimal window length into the voice feature extraction module, pass the returned sub-voice features to the anxiety prediction model, and judge the subject's anxiety state from the anxiety state score returned by the model.
Preferably, in the voice feature extraction module, the voice is cut into sub-voices of length N, and windowed segmentation is then performed to generate the voice features of the sub-voices under window x.
Preferably, the voice features include basic features, derivative (delta) features of the basic features, and statistics of the basic and derivative features over the length of a time window; the basic features include intensity, loudness, zero-crossing rate, voicing probability, fundamental frequency envelope, 8 line spectral pairs, and 12 Mel-frequency cepstral coefficients, and the statistics include mean, standard deviation, kurtosis, skewness, and slope.
Preferably, the neural network training module receives the window length x and the training sub-voices, takes the voice features of the training sub-voices under window x as input and the SAS scale scores of the training sub-voices as output, and constructs the anxiety prediction model under window x using a neural network algorithm.
Preferably, the anxiety prediction model generation module receives the test sub-voices under window x and the anxiety prediction model under window x, computes the difference between the output obtained after inputting the voice features of the test sub-voices into the model and the SAS scores of the test sub-voices, and then computes the average difference of the model under window x; by traversing x from 1 to N-1, anxiety prediction models under N-1 windows are obtained, and the model with the smallest average difference is selected as the final anxiety prediction model, whose window length is the optimal window length.
A prediction model obtained from text-reading speech in this way can automatically and effectively identify the user's anxiety condition at the current moment, with recognition accuracy above 70% when distinguishing high- from low-anxiety groups, making it a convenient means of early warning for psychological state.
Drawings
FIG. 1 is a flow chart of a method for generating a speech-based anxiety prediction model according to one embodiment of the present invention;
FIG. 2 is a flow chart of a method of constructing an anxiety predictive model in accordance with one embodiment of the invention;
fig. 3 is a schematic diagram of the structure of a speech-based anxiety prediction system according to one embodiment of the present invention.
Specific dimensions, structures, and devices are labeled in the drawings in order to clearly present the structure of embodiments of the present invention, but this is for illustration only and is not intended to limit the invention to those specific dimensions, structures, devices, and environments. Those skilled in the art may adjust or modify these devices and environments according to specific needs, and such adjustments or modifications remain within the scope of the appended claims.
Detailed Description
The anxiety recognition method and early warning system based on reading a specific text are described in detail below with reference to the accompanying drawings and specific embodiments.
In the following description, various aspects of the present invention are described; however, it will be apparent to those skilled in the art that the present invention may be practiced with only some or all of its structures or processes. For purposes of explanation, specific numbers, configurations, and orders are set forth, but it is evident that the invention may be practiced without these specific details. In other instances, well-known features are not described in detail so as not to obscure the invention.
In this invention, the subject refers to the person whose anxiety state is to be predicted, and a user refers to a person from whom a voice recording and an SAS scale score are collected for training.
The invention provides a method for generating a voice-based anxiety prediction model, shown in fig. 1, comprising the following steps: step 1: collecting the voice of a user reading a text together with the user's SAS scale score, and labeling the voice with that score; step 2: extracting voice features from the voice and constructing an anxiety prediction model using a neural network.
In step 1, the data collected from different individuals must follow the same protocol so that they are comparable. If users are asked to read a specific text, all of them read the same text, which may be a short neutral passage, for example an introduction to a scenic spot of 300-500 words. The recording environment should be as quiet as possible to keep the speech free of noise.
In voice-based anxiety recognition and prediction, many kinds of speech material can be used, such as reading the same neutral text, giving a self-introduction following a specified outline, or describing the same picture. During collection, the same mode should be used for everyone so that the recordings are consistent and comparable.
After reading, the user fills in the SAS scale; the scale score is associated with the recorded voice, and when the voice is cut into sub-voices, each sub-voice is labeled with that scale score.
Step 2 comprises four steps, shown in fig. 2 and described in detail below.
S21: the sub-voice length N and the window x are set, and the voice is cut into sub-voices of length N; the units of N and x may be milliseconds. Since voices of multiple users are collected, each user's voice is divided into several parts, i.e., several sub-voices.
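A minimal sketch of this cut, assuming a 1-D sample array and a known sample rate (variable names are illustrative, and the patent does not say how a trailing remainder shorter than N is handled, so it is simply dropped here):

```python
import numpy as np

def cut_into_subvoices(signal, sr, n_ms):
    """Cut a 1-D audio signal into consecutive sub-voices of n_ms milliseconds.

    signal: 1-D numpy array of samples; sr: sample rate in Hz.
    A trailing remainder shorter than N is dropped (an assumption).
    """
    n_samples = int(sr * n_ms / 1000)
    n_full = len(signal) // n_samples
    return [signal[i * n_samples:(i + 1) * n_samples] for i in range(n_full)]

# Example: 6 seconds of placeholder audio at 16 kHz, N = 1000 ms -> 6 sub-voices.
audio = np.zeros(16000 * 6)
subvoices_demo = cut_into_subvoices(audio, sr=16000, n_ms=1000)
print(len(subvoices_demo))  # 6
```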
S22: windowed segmentation is performed on the sub-voices under window x to generate the voice features of the sub-voices under window x.
In one embodiment, feature extraction first extracts 25 basic speech features (intensity, loudness, zero-crossing rate, voicing probability, fundamental frequency envelope, 8 line spectral pairs, and 12 Mel-frequency cepstral coefficients); derivative (delta) features are then computed for all basic features to express their dynamic changes, and 5 statistics (mean, standard deviation, kurtosis, skewness, and slope) of the basic and derivative features are computed over the window segmentation, giving (25+25)×5 = 250 features in total.
The windowing process itself is prior art: a number of windows of length x are cut from a sub-voice of length N, and the voice features of the sub-voice are then generated over these windows. Different values of x yield different voice features; how the different values of x are compared to select the prediction model is described later.
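As a hedged sketch only (the library choice, a hop equal to the window length, and the reduced feature subset are all assumptions; line spectral pairs, loudness, F0 envelope, and voicing probability from the 25-feature set are omitted here for brevity), per-window features and their five statistics over one sub-voice could be computed like this:

```python
import numpy as np
import librosa
from scipy.stats import kurtosis, skew

def subvoice_features(sub, sr, x_ms):
    """Compute windowed features of one sub-voice under window x (sketch).

    Uses RMS intensity, zero-crossing rate, and 12 MFCCs as a stand-in for
    the full 25 basic features; deltas are appended, then 5 statistics are
    taken over the window axis: mean, std, kurtosis, skewness, slope.
    """
    frame = int(sr * x_ms / 1000)
    base = np.vstack([
        librosa.feature.rms(y=sub, frame_length=frame, hop_length=frame),
        librosa.feature.zero_crossing_rate(sub, frame_length=frame,
                                           hop_length=frame),
        librosa.feature.mfcc(y=sub, sr=sr, n_mfcc=12, n_fft=frame,
                             hop_length=frame),
    ])                                        # shape: (n_features, n_windows)
    feats = np.vstack([base, librosa.feature.delta(base)])  # add deltas
    t = np.arange(feats.shape[1])
    slope = np.polyfit(t, feats.T, 1)[0]      # per-feature linear slope
    return np.concatenate([feats.mean(axis=1), feats.std(axis=1),
                           kurtosis(feats, axis=1), skew(feats, axis=1),
                           slope])
```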
S23: dividing the sub-voices into training sub-voices and test sub-voices.
In this operation, the sub-voices of all users are divided into training sub-voices and test sub-voices. Since the previous step extracted voice features under window x, both training and test sub-voices carry their voice features under window x, as well as the SAS scale score of the voice they came from. In one embodiment, a set proportion (e.g., 80%) of the collected samples (one sample comprising one user's voice and that user's SAS scale score) is randomly selected as training data, with the remaining samples used as test data.
S24: taking the voice features of the training sub-voices under window x as input and the SAS scale scores of the training sub-voices as output, and constructing the anxiety prediction model under window x using a neural network algorithm.
That is, the training data for the same sampling time window x are fed to a neural network, and the network is trained to obtain its parameters, yielding the anxiety prediction model under window x. Training a neural network from input and output data is a routine technique for those skilled in the art, with mature programming frameworks available. Different values of x thus yield different anxiety prediction models.
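The patent does not fix a network architecture, so the following is only one plausible sketch using scikit-learn's MLPRegressor; the hidden-layer sizes and other hyperparameters are assumptions, and X_train / y_train come from the split sketch above:

```python
from sklearn.neural_network import MLPRegressor

# A small feed-forward network regressing the SAS score from the features.
model_x = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=1000,
                       random_state=0)
model_x.fit(X_train, y_train)  # anxiety prediction model under window x
```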
In one embodiment, x traverses [1, N-1]; through S22-S24 above, anxiety prediction models under N-1 different windows x are obtained. The anxiety prediction model under each window x is then processed as follows:
S25: computing the voice features of the test sub-voices under window x, inputting them into the anxiety prediction model under window x, and obtaining the difference between the model output and the SAS score of each test sub-voice;
S26: computing the average difference of the anxiety prediction model under window x by the formula: average difference = (sum of the differences) / (number of test sub-voices under window x).
The anxiety prediction model under the window x with the smallest average difference is taken as the final anxiety prediction model, and the window length corresponding to it is the optimal window length. Alternatively, x may take several preset values to speed up the search.
In use, only the subject's read-aloud voice needs to be collected; the voice is then cut into sub-voices according to the N and x of the model, the voice features under window x are generated and input into the anxiety prediction model, and an anxiety state score is obtained. Whether the predicted value lies within the safe range is judged against a rule base: if so, the user's psychological state is good; otherwise, an abnormal psychological state is indicated. The rule base adopts the SAS standard.
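This judgment step can be sketched as follows; the cutoff of 50 is an assumed threshold taken from common SAS practice (the patent only states that the rule base adopts the SAS standard), and model_x and X_test refer to the illustrative objects from the sketches above:

```python
def judge_anxiety(predicted_score, threshold=50):
    """Judge a predicted SAS standard score against a safe range.

    threshold=50 is an assumption based on a commonly used SAS cutoff;
    scores below it are treated as within the safe range.
    """
    return "normal" if predicted_score < threshold else "abnormal"

# Example with the illustrative model and test features from above:
print(judge_anxiety(model_x.predict(X_test[:1])[0]))
```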
In one embodiment, to increase computation speed, the window length x is set manually instead of by traversal. The audio is 6000 ms of neutral-text reading; x = 15 ms, x = 30 ms, and x = 45 ms are set respectively, constructing three groups of sample data. For each group, some samples are randomly selected as training data, for example 80% of the samples (8 of 10), with the remaining 2 used as test data. After a neural network model for anxiety prediction is built from the training data, the test data are input to obtain prediction results, the error between each prediction and the actual SAS result is computed, and the mean of these errors is used as the performance evaluation value of the model. Three neural networks were trained for the different x, with performance evaluation values of 0.45, 0.23, and 0.30 respectively. Comparing the three values of x, the x = 30 ms model with the smallest error is selected as the optimal anxiety prediction model, i.e., the final anxiety prediction model.
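The selection procedure of this embodiment can be tied together in one sketch, reusing the illustrative subvoice_features helper defined earlier; the placeholder data, network architecture, and mean-absolute-difference metric are assumptions consistent with steps S25-S27:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Placeholder data: 40 one-second sub-voices at 16 kHz with SAS-score labels.
rng = np.random.default_rng(0)
subvoices = [rng.standard_normal(16000) for _ in range(40)]
scores = rng.integers(25, 80, size=40)

def evaluate_window(x_ms, sr=16000):
    """Train a model under window x and return (model, average difference)."""
    X = np.array([subvoice_features(s, sr, x_ms) for s in subvoices])
    X_tr, X_te, y_tr, y_te = train_test_split(X, scores, train_size=0.8,
                                              random_state=0)
    model = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=1000,
                         random_state=0).fit(X_tr, y_tr)
    # S26: average difference = sum of differences / number of test sub-voices
    return model, np.mean(np.abs(model.predict(X_te) - y_te))

# Manually set window lengths, as in this embodiment (x = 15, 30, 45 ms).
results = {x: evaluate_window(x) for x in (15, 30, 45)}
best_x = min(results, key=lambda x: results[x][1])
final_model = results[best_x][0]   # best_x is the optimal window length
```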
According to another aspect of the present invention, a voice-based anxiety prediction system is provided, as shown in fig. 3, comprising a data acquisition module, a voice feature extraction module, a training sample construction module, a neural network training module, an anxiety prediction model generation module, and a prediction module, wherein:
the data acquisition module is used to acquire the subject's voice;
the voice feature extraction module is used to receive a voice, the sub-voice length N, and the window length x, and to extract and return the voice features under window x;
the training sample construction module is used to collect users' voices and SAS scale scores and to label each voice with its SAS scale score; it also passes the users' voices, the sub-voice length, and the window length to the voice feature extraction module, and divides the returned sub-voices into training sub-voices and test sub-voices in a set proportion;
the neural network training module is used to construct the anxiety prediction model under window x from the training sub-voices using a neural network algorithm;
the anxiety prediction model generation module is used to generate the final anxiety prediction model and the optimal window length; the output of the anxiety prediction model is an anxiety state score;
the anxiety prediction module is used to receive the subject's voice, input it together with the optimal window length into the voice feature extraction module, pass the returned sub-voice features to the anxiety prediction model, and judge the subject's anxiety state from the returned anxiety state score.
In the voice feature extraction module, 25 basic features (intensity, loudness, zero-crossing rate, voicing probability, fundamental frequency envelope, 8 line spectral pairs, and 12 Mel-frequency cepstral coefficients) are first extracted; derivative (delta) features are computed for all basic features to express the dynamic changes of the voice features, and 5 statistics (mean, standard deviation, kurtosis, skewness, and slope) are computed over the window segmentation. This gives (25+25)×5 = 250 features in total.
The neural network training module trains the anxiety prediction model with the training sub-voices passed from the training sample construction module: training sub-voices with the same sampling time window length x serve as the input of the anxiety prediction model, and the SAS scale scores of their voices as the output, yielding the anxiety prediction model under window x. Each training sub-voice carries an SAS scale score and its voice features under window x.
The anxiety prediction model generation module is used to generate the final anxiety prediction model. Specifically, it receives the test sub-voices under window x and the anxiety prediction model under window x, computes the difference between the output obtained after inputting the voice features of the test sub-voices into the model and the SAS scores of the test sub-voices, and then computes the average difference of the model under window x. By traversing x from 1 to N-1, anxiety prediction models under N-1 windows are obtained, and the model with the smallest average difference is selected as the final anxiety prediction model; the corresponding window length is the optimal window length, and the model's output is the anxiety state score.
To speed up computation, x can also be set manually rather than traversed N-1 times.
The anxiety prediction module receives the subject's voice, inputs the voice and the optimal sampling time window length into the voice feature extraction module to generate the voice features under the optimal window, inputs these features into the anxiety prediction model to obtain the subject's anxiety state score, and compares the score against a judgment rule to determine whether the psychological state is abnormal. The judgment rule may be the SAS standard rule.
Finally, it should be noted that the above embodiments are intended only to describe the technical solution of the present invention, not to limit it; the invention extends to other modifications, variations, applications, and embodiments, and all such modifications, variations, applications, and embodiments are considered within the spirit and scope of the teachings of the present invention.

Claims (9)

1. A method of generating a voice-based anxiety prediction model, comprising:
step 1: collecting the voice of a user reading a text together with the user's SAS scale score, and labeling the voice with that score;
step 2: extracting voice features from the voice and constructing an anxiety prediction model using a neural network, comprising:
S25: computing the difference between the output obtained after inputting the voice features of the test sub-voices under window x into the anxiety prediction model under window x and the SAS scores of the test sub-voices;
S26: computing the average difference of the anxiety prediction model under window x;
S27: traversing x from 1 to N-1 and repeating steps S25 and S26, then taking the anxiety prediction model under the window x with the smallest average difference as the anxiety prediction model, the window of which is the optimal window length;
where N is the sub-voice length.
2. The method according to claim 1, wherein in step 2, constructing the anxiety prediction model under window x comprises the steps of:
S21: setting a sub-voice length N and a window x, and cutting the voice into sub-voices of length N;
S22: performing windowed segmentation on the sub-voices under window x to generate the voice features of the sub-voices under window x;
S23: dividing the sub-voices into training sub-voices and test sub-voices;
S24: taking the voice features of the training sub-voices under window x as input and the SAS scale scores of the training sub-voices as output, and constructing the anxiety prediction model under window x using a neural network algorithm.
3. The method of claim 1, wherein the voice features include basic features, derivative features of the basic features, and statistics of the basic and derivative features over the length of a time window, wherein the basic features include intensity, loudness, zero-crossing rate, voicing probability, fundamental frequency envelope, 8 line spectral pairs, and 12 Mel-frequency cepstral coefficients, and the statistics include mean, standard deviation, kurtosis, skewness, and slope.
4. The method according to claim 1, wherein in step S26, the average difference of the anxiety prediction model under window x is computed by the formula: average difference = (sum of the differences) / (number of test sub-voices under window x).
5. The method according to claim 1, wherein in step S27, x takes a number of manually set values smaller than N.
6. A voice-based anxiety prediction system, comprising: a data acquisition module, a voice feature extraction module, a training sample construction module, a neural network training module, an anxiety prediction model generation module, and a prediction module, wherein:
the data acquisition module is used to acquire the subject's voice;
the voice feature extraction module is used to receive a voice, the sub-voice length N, and the window length x, and to extract and return the voice features under window x;
the training sample construction module is used to collect users' voices and SAS scale scores and to label each voice with its SAS scale score; it also passes the users' voices, the sub-voice length, and the window length to the voice feature extraction module, and divides the returned sub-voices into training sub-voices and test sub-voices in a set proportion;
the neural network training module is used to construct the anxiety prediction model under window x from the training sub-voices using a neural network algorithm;
the anxiety prediction model generation module is used to generate the final anxiety prediction model and the optimal window length;
the anxiety prediction module is used to receive the subject's voice, input it together with the optimal window length into the voice feature extraction module, pass the returned sub-voice features to the anxiety prediction model, and judge the subject's anxiety state from the anxiety state score returned by the anxiety prediction model;
the anxiety prediction model generation module receives the test sub-voices under window x and the anxiety prediction model under window x, computes the difference between the output obtained after inputting the voice features of the test sub-voices into the model and the SAS scores of the test sub-voices, and then computes the average difference of the model under window x; by traversing x from 1 to N-1, anxiety prediction models under N-1 windows are obtained, and the model with the smallest average difference is selected as the final anxiety prediction model; the corresponding window length is the optimal window length, and the output of the anxiety prediction model is the anxiety state score.
7. The anxiety prediction system of claim 6, wherein the voice feature extraction module cuts the voice into sub-voices of length N and then performs windowed segmentation to generate the voice features of the sub-voices under window x.
8. The anxiety prediction system of claim 6, wherein the voice features comprise basic features, derivative features of the basic features, and statistics of the basic and derivative features over the length of a time window, the basic features including intensity, loudness, zero-crossing rate, voicing probability, fundamental frequency envelope, 8 line spectral pairs, and 12 Mel-frequency cepstral coefficients, and the statistics including mean, standard deviation, kurtosis, skewness, and slope.
9. The anxiety prediction system of claim 7, wherein the neural network training module receives the window length x and the training sub-voices, takes the voice features of the training sub-voices under window x as input and the SAS scale scores of the training sub-voices as output, and constructs the anxiety prediction model under window x using a neural network algorithm.
CN202010220713.4A 2020-03-26 2020-03-26 Voice-based anxiety prediction model generation method and anxiety prediction system Active CN111415680B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010220713.4A CN111415680B (en) 2020-03-26 2020-03-26 Voice-based anxiety prediction model generation method and anxiety prediction system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010220713.4A CN111415680B (en) 2020-03-26 2020-03-26 Voice-based anxiety prediction model generation method and anxiety prediction system

Publications (2)

Publication Number Publication Date
CN111415680A CN111415680A (en) 2020-07-14
CN111415680B true CN111415680B (en) 2023-05-23

Family

ID=71494595

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010220713.4A Active CN111415680B (en) 2020-03-26 2020-03-26 Voice-based anxiety prediction model generation method and anxiety prediction system

Country Status (1)

Country Link
CN (1) CN111415680B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507959A (en) * 2020-12-21 2021-03-16 中国科学院心理研究所 Method for establishing emotion perception model based on individual face analysis in video
CN112800908B (en) * 2021-01-19 2024-03-26 中国科学院心理研究所 Method for establishing anxiety perception model based on individual gait analysis in video

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6691082B1 (en) * 1999-08-03 2004-02-10 Lucent Technologies Inc Method and system for sub-band hybrid coding

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI312981B (en) * 2006-11-30 2009-08-01 Inst Information Industr Voice detection apparatus, method, computer program product, and computer readable medium for adjusting a window size dynamically
WO2015078501A1 (en) * 2013-11-28 2015-06-04 Widex A/S Method of operating a hearing aid system and a hearing aid system
CN106504772B (en) * 2016-11-04 2019-08-20 东南大学 Speech-emotion recognition method based on weights of importance support vector machine classifier
CN106725532B (en) * 2016-12-13 2018-04-24 兰州大学 Depression automatic evaluation system and method based on phonetic feature and machine learning
CN107633851B (en) * 2017-07-31 2020-07-28 极限元(杭州)智能科技股份有限公司 Discrete speech emotion recognition method, device and system based on emotion dimension prediction
CN108389631A (en) * 2018-02-07 2018-08-10 平安科技(深圳)有限公司 Varicella morbidity method for early warning, server and computer readable storage medium
CN109036466B (en) * 2018-08-01 2022-11-29 太原理工大学 Emotion dimension PAD prediction method for emotion voice recognition
CN108806724B (en) * 2018-08-15 2020-08-25 太原理工大学 Method and system for predicting sentiment voice PAD value
US11545173B2 (en) * 2018-08-31 2023-01-03 The Regents Of The University Of Michigan Automatic speech-based longitudinal emotion and mood recognition for mental health treatment
CN109599129B (en) * 2018-11-13 2021-09-14 杭州电子科技大学 Voice depression recognition system based on attention mechanism and convolutional neural network

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6691082B1 (en) * 1999-08-03 2004-02-10 Lucent Technologies Inc Method and system for sub-band hybrid coding

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Effects of focused meditation on anxiety level and sleep quality in middle-aged and elderly patients with hypertension; Zhang Yin; Xu Lei; Zhao Yali; Hu Shuxia; Nursing Research (护理研究), Issue 13 *

Also Published As

Publication number Publication date
CN111415680A (en) 2020-07-14

Similar Documents

Publication Publication Date Title
Alghowinem et al. Detecting depression: a comparison between spontaneous and read speech
CN106725532A (en) Depression automatic evaluation system and method based on phonetic feature and machine learning
CN106073706B (en) A kind of customized information and audio data analysis method and system towards Mini-mental Status Examination
CN110570941B (en) System and device for assessing psychological state based on text semantic vector model
CN111415680B (en) Voice-based anxiety prediction model generation method and anxiety prediction system
CN112006697A (en) Gradient boosting decision tree depression recognition method based on voice signals
Ravi et al. Fraug: A frame rate based data augmentation method for depression detection from speech signals
Weiner et al. Selecting features for automatic screening for dementia based on speech
Bonin et al. Determinants of naming latencies, object comprehension times, and new norms for the Russian standardized set of the colorized version of the Snodgrass and Vanderwart pictures
Dumpala et al. Estimating severity of depression from acoustic features and embeddings of natural speech
CN111145903A (en) Method and device for acquiring vertigo inquiry text, electronic equipment and inquiry system
CN113111151A (en) Cross-modal depression detection method based on intelligent voice question answering
Berisha et al. Are reported accuracies in the clinical speech machine learning literature overoptimistic?
An et al. Mental health detection from speech signal: A convolution neural networks approach
CN116110578A (en) Screening device for diagnosis of depression symptoms assisted by computer
Benba et al. Using RASTA-PLP for discriminating between different neurological diseases
CN114038562A (en) Psychological development assessment method, device and system and electronic equipment
Danner et al. Advancing Mental Health Diagnostics: GPT-Based Method for Depression Detection
CN111341346A (en) Language expression capability evaluation method and system for fusion depth language generation model
Wang et al. MFCC-based deep convolutional neural network for audio depression recognition
Jenei et al. Severity estimation of depression using convolutional neural network
Li et al. The far side of failure: Investigating the impact of speech recognition errors on subsequent dementia classification
Singh et al. Analyzing machine learning algorithms for speech impairment related issues
Gillespie et al. Exploratory analysis of speech features related to depression in adults with Aphasia
TW201828216A (en) Automated language evaluation method including a preparation step, a pronunciation step, and a grading step

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant