CN110047469B - Voice data emotion marking method and device, computer equipment and storage medium - Google Patents

Voice data emotion marking method and device, computer equipment and storage medium

Info

Publication number
CN110047469B
CN110047469B
Authority
CN
China
Prior art keywords
voice
cluster
keyword
emotion
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910279565.0A
Other languages
Chinese (zh)
Other versions
CN110047469A (en)
Inventor
王义文
张文龙
王健宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910279565.0A
Publication of CN110047469A (application publication)
Application granted
Publication of CN110047469B (granted publication)
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/04 Segmentation; Word boundary detection
    • G10L 15/26 Speech to text systems
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L 25/63 Speech or voice analysis techniques specially adapted for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a voice data emotion marking method and device, computer equipment and a storage medium, which are applied to the field of voice data processing and are used for solving the problem that the accuracy of current voice data emotion marking is low. The method provided by the invention comprises the following steps: acquiring target voice to be marked with emotion; performing voice recognition on the target voice to obtain a target text; extracting each keyword in the target text, and recording a voice fragment corresponding to each keyword; determining a feature vector corresponding to each keyword according to the voice fragment corresponding to each keyword; after the feature vectors corresponding to the keywords are obtained, clustering the feature vectors until a preset condition is met, and obtaining vector sets after clustering; randomly extracting a first number of feature vectors from each vector set; and obtaining scoring values given as a whole by labeling personnel, in each appointed emotion dimension, for the voice sentences corresponding to each vector set, and taking the scoring values as emotion labeling values.

Description

Voice data emotion marking method and device, computer equipment and storage medium
Technical Field
The present invention relates to the field of speech data processing, and in particular, to a method and apparatus for emotion marking of speech data, a computer device, and a storage medium.
Background
Currently, speech data is often used as one of the materials for big data analysis or machine learning; for example, deep learning models for identifying the emotion of a speaking person require training on a large amount of emotion-labeled speech data. To make voice materials usable for such applications, most existing methods collect a large amount of voice data under different scenes, and then a dedicated person listens to the voice data, manually judges the personal emotion reflected in the voice, and marks the voice data accordingly.
However, because the collected voice data come from different scenes and there are considerable differences among speakers, the emotions contained in the voice data are varied and difficult for staff to identify accurately based on personal experience and knowledge alone; moreover, when the volume of voice data is large, the workload easily enlarges the deviation in the staff's identification, further reducing the accuracy of emotion marking of the voice data.
Disclosure of Invention
The embodiment of the invention provides a voice data emotion marking method, a voice data emotion marking device, computer equipment and a storage medium, which are used for solving the problem that the accuracy of current voice data emotion marking is low.
A voice data emotion marking method comprises the following steps:
acquiring target voice to be marked with emotion;
performing voice recognition on the target voice to obtain a target text;
extracting each keyword in the target text, and recording a voice fragment corresponding to each keyword;
determining a feature vector corresponding to each keyword according to the voice fragment corresponding to each keyword;
after the feature vectors corresponding to the keywords are obtained, clustering the feature vectors corresponding to the keywords until preset conditions are met, and obtaining vector sets after clustering, wherein each vector set comprises more than one feature vector;
randomly extracting a first number of feature vectors from each of said respective sets of vectors for each set of vectors;
and obtaining scoring values of voice sentences corresponding to each vector set by labeling personnel in each appointed emotion dimension as emotion labeling values of each vector set, wherein the voice sentences corresponding to each vector set refer to voice fragments of the complete sentences in which the keywords corresponding to the feature vectors extracted from each vector set are respectively located.
A voice data emotion marking device, comprising:
the target voice acquisition module is used for acquiring target voice to be marked with emotion;
the voice recognition module is used for carrying out voice recognition on the target voice to obtain a target text;
the keyword extraction module is used for extracting each keyword in the target text and recording a voice fragment corresponding to each keyword;
the feature vector determining module is used for determining feature vectors corresponding to the keywords according to the voice fragments corresponding to the keywords;
the feature vector clustering module is used for clustering the feature vectors corresponding to the keywords after the feature vectors corresponding to the keywords are obtained, until preset conditions are met, each vector set is obtained after clustering, and each vector set comprises more than one feature vector;
a random extraction module, configured to randomly extract, for each of the respective vector sets, a first number of feature vectors from the each vector set;
the scoring value acquisition module is used for acquiring scoring values of voice sentences corresponding to each vector set by labeling personnel in each appointed emotion dimension on the whole, wherein the voice sentences corresponding to each vector set are voice fragments of complete sentences in which each keyword corresponding to each feature vector extracted from each vector set is respectively located, and the scoring values are used as emotion labeling values of each vector set.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the voice data emotion markup method described above when the computer program is executed.
A computer readable storage medium storing a computer program which, when executed by a processor, performs the steps of the voice data emotion markup method described above.
According to the voice data emotion marking method and device, the computer equipment and the storage medium, firstly, target voice to be emotion marked is obtained; then, voice recognition is carried out on the target voice to obtain a target text; each keyword in the target text is extracted, and the voice fragment corresponding to each keyword is recorded; the feature vector corresponding to each keyword is determined according to the voice fragment corresponding to each keyword; after the feature vectors corresponding to the keywords are obtained, the feature vectors are clustered until a preset condition is met, and vector sets are obtained after clustering, wherein each vector set comprises more than one feature vector; for each vector set, a first number of feature vectors is randomly extracted from that vector set; and finally, scoring values given as a whole by labeling personnel, in each appointed emotion dimension, for the voice sentences corresponding to each vector set are obtained as the emotion labeling values of each vector set, wherein the voice sentences corresponding to each vector set refer to the voice fragments of the complete sentences in which the keywords corresponding to the extracted feature vectors are respectively located. Before emotion marking, the invention uses a clustering method to aggregate voice fragments with similar feature vectors, and after aggregation the voice fragments are labeled manually in the appointed emotion dimensions, thereby completing emotion marking of the voice data. This greatly reduces the labeling workload, and dividing emotion into dimensions helps staff accurately recognize the emotion in the voice and give annotation scores, which reduces deviation to a certain extent and improves the accuracy of emotion marking of voice data.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an application environment of a voice data emotion marking method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for emotion marking of voice data according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of step 104 of the voice data emotion marking method in an application scenario according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart of step 105 of the voice data emotion marking method in an application scenario according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart of forming a data record in the voice data emotion marking method in an application scenario according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a voice data emotion marking device according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a feature vector clustering module according to an embodiment of the invention;
FIG. 8 is a schematic diagram of a cluster determining unit according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a computer device in accordance with an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The voice data emotion marking method provided by the application can be applied to an application environment as shown in fig. 1, wherein a client communicates with a server through a network. The client may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.
In one embodiment, as shown in fig. 2, a voice data emotion marking method is provided, and the method is applied to the server in fig. 1 for illustration, and includes the following steps:
101. Acquiring target voice to be marked with emotion;
in this embodiment, when the server receives a task of emotion marking for some voice data, the server may acquire the voice data and determine the voice data as a target voice to be emotion marked. It will be understood that the voice data are all the voices collected from the personnel under various scenes, specifically, a plurality of specific scenes can be considered to be set to collect the voice data, for example, a designated recitation stage can be set, and a tester reads a designated text on the stage, so as to obtain the voice data under the recitation scenes; alternatively, a specified free communication scene can be set, a plurality of testers talk to each other in the scene, and voice data can be obtained by recording voices of the communication. It is known that the voice data collected by the server in these scenes can be used as the target voice.
102. Performing voice recognition on the target voice to obtain a target text;
After the server obtains the target voice to be marked with emotion, in order to find out the important voice fragments, namely the voice fragments corresponding to keywords, from the target voice, the server needs to perform recognition, that is, speech-to-text conversion, on the target voice, so as to obtain the target text corresponding to the target voice.
Preferably, in order to improve the accuracy of the speech recognition, the target speech may be preprocessed, such as denoising, deleting stop words, deleting punctuation, etc., before step 102.
103. Extracting each keyword in the target text, and recording a voice fragment corresponding to each keyword;
It can be understood that the server may preset a keyword library, in which the words that need to be used in emotion markup are recorded. The server can therefore detect whether any words recorded in the keyword library appear in the target text, and extract the words in the target text that belong to the keyword library as keywords. For example, assume that the target text is "I am very angry", where the word "anger" is recorded in advance in the keyword library; the server can therefore extract "anger" as a keyword.
In addition, after extracting each keyword, the server also needs to record the voice segment corresponding to each of the keywords. By way of example, assume that the target text "I am very angry" corresponds to the time period 0:00-0:04 in the target speech, and that the word "anger" corresponds to the time period 0:03-0:04; therefore, the voice segment corresponding to the keyword "anger" is 0:03-0:04.
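For illustration only (not part of the claimed method), the following Python sketch shows one way keyword extraction with per-keyword time spans could be performed, assuming the speech recognizer in step 102 returns word-level timestamps; the keyword library contents and the transcript structure are hypothetical.

```python
# A minimal sketch (assumptions: the recognizer returns word-level timestamps; the
# keyword library is a plain set of words) of extracting keywords from the target
# text and recording the speech segment time span corresponding to each keyword.
KEYWORD_LIBRARY = {"anger", "happy", "sad"}   # hypothetical preset keyword library

def extract_keywords(transcript):
    """transcript: list of dicts like {"word": "anger", "start": 3.0, "end": 4.0}."""
    keywords = []
    for item in transcript:
        if item["word"] in KEYWORD_LIBRARY:
            # record the keyword together with the time span of its speech segment
            keywords.append({"keyword": item["word"],
                             "segment": (item["start"], item["end"])})
    return keywords

# e.g. for the sentence "I am very angry" spanning 0:00-0:04, the keyword at
# 0:03-0:04 would be returned with segment (3.0, 4.0).
```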
104. Determining a feature vector corresponding to each keyword according to the voice fragment corresponding to each keyword;
In this embodiment, after determining each keyword and the voice segment corresponding to each keyword, the server may determine the feature vector corresponding to each keyword according to the voice segment corresponding to it; that is, based on the fact that each voice segment has its own voice features, the server extracts voice features from the voice segment corresponding to each keyword and uses them as the feature vector corresponding to that keyword. For example, the word "anger" corresponds to the speech segment 0:03-0:04; the server extracts feature information from the signal of this 0:03-0:04 speech segment, vectorizes the feature information, and then uses it as the feature vector corresponding to the word "anger".
For ease of understanding, as shown in fig. 3, further, step 104 may specifically include:
201. according to a preset first formula, calculating short-time energy of a voice fragment corresponding to each keyword, wherein the first formula is as follows:
wherein E is short-time energy, x (m) is a sampling value obtained by sampling in a preset sampling window w (m), and w (m) is expressed as follows:
202. Calculating short-time energy variance of the voice fragments corresponding to each keyword according to a preset second formula, wherein the second formula is as follows:
wherein,is the short-time energy variance;
203. calculating a short-time autocorrelation coefficient of a voice segment corresponding to each keyword according to a preset third formula, wherein the third formula is as follows:
wherein, R (len) is a short-time autocorrelation coefficient, len is a signal length intercepted by using a short-time window at an Nth sample point of the voice fragment;
204. calculating the short-time average zero-crossing rate of the voice fragments corresponding to each keyword according to a preset fourth formula, wherein the fourth formula is as follows:
wherein Z (m) is a short-time average zero-crossing rate;
205. calculating the mel cepstrum coefficient of the voice fragment corresponding to each keyword;
206. and forming a feature vector corresponding to each keyword according to the short-time energy, the short-time energy variance, the short-time autocorrelation coefficient, the short-time average zero-crossing rate and the Mel cepstrum coefficient obtained through calculation.
For the above step 201, the server may sample the time-domain signal of the speech segment by windowing and framing, where the sampling window is w(m) and the sampled value is x(m); the short-time energy of the speech segment can then be calculated through the first formula. Here, w(m) may be a rectangular window, that is, a window that takes the value 1 inside the window and 0 outside it.
For the above step 202, it can be appreciated that, after calculating the short-time energy of the voice segment corresponding to each keyword, the server can substitute it into the second formula to calculate the short-time energy variance of the voice segment corresponding to each keyword.
For the above step 203, the short-time autocorrelation is computed on a segment of the signal truncated by a short-time window at the Nth sample point of the speech segment, where len represents the length of the truncated signal; the server may apply the third formula to calculate the short-time autocorrelation coefficient of the voice segment corresponding to each keyword.
For the above step 204, the short-time average zero-crossing rate represents the number of times the waveform of the speech signal crosses the horizontal axis (zero level) within one frame of speech; the server may apply the fourth formula to calculate the short-time average zero-crossing rate of the voice segment corresponding to each keyword.
For step 205 above, the Mel-frequency cepstrum (Mel-Frequency Cepstrum) is a linear transform of the logarithmic energy spectrum based on the nonlinear Mel scale of sound frequency. The server may calculate the Mel cepstrum coefficient of the voice segment corresponding to each keyword using a fifth formula, in which X_i denotes the output energy of the i-th filter and Num denotes the number of filters.
For each keyword in the above step 206, the short-time energy, short-time energy variance, short-time autocorrelation coefficient, short-time average zero-crossing rate and Mel cepstrum coefficient obtained by calculation may be sequentially denoted as σ1, σ2, σ3, σ4 and σ5, and the server may then form the feature vector X = (σ1, σ2, σ3, σ4, σ5) corresponding to that keyword.
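For illustration only, the following Python sketch shows one possible way to assemble such a five-component feature vector X = (σ1, σ2, σ3, σ4, σ5). It uses the standard textbook definitions of short-time energy, energy variance, autocorrelation and zero-crossing rate, which may differ in detail from the patent's first to fifth formulas; the frame length, hop size, autocorrelation lag, the use of librosa for the Mel cepstrum, and the reduction of the MFCC matrix to a single coefficient are all assumptions.

```python
# A minimal sketch (not the patent's reference implementation) of step 104 / steps
# 201-206: building a five-component feature vector for one keyword's speech segment.
import numpy as np
import librosa  # used only for the Mel cepstrum coefficients

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping rectangular-window frames."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

def keyword_feature_vector(segment, sr=16000, lag=1):
    segment = np.asarray(segment, dtype=float)
    frames = frame_signal(segment)
    # sigma1: short-time energy (mean over frames of the framewise energy)
    energies = np.sum(frames ** 2, axis=1)
    sigma1 = float(np.mean(energies))
    # sigma2: short-time energy variance across frames
    sigma2 = float(np.var(energies))
    # sigma3: short-time autocorrelation coefficient at a fixed lag (normalised)
    ac = np.sum(frames[:, :-lag] * frames[:, lag:], axis=1)
    sigma3 = float(np.mean(ac / (energies + 1e-10)))
    # sigma4: short-time average zero-crossing rate
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) / 2, axis=1)
    sigma4 = float(np.mean(zcr))
    # sigma5: a single Mel cepstrum statistic (mean of the first MFCC), an assumption
    mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=13)
    sigma5 = float(np.mean(mfcc[0]))
    return np.array([sigma1, sigma2, sigma3, sigma4, sigma5])
```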
105. After the feature vectors corresponding to the keywords are obtained, clustering the feature vectors corresponding to the keywords until preset conditions are met, and obtaining vector sets after clustering, wherein each vector set comprises more than one feature vector;
In order to reduce the burden of emotion marking, this embodiment takes advantage of the fact that the features corresponding to voice signals with the same emotion share common characteristics, so a clustering method can be used to cluster voice fragments that are similar in the emotion dimensions. Therefore, the server can cluster the feature vectors corresponding to the keywords until a preset condition is met, and each vector set is obtained after clustering, wherein each vector set comprises more than one feature vector. It will be appreciated that clustering these feature vectors requires a termination condition, that is, a limit that stops the clustering of the feature vectors at a certain point. In this embodiment, the server may preset such a condition, which can be determined according to the actual situation.
For ease of understanding, the clustering process and the termination conditions for clustering are described in detail below. As shown in fig. 4, further, step 105 may specifically include:
301. determining the feature vectors corresponding to the keywords as the initial clusters;
302. for each cluster, respectively calculating the distance between each cluster and each other cluster;
303. combining one other cluster nearest to each cluster with each cluster to obtain a new cluster;
304. judging whether each current cluster meets the preset condition, if not, returning to execute the steps 302 and 303, and if yes, executing the step 305;
305. the current clusters are determined as respective sets of vectors.
As to the above step 301, it may be understood that the clusters are the objects for which clustering is performed, and in this embodiment, the feature vectors corresponding to the keywords may be determined as the initial clusters.
For the above step 302, when calculating the distance between two clusters, the server may perform vectorization processing on the clusters to obtain cluster vectors, and then calculate the distance between the two cluster vectors. It will be appreciated that the smaller the distance between two cluster vectors, the more similar the two clusters, i.e. the more similar the feature vectors they contain; conversely, the larger the distance between two cluster vectors, the less similar the two clusters, i.e. the less similar the feature vectors they contain. In performing step 302, the server may calculate, for one cluster, its distances to the other clusters, then calculate the distances between the next cluster and the other clusters, and so on, thereby calculating the distances between all clusters.
For the above step 303, after calculating the distance between every two clusters, the server may combine two clusters that are close to each other; when combining, the pair of clusters with the smallest distance between them is generally merged preferentially, and the two clusters are merged to obtain a new cluster.
As to the above steps 304 and 305, it will be understood that, by repeatedly performing the above steps 302 and 303, the number of clusters becomes smaller and smaller through successive merging, and at the same time the distance between any two remaining clusters becomes larger and larger. The server may therefore set the preset condition, i.e. the termination condition of the clustering, with respect to either the distance between any two clusters or the total number of clusters after clustering; the specific termination conditions are described below. When the current clusters satisfy the preset condition, the server may determine the current clusters as the vector sets.
For ease of understanding, the above-described step 304 may include, in particular, the following steps 401-403 and/or the following steps 404-406.
401. Judging whether the number of the current clusters is smaller than or equal to a preset second number threshold value;
402. if the number of the current clusters is greater than a preset second number threshold, determining that the current clusters do not meet preset conditions;
403. If the number of the current clusters is smaller than or equal to a preset second number threshold, determining that the current clusters meet preset conditions;
or alternatively
404. Judging whether the distances between any two clusters in the current clusters are all greater than a preset distance threshold;
405. if the distances are not all greater than the preset distance threshold, determining that the current clusters do not meet the preset condition;
406. if the distances between any two clusters in the current clusters are all greater than the preset distance threshold, determining that the current clusters meet the preset condition.
For the above steps 401 to 403, it may be understood that the server may preset a second number threshold as a quantization standard of the clustering degree, and the server may determine whether the number of the current clusters is less than or equal to the preset second number threshold, if the number of the current clusters is greater than the preset second number threshold, it indicates that the number of the clusters is still greater, and the clustering degree of the feature vectors is not enough, so it may be determined that the current clusters do not meet the preset condition, and return to execute the above steps 302 and 303; otherwise, if the number of the current clusters is smaller than or equal to the preset second number threshold, the number of the clusters is up to the standard, and the clustering degree of the feature vectors is enough, so that the current clusters can be determined to meet the preset conditions, and the current clusters are determined to be the vector sets.
For the above steps 404-406, it may be understood that the server may preset a distance threshold between any two clusters as an index of the clustering degree, where the distance threshold defines whether the current clustering degree of the clusters meets the server's requirement. Specifically, the server may determine whether the distances between any two of the current clusters are all greater than the preset distance threshold. If the distances are not all greater than the preset distance threshold, there is at least one pair of clusters that are still close enough together, that is, the feature vectors are not sufficiently clustered, so it can be determined that the current clusters do not meet the preset condition, and the above steps 302 and 303 are executed again. Otherwise, if the distances between any two of the current clusters are all greater than the preset distance threshold, all clusters that were sufficiently close to each other have already been merged and the remaining clusters are far apart, which means the feature vectors are sufficiently clustered; it can therefore be determined that the current clusters meet the preset condition, and the current clusters are determined as the vector sets.
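For illustration only, the following Python sketch outlines an agglomerative clustering loop corresponding to steps 301-305 with the two termination conditions of steps 401-406; the use of Euclidean centroid distance as the cluster distance and closest-pair-first merging are assumptions, since no particular distance measure is fixed here.

```python
# A minimal sketch of steps 301-305 with the termination conditions of steps 401-406.
# Assumptions: clusters are compared by the Euclidean distance between their centroids,
# and the closest pair of clusters is merged first.
import numpy as np

def cluster_feature_vectors(vectors, max_clusters=None, dist_threshold=None):
    # step 301: each feature vector starts as its own initial cluster
    clusters = [[np.asarray(v, dtype=float)] for v in vectors]

    def centroid(c):
        return np.mean(np.stack(c), axis=0)

    while len(clusters) > 1:
        # step 304, first variant: stop once the cluster count reaches the second number threshold
        if max_clusters is not None and len(clusters) <= max_clusters:
            break
        # step 302: distance between every pair of clusters
        cents = [centroid(c) for c in clusters]
        best = min(((np.linalg.norm(cents[i] - cents[j]), i, j)
                    for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                   key=lambda t: t[0])
        # step 304, second variant: stop when every remaining pair of clusters is
        # farther apart than the preset distance threshold
        if dist_threshold is not None and best[0] > dist_threshold:
            break
        # step 303: merge the two closest clusters into one new cluster
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    # step 305: the surviving clusters are the vector sets
    return clusters
```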
106. Randomly extracting a first number of feature vectors from each of said respective sets of vectors for each set of vectors;
the server obtains each vector set to stop vomiting, randomly extracts a first number of feature vectors from each vector set, the feature vectors extracted from one vector set represent the vector set, the extracted feature vectors are subsequently provided for labeling personnel to perform emotion labeling, and the labeled scoring value is also used as an emotion labeling value of the vector set. It will be appreciated that in the same vector set, the speech segments corresponding to each vector set may be considered to be similar in emotion, and thus emotion marking by sampling is feasible.
It should be noted that the first number may be set according to practical situations, for example, 5 may be taken, that is, 5 feature vectors are taken from each vector set. If the number of the feature vectors contained in a certain vector set is less than 5, extracting all the feature vectors in the vector set.
107. And obtaining scoring values of voice sentences corresponding to each vector set by labeling personnel in each appointed emotion dimension as emotion labeling values of each vector set, wherein the voice sentences corresponding to each vector set refer to voice fragments of the complete sentences in which the keywords corresponding to the feature vectors extracted from each vector set are respectively located.
For each vector set, after randomly extracting a first number of feature vectors from it, the server can send the voice fragments of the complete sentences in which the keywords corresponding to the extracted feature vectors are located to the labeling personnel for emotion labeling. As can be seen, the voice fragments sent to the labeling personnel are determined through clustering and sampling, so the overall workload of the labeling personnel is greatly reduced. When labeling, the labeling personnel are required to score each voice sentence as a whole in the designated emotion dimensions, so as to obtain scoring values, and the scoring values are fed back to the server. After obtaining the scoring values for the speech sentences, the server can take the scoring values as the emotion marking values of each vector set.
It should be noted that, for a keyword or a feature vector, the corresponding speech sentence refers to the speech segment of the complete sentence in which the keyword is located. For example, the speech segment corresponding to the keyword "anger" is 0:03-0:04, but that segment is not a complete sentence; the complete sentence containing "anger" is "I am very angry", and therefore the speech sentence corresponding to the keyword "anger" is the 0:00-0:04 speech segment.
In addition, since the server may randomly extract more than one feature vector from a vector set, the extracted feature vectors may belong to different speech sentences. For example, if three feature vectors a, b and c are randomly extracted from the same vector set and their corresponding speech sentences are speech sentence 1, speech sentence 2 and speech sentence 3 respectively, the labeling personnel are required to score the three speech sentences respectively, obtaining 3 scoring values. In this case, the server may calculate the mean of the 3 scoring values as the emotion markup value of the vector set. Of course, the 3 scoring values may also be determined directly as emotion markup values of the vector set.
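For illustration only, the following Python sketch shows sampling up to a first number of feature vectors from a vector set and averaging the annotators' scores per emotion dimension, as described above; the default first number of 5 follows the earlier example, and the score dictionary layout is an assumption.

```python
# A minimal sketch of steps 106-107: randomly draw up to a first number of feature
# vectors from each vector set (all of them when the set holds fewer), collect the
# annotators' scores for the corresponding speech sentences, and use the mean score
# per emotion dimension as the set's emotion markup value.
import random
import numpy as np

def sample_vectors(vector_set, first_number=5):
    # extract all vectors when the set contains fewer than first_number of them
    k = min(first_number, len(vector_set))
    return random.sample(list(vector_set), k)

def emotion_markup_value(scores):
    """scores: one dict per scored speech sentence, e.g.
    [{"valence": 2.0, "arousal": 4.5}, {"valence": 1.5, "arousal": 4.0}]"""
    dims = scores[0].keys()
    return {d: float(np.mean([s[d] for s in scores])) for d in dims}
```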
In this embodiment, the specified emotion dimensions may include, for example, a valence dimension (Valence) and an activation dimension (Arousal), where the valence dimension mainly represents the emotion subject's perception of the emotion and indicates the positive or negative degree of the emotion, such as the degree of liking or disliking; the activation dimension refers to the degree of activation of body energy associated with an emotional state and is a measure of the intrinsic energy of the emotion, i.e., the intensity of the emotion.
In this embodiment, in order to make it convenient for the labeling personnel to query information such as feature vectors and voice fragments, and for other personnel to use and retrieve the voice data after emotion labeling is completed by the method, a data record can be formed for the related information of each keyword. Each data record contains several pieces of information about the keyword, and when any piece of information is searched or queried, the whole data record is returned, which is convenient for personnel to review. As shown in fig. 5, further, after extracting each keyword in the target text, the method further includes:
501. recording the time point of each keyword in the target voice and the corresponding voice statement;
502. forming a data record corresponding to each keyword by the time point, the voice sentence, the voice fragment and the feature vector corresponding to each keyword;
503. when a data query request aiming at a specified keyword is received, the data record corresponding to the specified keyword and the emotion marking value are fed back to a requester of the data query request.
For the above step 501, in this embodiment, after extracting each keyword in the target text, the server may record the time point of each keyword in the target voice; for example, the time point corresponding to "anger" in the above example is 0:03-0:04. It also records the voice sentence corresponding to each keyword in the target voice; for example, the voice sentence corresponding to the keyword "anger" is 0:00-0:04.
For the above step 502, it may be understood that the server may form the time point, the voice sentence, the voice segment and the feature vector corresponding to each keyword into a data record corresponding to that keyword. The time point may be denoted as Time, the voice sentence corresponding to the keyword as Speech, the voice segment corresponding to the keyword as Speech_cut, and the feature vector as X, so the data record corresponding to the keyword is T = (Time, Speech, Speech_cut, X).
For the above step 503, it may be known that, after the server composes the data record corresponding to each keyword and completes the emotion marking on each vector set, when the server receives the data query request for the specified keyword, the server may query the data record of the specified keyword and the emotion marking value corresponding to the feature vector X in the data record, and then the server may feed back the data record corresponding to the specified keyword and the emotion marking value to the requester of the data query request. Therefore, when a requester inquires a certain keyword, the requester can conveniently and quickly inquire the data record and the emotion marking value corresponding to the keyword, and the information such as the time point, the voice sentence, the voice fragment, the feature vector and the like corresponding to the keyword is recorded in the data record, so that the requester can be helped to know and apply the voice data and the information.
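For illustration only, the following Python sketch shows one possible shape for the data record T = (Time, Speech, Speech_cut, X) and a keyword lookup that feeds back the records together with their emotion marking values; the field types and the in-memory index are assumptions.

```python
# A minimal sketch of the data record from steps 501-503 and a query that returns the
# records and emotion markup values for a specified keyword.
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class KeywordRecord:
    time: tuple                     # (start_s, end_s) of the keyword in the target voice
    speech: tuple                   # (start_s, end_s) of the complete sentence containing it
    speech_cut: np.ndarray          # samples of the keyword's own speech segment
    x: np.ndarray                   # the feature vector X
    emotion: Optional[dict] = None  # filled in once the keyword's vector set is scored

index = {}   # keyword -> list of KeywordRecord

def query(keyword):
    """Feed back (data record, emotion markup value) pairs for the specified keyword."""
    return [(r, r.emotion) for r in index.get(keyword, [])]
```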
In the embodiment of the invention, firstly, the target voice to be marked with emotion is obtained; then, voice recognition is carried out on the target voice to obtain a target text; each keyword in the target text is extracted, and the voice fragment corresponding to each keyword is recorded; the feature vector corresponding to each keyword is determined according to the voice fragment corresponding to each keyword; after the feature vectors corresponding to the keywords are obtained, the feature vectors are clustered until a preset condition is met, and vector sets are obtained after clustering, wherein each vector set comprises more than one feature vector; for each vector set, a first number of feature vectors is randomly extracted from that vector set; and finally, scoring values given as a whole by labeling personnel, in each appointed emotion dimension, for the voice sentences corresponding to each vector set are obtained as the emotion labeling values of each vector set, wherein the voice sentences corresponding to each vector set refer to the voice fragments of the complete sentences in which the keywords corresponding to the extracted feature vectors are respectively located. Before emotion marking, the invention uses a clustering method to aggregate voice fragments with similar feature vectors, and after aggregation the voice fragments are labeled manually in the appointed emotion dimensions, thereby completing emotion marking of the voice data. This greatly reduces the labeling workload, and dividing emotion into dimensions helps staff accurately recognize the emotion in the voice and give annotation scores, which reduces deviation to a certain extent and improves the accuracy of emotion marking of voice data.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not limit the implementation process of the embodiments of the present invention.
In an embodiment, a voice data emotion marking device is provided, where the voice data emotion marking device corresponds to the voice data emotion marking method in the above embodiment one by one. As shown in fig. 6, the voice data emotion marking device includes a target voice acquisition module 601, a voice recognition module 602, a keyword extraction module 603, a feature vector determination module 604, a feature vector clustering module 605, a random extraction module 606, and a scoring value acquisition module 607. The functional modules are described in detail as follows:
the target voice acquisition module 601 is configured to acquire target voice to be marked with emotion;
the voice recognition module 602 is configured to perform voice recognition on the target voice to obtain a target text;
a keyword extraction module 603, configured to extract each keyword in the target text, and record a speech segment corresponding to each keyword;
the feature vector determining module 604 is configured to determine a feature vector corresponding to each keyword according to a speech segment corresponding to each keyword;
The feature vector clustering module 605 is configured to cluster feature vectors corresponding to the keywords after obtaining feature vectors corresponding to the keywords, until a preset condition is met, obtain each vector set after clustering, where each vector set includes more than one feature vector;
a random extraction module 606 for randomly extracting, for each of the respective vector sets, a first number of feature vectors from the each vector set;
the scoring value obtaining module 607 is configured to obtain scoring values of the voice sentences corresponding to each vector set by the labeling personnel in each designated emotion dimension as a whole, where the voice sentences corresponding to each vector set refer to voice fragments of complete sentences where each keyword corresponding to each feature vector extracted from each vector set is located.
As shown in fig. 7, further, the feature vector clustering module 605 may include:
an initial cluster determining unit 6051 configured to determine feature vectors corresponding to the respective keywords as respective initial clusters;
a cluster distance calculation unit 6052 for calculating, for each cluster, a distance between each cluster and each other cluster, respectively;
A cluster merging unit 6053, configured to merge one other cluster nearest to each cluster with each cluster to obtain a new cluster;
a cluster judgment unit 6054 for judging whether or not the current clusters satisfy preset conditions;
a first processing unit 6055 configured to trigger the cluster distance calculating unit and the cluster merging unit if the judgment result of the cluster judging unit is no;
and a second processing unit 6056, configured to determine each current cluster as each vector set if the cluster determination unit determines that the determination result is yes.
As shown in fig. 8, further, the cluster judgment unit 6054 may include:
a cluster number judging subunit 0541, configured to judge whether the number of each current cluster is less than or equal to a preset second number threshold;
a first determining subunit 0542, configured to determine that each current cluster does not meet a preset condition if the determination result of the cluster number determining subunit is no;
a second determining subunit 0543, configured to determine that each current cluster satisfies a preset condition if the determination result of the cluster number determining subunit is yes;
or alternatively
A cluster distance unit subunit 0544, configured to determine whether the distances between any two clusters in the current clusters are all greater than a preset distance threshold;
A third determining subunit 0545, configured to determine that each current cluster does not meet a preset condition if the determination result of the cluster distance unit subunit is no;
and a fourth determining subunit 0546, configured to determine that each current cluster satisfies a preset condition if the cluster distance unit subunit has a yes judgment result.
Further, the feature vector determining module may include:
the short-time energy calculation unit is used for calculating the short-time energy of the voice segment corresponding to each keyword according to a preset first formula, wherein the first formula is as follows:
wherein E is short-time energy, x (m) is a sampling value obtained by sampling in a preset sampling window w (m), and w (m) is expressed as follows:
the energy variance calculating unit is configured to calculate a short-time energy variance of a speech segment corresponding to each keyword according to a preset second formula, where the second formula is:
wherein,is the short-time energy variance;
the autocorrelation coefficient calculation unit is configured to calculate a short-time autocorrelation coefficient of a speech segment corresponding to each keyword according to a preset third formula, where the third formula is:
wherein, R (len) is a short-time autocorrelation coefficient, len is a signal length intercepted by using a short-time window at an Nth sample point of the voice fragment;
The average zero-crossing rate calculation unit is configured to calculate a short-time average zero-crossing rate of a voice segment corresponding to each keyword according to a preset fourth formula, where the fourth formula is:
wherein Z (m) is a short-time average zero-crossing rate;
the mel-frequency cepstrum coefficient calculation unit is used for calculating the mel-frequency cepstrum coefficient of the voice fragment corresponding to each keyword;
and the feature vector composition unit is used for composing the feature vector corresponding to each keyword according to the short-time energy, the short-time energy variance, the short-time autocorrelation coefficient, the short-time average zero-crossing rate and the Mel cepstrum coefficient obtained by calculation.
Further, the voice data emotion marking device may further include:
the recording module is used for recording the time point of each keyword in the target voice and the corresponding voice statement;
the data record composition module is used for composing the time point, the voice statement, the voice fragment and the characteristic vector corresponding to each keyword into one data record corresponding to each keyword;
and the feedback module is used for feeding back the data record corresponding to the specified keyword and the emotion marking value to a requester of the data query request when the data query request aiming at the specified keyword is received.
For specific limitation of the voice data emotion marking device, reference may be made to the limitation of the voice data emotion marking method hereinabove, and the description thereof will not be repeated here. All or part of each module in the voice data emotion marking device can be realized by software, hardware and a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 9. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing data related to the emotion marking method of the voice data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program when executed by a processor implements a voice data emotion markup method.
In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the computer program to implement the steps of the voice data emotion marking method in the above embodiment, such as steps 101 to 107 shown in fig. 2. Alternatively, the processor may implement the functions of each module/unit of the voice data emotion marking device in the above embodiment, such as the functions of the modules 601 to 607 shown in fig. 6, when executing the computer program. In order to avoid repetition, a description thereof is omitted.
In one embodiment, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the steps of the voice data emotion markup method of the above embodiment, such as steps 101 to 107 shown in fig. 2. Alternatively, the computer program when executed by the processor implements the functions of the modules/units of the voice data emotion marking apparatus in the above embodiment, such as the functions of the modules 601 to 607 shown in fig. 6. In order to avoid repetition, a description thereof is omitted.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention.

Claims (10)

1. The emotion marking method for the voice data is characterized by comprising the following steps of:
acquiring target voice to be marked with emotion;
performing voice recognition on the target voice to obtain a target text;
Extracting each keyword in the target text, and recording a voice fragment corresponding to each keyword;
determining a feature vector corresponding to each keyword according to the voice fragment corresponding to each keyword;
after the feature vectors corresponding to the keywords are obtained, clustering the feature vectors corresponding to the keywords until preset conditions are met, and obtaining vector sets after clustering, wherein each vector set comprises more than one feature vector;
randomly extracting a first number of feature vectors from each of said respective sets of vectors for each set of vectors;
and obtaining scoring values of voice sentences corresponding to each vector set by labeling personnel in each appointed emotion dimension as emotion labeling values of each vector set, wherein the voice sentences corresponding to each vector set refer to voice fragments of the complete sentences in which the keywords corresponding to the feature vectors extracted from each vector set are respectively located.
2. The method for labeling emotion of voice data according to claim 1, wherein clustering the feature vectors corresponding to the keywords until a preset condition is satisfied comprises:
Determining the feature vectors corresponding to the keywords as initial clusters;
for each cluster, respectively calculating the distance between each cluster and each other cluster;
combining one other cluster nearest to each cluster with each cluster to obtain a new cluster;
judging whether each current cluster meets preset conditions or not;
if the current clusters do not meet the preset conditions, returning to execute the step of respectively calculating the distance between each cluster and each other cluster for each cluster, and executing the step of merging one other cluster closest to each cluster with each cluster to obtain a new cluster;
if the current clusters meet the preset conditions, determining the current clusters as vector sets.
3. The method for emotion marking of voice data according to claim 2, wherein said determining whether each current cluster satisfies a preset condition comprises:
judging whether the number of the current clusters is smaller than or equal to a preset second number threshold value;
if the number of the current clusters is greater than a preset second number threshold, determining that the current clusters do not meet preset conditions;
If the number of the current clusters is smaller than or equal to a preset second number threshold, determining that the current clusters meet preset conditions;
or alternatively
Judging whether the distance between any two clusters in the current clusters is larger than a preset distance threshold value or not;
if the distances between any two clusters in the current clusters are not all greater than a preset distance threshold, determining that the current clusters do not meet the preset condition;
if the distance between any two clusters in the current clusters is larger than a preset distance threshold value, determining that the current clusters meet preset conditions.
4. The method for emotion marking of voice data according to claim 1, wherein said determining the feature vector corresponding to each keyword according to the voice segment corresponding to each keyword comprises:
according to a preset first formula, calculating short-time energy of a voice fragment corresponding to each keyword, wherein the first formula is as follows:
wherein E is short-time energy, x (m) is a sampling value obtained by sampling in a preset sampling window w (m), and w (m) is expressed as follows:
calculating short-time energy variance of the voice fragments corresponding to each keyword according to a preset second formula, wherein the second formula is as follows:
Wherein,is the short-time energy variance;
calculating a short-time autocorrelation coefficient of a voice segment corresponding to each keyword according to a preset third formula, wherein the third formula is as follows:
wherein, R (len) is a short-time autocorrelation coefficient, len is a signal length intercepted by using a short-time window at an Nth sample point of the voice fragment;
calculating the short-time average zero-crossing rate of the voice fragments corresponding to each keyword according to a preset fourth formula, wherein the fourth formula is as follows:
wherein Z (m) is a short-time average zero-crossing rate;
calculating the mel cepstrum coefficient of the voice fragment corresponding to each keyword;
and forming a feature vector corresponding to each keyword according to the short-time energy, the short-time energy variance, the short-time autocorrelation coefficient, the short-time average zero-crossing rate and the Mel cepstrum coefficient obtained through calculation.
5. The voice data emotion markup method according to any one of claims 1 to 4, characterized by further comprising, after extracting respective keywords in said target text:
recording the time point of each keyword in the target voice and the corresponding voice statement;
forming a data record corresponding to each keyword by the time point, the voice sentence, the voice fragment and the feature vector corresponding to each keyword;
When a data query request aiming at a specified keyword is received, the data record corresponding to the specified keyword and the emotion marking value are fed back to a requester of the data query request.
6. A voice data emotion marking device, comprising:
a target voice acquisition module, configured to acquire a target voice to be marked with emotion;
a voice recognition module, configured to perform voice recognition on the target voice to obtain a target text;
a keyword extraction module, configured to extract the keywords in the target text and record the voice segment corresponding to each keyword;
a feature vector determining module, configured to determine the feature vector corresponding to each keyword according to the voice segment corresponding to that keyword;
a feature vector clustering module, configured to cluster the feature vectors corresponding to the keywords, after the feature vectors corresponding to the keywords are obtained, until a preset condition is met, so as to obtain vector sets, each of which comprises more than one feature vector;
a random extraction module, configured to randomly extract a first number of feature vectors from each of the vector sets; and
a scoring value acquisition module, configured to acquire, as the emotion marking value of each vector set, the scoring value given by annotating personnel, in each specified emotion dimension, to the voice sentences corresponding to that vector set as a whole, wherein the voice sentences corresponding to each vector set are the voice segments of the complete sentences in which the keywords corresponding to the feature vectors extracted from that vector set are respectively located.
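A rough structural sketch of how the modules of claim 6 chain together. Every callable argument (recognize, extract_keywords, featurize, cluster, sample, collect_scores) is a placeholder, since the claim names no concrete recognizer, extractor or clustering routine.

```python
# Structural sketch only: all callables are placeholders standing in for the claimed modules.
def annotate(target_speech, sr, recognize, extract_keywords, featurize,
             cluster, sample, collect_scores, first_number=3):
    """Chain the modules of the device claim end to end."""
    text = recognize(target_speech, sr)                          # voice recognition module
    keywords, segments = extract_keywords(text, target_speech)   # keyword extraction module
    vectors = [featurize(seg, sr) for seg in segments]           # feature vector determining module
    vector_sets = cluster(vectors)                               # feature vector clustering module
    sampled = [sample(vs, first_number) for vs in vector_sets]   # random extraction module
    return collect_scores(sampled)                               # scoring value acquisition module
```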
7. The voice data emotion marking device according to claim 6, wherein the feature vector clustering module comprises:
an initial cluster determining unit, configured to take the feature vector corresponding to each keyword as an initial cluster;
a cluster distance calculating unit, configured to calculate, for each cluster, the distance between that cluster and every other cluster;
a cluster merging unit, configured to merge each cluster with the one other cluster nearest to it to obtain a new cluster;
a cluster judging unit, configured to judge whether the current clusters meet the preset condition;
a first processing unit, configured to trigger the cluster distance calculating unit and the cluster merging unit again if the judgment result of the cluster judging unit is negative; and
a second processing unit, configured to determine the current clusters as the vector sets if the judgment result of the cluster judging unit is positive.
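A hedged sketch of the merging loop performed by these units, reusing clusters_meet_condition from the earlier sketch. One simplification is an assumption rather than the claim's wording: instead of merging every cluster with its nearest neighbour in each pass, the code merges only the single closest pair per iteration, as in textbook agglomerative clustering; centroid distance is likewise an assumed choice.

```python
# Sketch of the agglomerative merging loop; pairwise-merge strategy and centroid
# distance are assumptions, not fixed by the claim.
import numpy as np


def cluster_feature_vectors(vectors, **condition):
    """vectors: list of 1-D feature vectors; returns the final vector sets."""
    # Initial cluster determining unit: each feature vector starts as its own cluster.
    clusters = [np.atleast_2d(v) for v in vectors]

    # Loop until the cluster judging unit's preset condition is met.
    while not clusters_meet_condition(clusters, **condition):
        centroids = np.stack([c.mean(axis=0) for c in clusters])
        # Cluster distance calculating unit: pairwise centroid distances.
        dists = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
        np.fill_diagonal(dists, np.inf)          # ignore zero self-distances
        i, j = np.unravel_index(np.argmin(dists), dists.shape)
        # Cluster merging unit: merge the closest pair into a new cluster.
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

    return clusters
```

For example, cluster_feature_vectors(vectors, max_clusters=5) stops once five vector sets remain.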
8. The voice data emotion marking device according to claim 7, wherein the cluster judging unit comprises:
a cluster number judging subunit, configured to judge whether the number of the current clusters is less than or equal to a preset second number threshold;
a first determining subunit, configured to determine that the current clusters do not meet the preset condition if the judgment result of the cluster number judging subunit is negative;
a second determining subunit, configured to determine that the current clusters meet the preset condition if the judgment result of the cluster number judging subunit is positive;
or
a cluster distance judging subunit, configured to judge whether the distances between any two of the current clusters are all greater than a preset distance threshold;
a third determining subunit, configured to determine that the current clusters do not meet the preset condition if the judgment result of the cluster distance judging subunit is negative; and
a fourth determining subunit, configured to determine that the current clusters meet the preset condition if the judgment result of the cluster distance judging subunit is positive.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the voice data emotion marking method according to any one of claims 1 to 5.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the voice data emotion marking method according to any one of claims 1 to 5.
CN201910279565.0A 2019-04-09 2019-04-09 Voice data emotion marking method and device, computer equipment and storage medium Active CN110047469B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910279565.0A CN110047469B (en) 2019-04-09 2019-04-09 Voice data emotion marking method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110047469A CN110047469A (en) 2019-07-23
CN110047469B true CN110047469B (en) 2023-12-22

Family

ID=67276523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910279565.0A Active CN110047469B (en) 2019-04-09 2019-04-09 Voice data emotion marking method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110047469B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110765942A (en) * 2019-10-23 2020-02-07 睿魔智能科技(深圳)有限公司 Image data labeling method, device, equipment and storage medium
CN111126233B (en) * 2019-12-18 2023-07-21 中国平安财产保险股份有限公司 Call channel construction method and device based on distance value and computer equipment
CN111583968A (en) * 2020-05-25 2020-08-25 桂林电子科技大学 Speech emotion recognition method and system
CN111754979A (en) * 2020-07-21 2020-10-09 南京智金科技创新服务中心 Intelligent voice recognition method and device
CN112765388A (en) * 2021-01-29 2021-05-07 云从科技集团股份有限公司 Target data labeling method, system, equipment and medium
CN114418709A (en) * 2021-12-24 2022-04-29 珠海大横琴科技发展有限公司 Conference data processing method and device
CN115240657A (en) * 2022-07-27 2022-10-25 深圳华策辉弘科技有限公司 Voice processing method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462065A (en) * 2014-12-15 2015-03-25 北京国双科技有限公司 Event emotion type analyzing method and device
CN107195312A (en) * 2017-05-05 2017-09-22 深圳信息职业技术学院 Determination method, device, terminal device and the storage medium of emotional disclosure pattern
CN109271512A (en) * 2018-08-29 2019-01-25 中国平安保险(集团)股份有限公司 The sentiment analysis method, apparatus and storage medium of public sentiment comment information
CN109493881A (en) * 2018-11-22 2019-03-19 北京奇虎科技有限公司 A kind of labeling processing method of audio, device and calculate equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10002192B2 (en) * 2009-09-21 2018-06-19 Voicebase, Inc. Systems and methods for organizing and analyzing audio content derived from media files

Also Published As

Publication number Publication date
CN110047469A (en) 2019-07-23

Similar Documents

Publication Publication Date Title
CN110047469B (en) Voice data emotion marking method and device, computer equipment and storage medium
CN110675288B (en) Intelligent auxiliary judgment method, device, computer equipment and storage medium
CN108766418B (en) Voice endpoint recognition method, device and equipment
CN110120224B (en) Method and device for constructing bird sound recognition model, computer equipment and storage medium
KR101942521B1 (en) Speech endpointing
US9311915B2 (en) Context-based speech recognition
CN108427707B (en) Man-machine question and answer method, device, computer equipment and storage medium
US8756064B2 (en) Method and system for creating frugal speech corpus using internet resources and conventional speech corpus
CN109686383B (en) Voice analysis method, device and storage medium
CN111311327A (en) Service evaluation method, device, equipment and storage medium based on artificial intelligence
CN111797632B (en) Information processing method and device and electronic equipment
WO2021027029A1 (en) Data processing method and device, computer apparatus, and storage medium
CN109979440B (en) Keyword sample determination method, voice recognition method, device, equipment and medium
CN110717021B (en) Input text acquisition and related device in artificial intelligence interview
US20220301547A1 (en) Method for processing audio signal, method for training model, device and medium
CN110890088A (en) Voice information feedback method and device, computer equipment and storage medium
WO2022135414A1 (en) Speech recognition result error correction method and apparatus, and terminal device and storage medium
CN113192516A (en) Voice role segmentation method and device, computer equipment and storage medium
WO2020233381A1 (en) Speech recognition-based service request method and apparatus, and computer device
CN113223532A (en) Quality inspection method and device for customer service call, computer equipment and storage medium
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN112397052B (en) VAD sentence breaking test method, device, computer equipment and storage medium
CN112309398B (en) Method and device for monitoring working time, electronic equipment and storage medium
CN111785302A (en) Speaker separation method and device and electronic equipment
CN113570404B (en) Target user positioning method, device and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant