Disclosure of Invention
The application provides a user portrait construction method and device, wherein a plurality of user characteristic label models including a user identity label model, a gender label model, an age label model and the like are built through deep learning, and judgment results of the user characteristic label models are respectively utilized to judge all labels of the same user, so that the user portrait is constructed. The method provided by the application can accurately finish the user portrait under the condition that the user does not sense.
To this end, the present application provides the following aspects:
In a first aspect, the present application provides a user portrait construction method, including: collecting the i-th user voice, where i = 1, 2, 3, …; acquiring short-time frame voices, the short-time frame voices being generated by framing the i-th user voice; acquiring short-time frame feature vectors, the short-time frame feature vectors being extracted from each short-time frame generated from the i-th user voice; acquiring a voice feature vector, the voice feature vector being generated from all short-time frame feature vectors corresponding to the i-th user voice; generating user feature labels, the user feature labels being generated from the voice feature vector using preset models, the feature labels including a user identity label, a gender label and an age label; if the number of user voices bearing the same user identity label reaches a preset value, acquiring the voices to be portrayed, the voices to be portrayed being all user voices marked with that user identity label; and generating the user portrait of the user from all the gender labels and age labels generated for the voices to be portrayed.
The user portrait construction method provided by the application presets the user identity label model, can automatically collect user voices and construct user portraits, and can generate the final user portrait for the same user from multiple portrait construction results. User portrait construction can thus proceed automatically without the user's awareness, and the user does not need to perform identity authentication deliberately.
In combination with the first aspect, the method further comprises: if the number of user voices bearing the same user identity label is smaller than the preset value, collecting the (i+1)-th user voice; acquiring short-time frame voices, the short-time frame voices being generated by framing the (i+1)-th user voice; acquiring short-time frame feature vectors, the short-time frame feature vectors being extracted from each short-time frame generated from the (i+1)-th user voice; acquiring a voice feature vector, the voice feature vector being generated from all short-time frame feature vectors corresponding to the (i+1)-th user voice; and generating user feature labels, the user feature labels being generated from the voice feature vector using preset models, the feature labels including a user identity label, a gender label and an age label.
In a preferred mode, the acquiring of short-time frame voices includes: dividing the i-th user voice into a plurality of short-time frame voices according to a preset duration; and if the duration of the last short-time frame voice is less than the preset duration, discarding the last short-time frame voice.
In a preferred mode, the acquiring of the short-time frame feature vectors includes: pre-enhancing, frame by frame, the high-frequency part of each short-time frame voice; converting each pre-enhanced short-time frame into a frequency-domain signal; and calculating the short-time frame feature vector of each short-time frame according to the frequency-domain signal, the short-time frame feature vectors including the Mel energy spectrum.
In a preferred mode, the obtaining of the voice feature vector includes: sequentially fusing all short-time frame feature vectors corresponding to the i-th user voice to generate the voice feature vector.
In a preferred mode, the generating of a user identity label from the voice feature vector using a preset voiceprint model includes: generating an identity vector of the voice from the voice feature vector of the i-th user voice using the preset voiceprint model; acquiring all stored identity vectors; calculating the cosine distance between the identity vector of the i-th voice and each stored identity vector; and generating the user identity label according to the cosine distances.
In a preferred mode, the generating a gender tag using a gender model from the speech feature vectors includes: and generating a gender probability by utilizing a preset gender model according to the voice feature vector of the ith user voice, and generating a gender label according to the gender probability.
In a preferred mode, the generating an age tag by using the age model according to the speech feature vector comprises: generating age bracket probability by using a preset age model according to the voice feature vector of the ith user voice; and generating an age label according to the age group probability.
In a preferred mode, if the number of user voices reaches the accumulated preset value, the generating of the user portrait from all the gender labels and age labels generated for the voices to be portrayed includes: acquiring all gender labels and age labels generated for the voices to be portrayed; determining, by majority voting over all the gender labels and age labels, the gender label and age label used to construct the user portrait; and generating the user portrait of the user, the user portrait including the user's identity label and the gender label and age label determined by majority voting.
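The majority-voting step can be sketched as follows. This is an illustrative Python sketch only, not part of the claims; the tag values shown are hypothetical examples of labels collected from the voices to be portrayed.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label among the tags collected
    from all voices marked with the same user identity label."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical tags gathered from the voices to be portrayed
gender_tags = ["male", "male", "female", "male"]
age_tags = ["60-70", "60-70", "40-50", "60-70"]

portrait = {
    "gender": majority_vote(gender_tags),
    "age": majority_vote(age_tags),
}
```

With these example tags, the portrait would record the gender "male" and the age bracket "60-70", since each wins the vote.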
Compared with the prior art, the user portrait construction method can automatically construct the user portrait using the preset user identity label model without the user's awareness, thereby solving the problem that traditional user portrait construction requires the user to perform identity authentication deliberately.
In a second aspect, the present application also provides a user portrait construction apparatus, the apparatus comprising: a user voice collection unit, configured to collect the i-th user voice, where i = 1, 2, 3, …; a short-time frame acquisition unit, configured to acquire short-time frame voices generated by framing the i-th user voice; a feature vector generation unit, configured to acquire short-time frame feature vectors, the short-time frame feature vectors being extracted from each short-time frame generated from the i-th user voice; the feature vector generation unit being further configured to acquire a voice feature vector generated from all short-time frame feature vectors corresponding to the i-th user voice; a user feature label generation unit, configured to generate user feature labels from the voice feature vector using preset models, the feature labels including a user identity label, a gender label and an age label; and a user portrait construction unit, configured to acquire the voices to be portrayed if the number of user voices bearing the same user identity label reaches a preset value, the voices to be portrayed being all user voices marked with that user identity label, and further configured to generate the user portrait of the user from all the gender labels and age labels generated for the voices to be portrayed.
In a third aspect, the application further provides a user portrait construction terminal, where the terminal includes a voice acquisition device and the user portrait construction device of the second aspect.
In a fourth aspect, the present application further provides a computer storage medium, which may store a program; when the program is executed, some or all of the steps of the user portrait construction method according to the first aspect may be implemented.
In a fifth aspect, the present application further provides a terminal, including: a transceiver, a processor and a memory, the processor being capable of executing a program or instructions stored in the memory to thereby implement a method as described in the first aspect.
In a sixth aspect, the present application also provides a computer program which, when run on a computer, causes the computer to perform the method of the first aspect.
In a seventh aspect, the present application further provides a method for providing personalized services based on the user portrait construction method of the first aspect, where the method includes: acquiring a user voice, and generating a user feature label according to the user voice; retrieving the user portrait corresponding to the user feature label; and sending an instruction for providing personalized services to the smart device according to the user portrait.
The method for providing personalized services is based on the perception-free user portrait construction method of the first aspect, so the user experience is improved: the steps of deliberate identity authentication and recognition are omitted, and service can still be provided to a new user according to a preset mode.
With reference to the seventh aspect, in an implementable manner, the method further comprises: if the user portrait is not retrieved, a service is provided according to a preset mode, and the user portrait of the user is generated according to the user voice by using the method of the first aspect.
In an eighth aspect, the present application further provides a personalized service providing apparatus, including: a user voice acquisition module, configured to acquire a user voice and generate a user feature label according to the user voice; a user portrait retrieval module, configured to retrieve the user portrait corresponding to the user feature label; and a service instruction generation module, configured to send an instruction for providing personalized services to the smart device according to the user portrait.
In a ninth aspect, the present application further provides a personalized service providing terminal, where the terminal includes a voice collecting device, and the personalized service providing device according to the eighth aspect.
In a tenth aspect, the present application further provides a computer storage medium, which may store a program; when the program is executed, some or all of the steps of the personalized service providing method according to the seventh aspect may be implemented.
In an eleventh aspect, the present application further provides a terminal, including: a transceiver, a processor, and a memory, wherein the processor can execute the programs or instructions stored in the memory to implement the method according to the seventh aspect.
Detailed Description
The features and advantages of the present invention will become more apparent and appreciated from the following detailed description of the invention.
The present invention is described in detail below.
With the development of smart devices, more and more smart devices are used in daily life, such as smart televisions and smart speakers. These devices can provide corresponding services according to users' instructions, and can even provide personalized services for a particular user. For example, a family may include several members of different genders and ages, each with different preferences for television programs; when different family members each issue the instruction "please turn to my favorite program", the smart device can select different television channels according to each member's personal preferences. Such personalized services are generally based on user portrait results, while the traditional approach is to portray each user once before the smart device is used: for example, if a family includes 5 members, each of the 5 members must be portrayed separately, entering information such as age, gender and preferences, or providing a specific voice sample or a specific expression to the smart device, from which the device generates the user portrait. If a family member has no user portrait on the smart device, the device cannot provide that member with personalized services.
Because the traditional user portrait method requires the user to go through a dedicated portrait-creation process, and this process is cumbersome, the user experience is poor; some users even abandon creating a portrait, whether to avoid the trouble or for other reasons, so that the personalized service function of the smart device is passively disabled.
The user portrait method provided by the application does not require the user to create a portrait deliberately; it can automatically start portrait construction without the user's awareness, thereby improving the user experience.
In this embodiment, the technical solution of the present application is described by taking user portrait construction for each member of a certain family as an example. In this example, the family includes five members, among them a grandfather (male, age 70), a grandmother (female, age 70), a mom (female, age 40), and a child (male, age 10); the smart device performing user portrait construction is a smart television. The smart television is loaded with a program implementing the method, a program for automatically selecting television channels, a program for automatically searching network television, and at least one shopping APP. The smart television is provided with a voice acquisition device, a processor and a player. The voice acquisition device and the processor may be electrically connected or connected wirelessly, for example via WIFI; the player, used for playing the selected television program, may likewise be electrically or wirelessly connected to the processor. The processor constructs user portraits according to the voice information uploaded by the voice acquisition device and provides personalized services to different users according to the portrait results.
The scheme of the application is a user portrait construction method based on a deep neural network model, and in particular a user portrait construction method obtained through large-scale audio training using a sequence-to-sequence learning model. Such learning models include LSTM-RNN, where LSTM denotes a Long Short-Term Memory network and RNN denotes a Recurrent Neural Network.
In this embodiment, the used models include a preset voiceprint model, a preset gender model and a preset age model, the three models can be established in the model training stage, and the three models can be deep neural network models.
As a machine learning system, the deep-neural-network-based user portrait construction system comprises two stages: model training and model use.
The model training stage is a stage of determining parameters of each module on the neural network through algorithms such as back propagation and the like according to training voice.
The model use stage follows the model training stage: the trained model takes the user voice acquired by the target device as input and draws the user portrait through the calculation of each module in the neural network system, thereby generating a user portrait result.
For ease of understanding, the model training phase is first introduced.
The model training stage mainly determines the parameters of each calculation module in the model. The parameters of the neural network can be represented as (W, b), where W represents the weight matrix and b represents the bias term; thus, the parameters to be determined in the model training stage comprise the weight matrix W and the bias term b.
In this embodiment, a method adopted by model training is described by taking a preset voiceprint model as an example, fig. 1 is a schematic flow chart of the method for training the preset voiceprint model provided in this embodiment, and specifically, with reference to fig. 1, the preset voiceprint model training stage includes:
and S111, acquiring training voice.
In this embodiment, the training speech may be collected live or obtained from a network. The training speech used to train the voiceprint model is pre-labeled at least with identity information, i.e., the speaker of each training voice, and each speaker has a unique label.
Optionally, at least 2 training voices are provided per speaker.
In this embodiment, the training speech may be spoken digits, a fixed phrase, or free text.
And S112, acquiring short-time frame training voice.
In this embodiment, the training speech may be framed into a plurality of speech segments according to a certain time length, i.e., short time frames, each of which may be 20ms in length, for example.
Optionally, in this embodiment, the training speech is framed using a frame-overlap framing method, i.e., two adjacent short-time frames obtained by framing overlap; for example, 1 ms to 20 ms of the training speech forms the first short-time frame, 11 ms to 30 ms forms the second short-time frame, and so on, until all short-time frames of the training speech are obtained.
Further, the frame overlap may last N/2, where N represents the duration of each short-time frame, so as to avoid abrupt energy changes between two adjacent short-time frames. Framing with overlap makes two adjacent short-time frames share an overlap region containing M sampling points, where M is usually 1/2 or 1/3 of N.
Further, if the duration of the last short-time frame is less than the preset frame duration, the last short-time frame is deleted.
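The framing of steps S112, S221 and S222 can be sketched as follows. This is an illustrative Python sketch only; the 16 kHz sampling rate and the 10 ms hop (50% frame overlap) are assumptions chosen to match the 20 ms frame and overlap example above.

```python
def frame_signal(samples, frame_len, hop_len):
    """Split a 1-D sample sequence into overlapping short-time frames.
    A tail segment shorter than frame_len is discarded, mirroring the
    rule that the last under-length short-time frame is deleted."""
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames

# 20 ms frames with a 10 ms hop at a hypothetical 16 kHz sampling rate
sr = 16000
frame_len = int(0.020 * sr)   # 320 samples per short-time frame
hop_len = frame_len // 2      # 160-sample hop -> 50% overlap
frames = frame_signal([0.0] * 1000, frame_len, hop_len)
```

For a 1000-sample input this yields five full 320-sample frames; the incomplete tail is dropped rather than padded.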
And S113, acquiring training short-time frame feature vectors.
In the present embodiment, the short-time frame feature vectors include the Mel energy spectrum, Mel-frequency cepstrum coefficients, and the like. The short-time frame feature vectors are used for establishing the audio-feature-based models and can serve as input information of the models.
In one implementable form, extracting features from each short-time frame to generate the short-time frame feature vectors includes:
s1131, performing pre-enhancement on each short-time frame to generate an enhanced short-time frame.
In this embodiment, before the feature vectors of the short time frame are extracted, the high-frequency signals in the short time frame are enhanced, so as to eliminate vocal cord effect and lip effect generated in the sounding process, further compensate the high-frequency part of the speech signal suppressed by the sounding system, and meanwhile highlight the formants of high frequencies.
Alternatively, the high frequency signal in each short time frame may be emphasized using the formula shown in the following equation (1):
A(n) = S(n) − k·S(n−1)    formula (1)
Wherein, A (n) represents the enhanced signal strength;
s (n) represents the signal strength for the nth millisecond;
s (n-1) represents the signal strength at (n-1) th millisecond;
k represents an enhancement coefficient, and the value range of k is [0,1], in this embodiment, the value of k may be 0.97;
n = 1, 2, 3, …, N, where N is the duration of each short-time frame; for example, for a 20 ms frame, N = 20.
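The pre-enhancement of equation (1) can be sketched as follows, as an illustrative Python sketch; the default k = 0.97 follows the value given above, and the first sample is passed through unchanged (an assumption, since S(0) has no predecessor).

```python
def pre_emphasize(frame, k=0.97):
    """Apply the pre-emphasis filter of formula (1):
    A(n) = S(n) - k * S(n-1), boosting the high-frequency part
    of the short-time frame."""
    out = [frame[0]]  # first sample has no predecessor; keep as-is
    for n in range(1, len(frame)):
        out.append(frame[n] - k * frame[n - 1])
    return out
```

On a constant signal the filter output shrinks to (1 − k) of the input, illustrating how low-frequency (slowly varying) content is attenuated while rapid changes pass through.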
S1132, sequentially converting each enhanced short-time frame into a frequency-domain signal.
In this embodiment, each enhanced short-time frame is converted into a frequency-domain signal by performing an FFT on it. The FFT (fast Fourier transform) is a method of computing the discrete Fourier transform with a computer; any specific FFT implementation in the prior art may be adopted.
After FFT conversion is carried out on each enhanced short-time frame, the time domain signals of the short-time frames can be converted into frequency domain signals, and therefore subsequent processing is facilitated. The time domain signal refers to a signal describing a mathematical function or a physical signal versus time, and the frequency domain signal refers to a signal describing a mathematical function or a physical signal versus frequency.
Since the audio collected in this embodiment is digital audio rather than analog audio, the time-domain signal of each short-time frame can be converted into a frequency-domain signal using the FFT.
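Step S1132 can be sketched as follows. This is an illustrative Python sketch using NumPy; the Hamming window reflects the windowing described later in this section, and the use of a real-input FFT (`rfft`) is an implementation assumption for real-valued audio samples.

```python
import numpy as np

def frame_to_freq(frame):
    """Convert an enhanced short-time frame (time domain) into its
    magnitude spectrum (frequency domain): apply a Hamming window,
    then take the FFT of the real-valued samples."""
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hamming(len(frame))
    return np.abs(np.fft.rfft(windowed))
```

For a 320-sample frame, `rfft` returns 161 frequency bins (N/2 + 1), which is the frequency-domain signal consumed by the Mel filter bank in step S1133.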
And S1133, calculating the Mel energy spectrum of each short-time frame according to the FFT conversion result.
Since the energy of the frequency domain signal obtained by the FFT is not uniform in each frequency band, the mel-frequency energy spectrum can be generated by using a triangular filter.
In this embodiment, the number of the triangular filters may be set according to the requirement, for example, taking this embodiment as an example, 40 triangular filters may be selected.
In one implementation, the present embodiment may employ a triangular filter as shown in FIG. 2.
Further, the frequency spectrum of each short-time frame may be obtained from the frequency-domain signal generated in step S1132. Since the characteristics of a signal are usually difficult to judge from its time-domain form, the signal is typically converted into an energy distribution in the frequency domain for observation, and different energy distributions represent the characteristics of different voices. After each short-time frame is multiplied by a Hamming window, a fast Fourier transform yields the energy distribution on the spectrum; that is, the spectrum of each short-time frame is obtained by performing a fast Fourier transform on each windowed frame signal, and the logarithmic energy under each triangular filter is then generated from this spectrum. Specifically, the logarithmic energy can be calculated by the formula shown in the following formula (2):

S(m) = ln( Σ_{k=0}^{N−1} |X_a(k)|² · H_m(k) ),  0 ≤ m ≤ M    formula (2)

wherein:
S(m) represents the logarithmic energy output by each triangular filter;
m represents the index of the triangular filter (its center frequency);
M represents the number of triangular filters;
N represents the number of points of the Fourier transform;
X_a(k) represents the discrete Fourier transform of the speech signal;
a represents the Hamming window constant, typically 0.46;
H_m(k) represents the frequency response of the m-th triangular (Mel) filter;
k denotes the index of the Fourier transform points.
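The log-energy computation of step S1133 can be sketched as follows. This is an illustrative Python sketch only: the uniform toy filter bank stands in for real triangular Mel filters H_m(k), the 40 filters and 161 bins match the examples in this embodiment, and the small floor added before the logarithm is an implementation assumption to avoid log(0).

```python
import numpy as np

def mel_log_energies(power_spectrum, filterbank):
    """Compute S(m) = ln(sum_k |X_a(k)|^2 * H_m(k)): apply each
    triangular filter row H_m to the power spectrum, then take
    the natural logarithm of each filter's energy."""
    energies = filterbank @ power_spectrum   # one energy per filter
    return np.log(energies + 1e-10)          # floor avoids log(0)

# Toy example: 40 (hypothetical) uniform filters over 161 bins
fbank = np.ones((40, 161)) / 161.0
log_e = mel_log_energies(np.ones(161), fbank)
```

With this toy input each filter's energy is exactly 1, so every logarithmic energy is (near) zero; a real triangular filter bank would weight each frequency bin by its distance from the filter's center frequency.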
Further, a mel-frequency energy spectrum is generated by a complex absolute value method according to the logarithmic energy, and specifically, the mel-frequency energy spectrum can be calculated by using a formula shown in the following formula (3):
and S1134, respectively calculating log Fbank feature vectors of each short-time frame according to the Mel energy spectrum.
In the present embodiment, the log Fbank feature vector is generated by taking the logarithm of the result generated in step S1133. Specifically, the log Fbank feature vector may be calculated according to the formula shown in formula (4), wherein each short-time frame corresponds to one Mel energy spectrum:

log Fbank(m) = log( MelSpec(m) ),  0 ≤ m ≤ M    formula (4)

where MelSpec(m) denotes the Mel energy spectrum value of the m-th filter generated in step S1133.
the embodiment calculates the log Fbank feature vector to amplify the energy difference at low energy and to reduce the energy difference at high energy. The high energy and the low energy respectively refer to the energy amplitude at different frequencies.
In this step, a 40-dimensional log Fbank vector may be generated, taking the foregoing example as an example.
And S1135, generating an MFCC feature vector of each short-time frame according to the log Fbank feature vector.
In this embodiment, the log Fbank feature vector may be subjected to a Discrete Cosine Transform (DCT), generating one MFCC coefficient, i.e., Mel-Frequency Cepstrum Coefficient, per triangular filter used in step S1133.
In this step, by taking the foregoing example as an example, 40 MFCC feature vectors may be generated.
In this step, the method of generating MFCC coefficients by performing discrete cosine transform using a triangular filter and log Fbank eigenvectors in the prior art can be used for calculation.
In this embodiment, all generated MFCC coefficients may be sorted in order of smaller center frequency to larger center frequency, and the first several MFCC coefficients in the sequence may be retained, and the rest MFCC coefficients may be discarded. For example, in this embodiment, if 40 triangular filters used in step S1133 are adopted to generate 40 MFCC coefficients, the generated 40 MFCC coefficients may be sorted according to the order of the center frequency from small to large, and the first 20 MFCC coefficients are retained, while the rest MFCC coefficients are discarded, so as to compress the data.
Optionally, the retained MFCC coefficients may also be subjected to first order difference processing and second order difference processing to generate delta coefficients and delta-delta coefficients. The first order difference processing and the second order difference processing are mathematical methods commonly used in the prior art.
Further, the remaining 20 MFCC coefficients, the first-order difference and the second-order difference are fused to obtain a 60-dimensional vector.
Thus, each short-time frame can be represented by a 40-dimensional log Fbank vector plus a 60-dimensional MFCC series vector, i.e., a 100-dimensional vector.
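The 100-dimensional per-frame representation described above can be sketched as follows. This is an illustrative Python sketch: the simple one-sided difference stands in for the first- and second-order difference processing (the actual delta computation is only identified above as a common prior-art method), and the zero/ramp values are placeholder data.

```python
import numpy as np

def first_difference(c):
    """Simple first-order difference with edge padding — a minimal
    stand-in for the delta processing mentioned above."""
    return np.diff(c, prepend=c[0])

log_fbank = np.zeros(40)                   # 40-dim log Fbank vector
mfcc = np.arange(20, dtype=float)          # 20 retained MFCC coefficients
delta = first_difference(mfcc)             # first-order difference (delta)
delta2 = first_difference(delta)           # second-order difference
frame_vec = np.concatenate([log_fbank, mfcc, delta, delta2])  # 100-dim
```

The concatenation of the 40-dim log Fbank vector with the 60-dim MFCC-series vector (20 coefficients plus their two difference sequences) reproduces the 100-dimensional short-time frame feature vector.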
And S114, acquiring a voice feature vector.
In this embodiment, the speech feature vector is generated by sequentially fusing all short-time frame feature vectors corresponding to the current training speech.
In this embodiment, the fusion is performed by taking the element-wise average.
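The fusion of step S114 can be sketched as follows, as an illustrative Python sketch: all per-frame feature vectors of one utterance are averaged element-wise into a single utterance-level voice feature vector.

```python
import numpy as np

def fuse_frames(frame_vectors):
    """Fuse all short-time frame feature vectors of one utterance
    into a single voice feature vector by element-wise averaging."""
    return np.mean(np.stack(frame_vectors), axis=0)
```

For example, fusing the two toy frame vectors [1, 2] and [3, 4] yields [2, 3]; with real 100-dimensional frame vectors the result is likewise a single 100-dimensional voice feature vector.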
And S115, generating a preset voiceprint model according to the training voice feature vector and the identity information of the training voice.
Each training voice is processed through steps S112 to S114 to generate a corresponding voice feature vector. The voice feature vector serves as the input of the neural network model LSTM-RNN, and the identity information corresponding to the training voice serves as the output. After training on a large amount of training speech, each parameter in the preset voiceprint model is continuously updated and corrected based on the voice features, yielding a relatively mature preset voiceprint model.
In this embodiment, the preset voiceprint model determines whether the current training voice belongs to one of the historical training voices by using the similarity between the current training voice and the historical training voices, if so, marks a label of the historical training voice on the current training voice, and if not, allocates a new label to the current training voice.
In this embodiment, the similarity between the current training speech and the historical training speech may be calculated by using the cosine similarity.
Optionally, the cosine similarity is a similarity commonly used in the art, and the calculation method may adopt a calculation method of cosine similarity commonly used in the prior art.
And continuously training the preset voiceprint model, and continuously updating the cosine similarity threshold.
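The cosine-similarity comparison used to decide whether a voice belongs to a known speaker can be sketched as follows. This is an illustrative Python sketch; the threshold value 0.8 is a hypothetical placeholder, since the embodiment states only that the threshold is continuously updated during training.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two identity vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_identity(new_vec, stored_vecs, threshold=0.8):
    """Return the index of the best-matching stored identity vector
    if its similarity exceeds the (hypothetical) threshold, else
    None — signalling that a new label should be assigned."""
    best_i, best_s = None, threshold
    for i, v in enumerate(stored_vecs):
        s = cosine_similarity(new_vec, v)
        if s > best_s:
            best_i, best_s = i, s
    return best_i
```

If the best similarity stays at or below the threshold, the voice is treated as a new speaker and a fresh identity label is allocated, matching the labeling rule described above.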
And S116, testing the preset voiceprint model by using other training voices, and finishing modeling if the testing accuracy is greater than or equal to the testing result threshold.
In this embodiment, in order to ensure the accuracy of the test result of the built model, at the end of the model building, the preset voiceprint model needs to be tested by using a training speech, and the training speech used for the test is an unused training speech.
And if the accuracy of the preset voiceprint model in the test stage is lower than the test result threshold, continuing to train the preset voiceprint model by using new training voice according to the method until the accuracy of the preset voiceprint model in the test stage is greater than or equal to the test result threshold.
At this point, modeling of the preset voiceprint model based on voice features is complete.
In this embodiment, a method adopted by model training is further described by taking a preset gender model as an example, specifically, the preset gender model training stage includes:
and S121, acquiring training voice.
The same training speech may be used in training different models, or different training speech may be used. Therefore, the training speech used in step S121 may be the same as or different from the training speech used in step S111.
The training speech used to train the gender model is pre-labeled at least with gender information, and each speaker has a unique label.
For a specific implementation of this step, refer to step S111, which is not described herein again.
And S122, acquiring short-time frame training voice.
For a specific implementation of this step, refer to step S112, which is not described herein again.
And S123, acquiring training short-time frame feature vectors.
For a specific implementation of this step, refer to step S113, which is not described herein again.
And S124, acquiring training voice feature vectors.
For a specific implementation of this step, refer to step S114, which is not described herein again.
And S125, generating a preset gender model according to the training short-time frame feature vector and the gender corresponding to the training voice.
Processing each training voice in steps S122 to S124 to generate a corresponding voice feature vector, taking the voice feature vector as input information of the neural network model LSTM-RNN, taking the gender corresponding to the training voice as an output result, and after training a large amount of training voices, continuously updating and correcting each parameter in the preset gender model based on the voice feature, thereby obtaining a more perfect preset gender model.
In this embodiment, for the same training voice, the sum of the probability that the training voice is output as a male and the probability that the training voice is output as a female is 1.
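The constraint that the male and female output probabilities sum to 1 is typically realized by a softmax over the model's two output units; the following Python sketch illustrates this (the softmax formulation and the logit values are assumptions, since the embodiment does not specify the output layer).

```python
import numpy as np

def gender_probabilities(logits):
    """Softmax over the two output units (male, female); the two
    probabilities are non-negative and sum to 1, as required."""
    e = np.exp(logits - np.max(logits))  # shift for numerical stability
    p = e / e.sum()
    return {"male": float(p[0]), "female": float(p[1])}

probs = gender_probabilities(np.array([2.0, 1.0]))
```

The gender label can then be generated from whichever probability is larger; the preset age model works analogously, with one output unit per age or age bracket.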
And S126, testing the preset gender model by using other training voices, and finishing modeling if the testing accuracy is greater than or equal to the testing result threshold.
For a specific implementation of this step, refer to step S116, which is not described herein again.
Alternatively, the age in step S125 may be an accurate age according to actual needs, or may be an age range, and accordingly, the output result of the trained preset age model is an accurate age or an age range.
In this embodiment, the training mode of the preset age model is similar to the training mode of the preset gender model, and the difference is only that in step S125, the preset age model is generated according to the training speech feature vector and the age corresponding to the training speech.
In this embodiment, for the same training voice, the probabilities that the training voice is output as each age or age range sum to 1.
Further, the training mode of the other user label models is similar to that of the preset gender model; the only difference is that in step S125 the corresponding preset label model is generated according to the training voice feature vector and the other user label corresponding to the training voice.
Fig. 3 is a schematic flow chart of a preferred user portrait construction method in this embodiment. In the model use phase, that is, the user portrait construction method provided by this application, as shown in Fig. 3, the method includes:
S201, collecting the ith user voice, where i is 1, 2, 3, … ….
In this embodiment, the user voice may be collected by using a voice collecting device on the smart device.
In one implementation, the user voice includes a wake-up word, a command word, and/or free text.
The wake-up word is a preset word used to start the user portrait processing program; it may be customized by the user or set by a system developer.
The command word is a fixed phrase, typically with a verb-object structure, for example: "play music".
The free text may be any instruction or question issued by the user, for example: "what's the weather like today".
S202, short-time frame voice is obtained, and the short-time frame voice is generated by framing the ith user voice.
In this embodiment, the acquiring the short-time frame speech includes:
S221, dividing the ith user voice into a plurality of short-time frame voices according to a preset time length;
S222, if the duration of the last short-time frame voice is less than the preset time length, discarding the last short-time frame voice.
For a specific implementation of this step, refer to step S112, which is not described herein again.
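Steps S221 to S222 can be sketched as follows; the non-overlapping split and the sample-count frame length are simplifying assumptions (practical speech front ends often use overlapping frames of roughly 20 to 30 ms):

```python
def frame_speech(samples, frame_len):
    """Steps S221-S222: split one user voice (a sample sequence) into
    fixed-length short-time frames; a trailing frame shorter than
    frame_len is discarded rather than padded."""
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    if frames and len(frames[-1]) < frame_len:
        frames.pop()  # S222: last frame shorter than the preset length
    return frames
```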
S203, short-time frame feature vectors are obtained, wherein the short-time frame feature vectors are feature vectors extracted from each frame of short-time frames generated according to the ith user voice;
In this embodiment, the acquiring the short-time frame feature vector includes:
S231, pre-emphasizing the high-frequency part of each short-time frame voice, frame by frame;
S232, converting each pre-emphasized short-time frame into a frequency domain signal;
S233, calculating a short-time frame feature vector for each short-time frame according to the frequency domain signal, wherein the short-time frame feature vector comprises an inverse Mel energy spectrum.
For a specific implementation of step S203, refer to step S113, which is not described herein again.
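Steps S231 to S233 can be sketched as below. The first-order pre-emphasis filter and the naive DFT are illustrative stand-ins; a real implementation would use an FFT followed by a Mel filter bank to obtain the energy-spectrum features of step S233, and the coefficient 0.97 is a conventional choice, not taken from the application:

```python
import cmath

def pre_emphasize(frame, alpha=0.97):
    """Step S231: boost the high-frequency part of one short-time frame
    with a first-order filter y[n] = x[n] - alpha * x[n-1]."""
    return [frame[0]] + [frame[n] - alpha * frame[n - 1] for n in range(1, len(frame))]

def to_frequency_domain(frame):
    """Step S232: naive DFT magnitudes for the non-negative frequency bins.
    (An FFT plus Mel filtering would follow here to produce the
    energy-spectrum feature vector of step S233.)"""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                    for i, x in enumerate(frame)))
            for k in range(n // 2 + 1)]
```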
S204, obtaining a voice feature vector, wherein the voice feature vector is generated according to all short-time frame feature vectors corresponding to the ith user voice.
In this embodiment, the obtaining the speech feature vector includes: and sequentially fusing all short-time frame feature vectors corresponding to the ith user voice to generate voice feature vectors.
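The application does not specify the fusion operation of step S204; element-wise averaging of the frame vectors is one common choice, sketched below, because it yields a fixed-length voice feature vector regardless of utterance duration:

```python
def fuse_frame_vectors(frame_vectors):
    """Fuse all short-time frame feature vectors of one utterance into a
    single voice feature vector by element-wise averaging. The concrete
    fusion operation is an assumption; the text only says the frame
    vectors are fused in order."""
    dim = len(frame_vectors[0])
    count = len(frame_vectors)
    return [sum(v[d] for v in frame_vectors) / count for d in range(dim)]
```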
S205, generating a user feature tag, wherein the user feature tag is generated by using a preset model according to the voice feature vector, and the feature tag comprises a user identity tag, a gender tag and an age tag;
In this embodiment, the generating the user identity tag by using a preset voiceprint model according to the voice feature vector includes: generating an identity vector of the voice by using the preset voiceprint model according to the voice feature vector of the ith user voice; acquiring all stored identity vectors; calculating the cosine similarity between the identity vector of the ith voice and each stored identity vector; and generating the user identity tag according to the cosine similarity.
Specifically, the generating the user identity tag according to the cosine similarity includes:
if the cosine similarity between the user voice and a certain historical user voice is greater than the cosine similarity threshold, marking the user voice with the identity tag of that historical user voice; if the cosine similarity between the user voice and every historical user voice is less than the cosine similarity threshold, marking the user voice with a new identity tag.
For the same user voice, at least an identity label, an age label and a gender label are marked at the same time. On the basis of the identity label, if the number of accumulated user voices belonging to the same speaker is less than the preset accumulated value, then when the next user voice of that speaker is received, user information such as age and gender is again judged with the preset models and marked, and the speaker's user information is corrected by majority voting. This continues until the number of accumulated user voices belonging to the same speaker is greater than or equal to the preset accumulated value, at which point the user portrait is completed. Specifically:
s2061, if the number of the user voices with the same user identity label reaches a preset value, acquiring the voice to be imaged, wherein the voice to be imaged is the voice of all the users marked with the user identity labels.
S20611, generating the user portrait of the user according to all the gender labels and the age labels generated by the voice to be portrait.
In a preferred mode, if the number of user voices exceeds the accumulated preset value, generating the user portrait of the user according to all the gender labels and age labels generated from the voice to be imaged includes:
acquiring all the gender labels and age labels generated from the voice to be imaged;
determining, by majority voting over all the gender labels and age labels, the gender label and age label used to construct the user portrait;
generating the user portrait of the user, the user portrait including the user identity label of the user and the gender label and age label determined by majority voting.
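The majority-voting construction of the portrait can be sketched as follows; the label values and the dictionary representation of the portrait are illustrative assumptions:

```python
from collections import Counter

def build_user_portrait(identity_label, gender_labels, age_labels):
    """Steps S2061-S20611: once enough voices share one identity label,
    choose the portrait's gender and age by majority voting over the
    labels produced for each individual voice of that speaker."""
    return {
        "identity": identity_label,
        "gender": Counter(gender_labels).most_common(1)[0][0],
        "age": Counter(age_labels).most_common(1)[0][0],
    }
```

Majority voting makes the final portrait robust to occasional misjudgments of a single voice, which is why the application accumulates several voices per identity before committing to a portrait.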
S2062, if the number of user voices with the same user identity label is less than the preset value, collecting the (i+1)th user voice;
S20621, acquiring short-time frame voice, wherein the short-time frame voice is generated by framing the (i+1)th user voice;
S20622, acquiring short-time frame feature vectors, wherein the short-time frame feature vectors are feature vectors extracted from each frame of short-time frames generated according to the (i+1)th user voice;
S20623, acquiring a voice feature vector, wherein the voice feature vector is generated according to all short-time frame feature vectors corresponding to the (i+1)th user voice;
S20624, generating a user feature label, wherein the user feature label is generated by using the preset models according to the voice feature vector, and the feature label comprises a user identity label, a gender label and an age label.
In a second aspect, the present application also provides a user portrait construction apparatus, the apparatus comprising:
a user voice collecting unit 101, configured to collect an ith user voice, where i is 1, 2, 3, … …;
a short-time frame acquisition unit 102, configured to acquire short-time frame speech, where the short-time frame speech is generated by framing an ith user speech;
a feature vector generation unit 103, configured to acquire short-time frame feature vectors extracted from each frame of short-time frames generated according to the ith user speech;
the feature vector generating unit 103 is further configured to obtain a speech feature vector, where the speech feature vector is generated according to all short-time frame feature vectors corresponding to the ith user speech;
a user feature tag generating unit 104, configured to generate a user feature tag, where the user feature tag is generated according to the voice feature vector by using a preset model, and the feature tag includes a user identity tag, a gender tag, and an age tag;
the user portrait construction unit 105 is configured to, if the number of user voices with the same user identity tag reaches the preset value, acquire the voice to be imaged, where the voice to be imaged is all the user voices marked with that user identity tag;
the user portrait construction unit 105 is further configured to generate the user portrait of the user according to all the gender tags and age tags generated from the voice to be imaged.
Further, in the model use phase, an embodiment of the present application also provides a method for providing personalized services based on the user portrait construction method of the first aspect, the method including:
s301, obtaining user voice, and generating a user feature tag according to the user voice.
For a specific implementation manner of this step, reference may be made to steps S201 to S205, which are not described herein again.
S302, the user portrait corresponding to the user feature tag is called.
For a specific implementation manner of this step, reference may be made to steps S2061 to S20624, which are not described herein again.
And S303, sending an instruction for providing personalized service to the intelligent device according to the user portrait.
In this embodiment, personalized services may be provided to the speaker based on the user portrait result. For example, when a speaker says "Xiao Ai, I want to watch a movie" to a smart television: if the user portrait result is 70 years old and male, a movie suitable for elderly men is provided according to the user portrait; if the user portrait result is 40 years old and female, a movie suitable for middle-aged women is provided; and if the user portrait result is 10 years old and male, a movie suitable for boys is provided. For another example, for a shopping APP on the smart television, when the speaker says "long style coat": if the user portrait result is 70 years old and male, long-style-coat search results suitable for elderly men are provided according to the user portrait; if the user portrait result is 40 years old and female, search results suitable for middle-aged women are provided; and if the user portrait result is 10 years old and male, search results suitable for boys are provided.
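A minimal sketch of the portrait-to-content mapping used in the examples above; the age bands and category names are assumptions made for illustration, not part of the application:

```python
def select_content_category(user_portrait):
    """Illustrative mapping from a retrieved user portrait to a content
    category for step S303. The age-band boundaries (18 and 60) and the
    category names are hypothetical choices for this example."""
    age = user_portrait["age"]
    gender = user_portrait["gender"]
    if age < 18:
        group = "children"
    elif age < 60:
        group = "middle-aged"
    else:
        group = "elderly"
    return "%s %s" % (group, "male" if gender == "male" else "female")
```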
Further, if no user portrait is retrieved, a service is provided according to a preset mode, and the user portrait of the user is generated from the user voice by using the above user portrait construction method; after the user portrait of the speaker is generated, personalized services are provided to the speaker by the methods of steps S301 to S303.
In an eighth aspect, the present application further provides a personalized service providing apparatus, including:
a user voice obtaining module 301, configured to obtain a user voice, and generate a user feature tag according to the user voice;
a user portrait retrieval module 302, configured to retrieve a user portrait corresponding to the user feature tag;
and the service instruction generating module 303 is configured to send an instruction for providing personalized service to the intelligent device according to the user portrait.
The present application has been described in detail with reference to specific embodiments and illustrative examples, but the description is not intended to limit the application. Those skilled in the art will appreciate that various equivalent substitutions, modifications or improvements may be made to the disclosed embodiments and implementations without departing from the spirit and scope of the present application, and these fall within its scope. The protection scope of this application is defined by the appended claims.