CN116564339A - Safe and efficient vehicle-mounted voice recognition method and system based on federal learning - Google Patents

Safe and efficient vehicle-mounted voice recognition method and system based on federal learning

Info

Publication number
CN116564339A
CN116564339A (application CN202310632039.4A)
Authority
CN
China
Prior art keywords
data
voice
model
voice recognition
encryption
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310632039.4A
Other languages
Chinese (zh)
Inventor
盖乐
孙彦博
张俊伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202310632039.4A priority Critical patent/CN116564339A/en
Publication of CN116564339A publication Critical patent/CN116564339A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/008 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols involving homomorphic encryption
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention provides a safe and efficient vehicle-mounted voice recognition system and recognition method based on federated learning, which mainly solve the problems of data leakage and low voice recognition accuracy in existing network communication. The system comprises a voice recognition module, a data encryption module and a distributed computing module. The voice recognition module recognizes the driver's voice signal recorded by the vehicle-mounted microphone as specific instruction data, adopting a deep convolutional neural network with connectionist temporal classification as the acoustic model and a maximum-entropy hidden Markov model as the language model. The data encryption module encrypts and decrypts the instructions transmitted between the client and the central end with a homomorphic encryption algorithm under the federated learning framework, ensuring data security. The distributed computing module coordinates model training between the in-vehicle devices and the cloud server to obtain updated voice recognition parameters. The invention avoids data leakage, improves the accuracy of voice recognition, and can be used for Internet of Vehicles communication.

Description

Safe and efficient vehicle-mounted voice recognition method and system based on federal learning
Technical Field
The invention belongs to the field of the Internet of Vehicles, and further relates to a vehicle-mounted voice recognition system that can be used to realize safer, more efficient and more accurate vehicle-mounted voice recognition in Internet of Vehicles communication.
Background
With the continuing development of the automobile industry, the trend toward intelligent vehicles grows ever stronger, and vehicle-mounted voice recognition is gradually becoming standard automotive equipment. However, when voice recognition technology is used in vehicles, privacy disclosure and security issues attract increasing attention. Vehicle-mounted voice recognition, a technology widely applied in the automotive and transportation fields, converts a driver's voice instructions into processable text information, realizing wireless interaction between the driver and the vehicle. The technology analyzes and recognizes the voice signal using voice signal processing algorithms and machine learning; with the continual improvement of hardware and algorithms, its recognition accuracy and response speed have improved markedly. In particular, the application of deep learning, above all the development of deep neural networks, has brought a qualitative leap to voice recognition, raising the accuracy and response speed of vehicle-mounted voice recognition systems to a new level.
However, with the development of the Internet of Vehicles, vehicle-mounted voice recognition systems also face potential safety hazards such as misrecognition and malicious instruction attacks. To ensure the driver's safety, a series of technical means are required for protection. First, the voice signal must be processed, including voice preprocessing, feature extraction and noise reduction, to improve the accuracy and stability of recognition. Second, a powerful voice recognition model must be established, making full use of artificial intelligence technologies such as deep learning to improve recognition accuracy. In addition, the validity of voice commands must be verified to prevent malicious instruction attacks; the user's identity can be verified by means such as voice fingerprint recognition and voiceprint recognition, ensuring the legitimacy of the instruction. At the same time, the security hardening and monitoring of the vehicle-mounted voice recognition system must be strengthened so that potential security risks are discovered and handled in time.
Patent application CN202210325453.6, filed by the Institute of Acoustics of the Chinese Academy of Sciences, discloses "a personalized federated learning method and system of a speech recognition model", comprising: step one: performing supervised training on a voice recognition model with locally labelled voice samples to obtain a seed model comprising a feature extractor and a classifier; step two: using the feature extractor to process the voice information input by a plurality of clients and determine the personalized features of each client's voice information; step three: performing semi-supervised training of the classifier with the personalized features of each client's voice information to obtain a trained classifier; step four: transmitting the trained classifier to each of the plurality of clients. The method does not specify how the personalized features of each client's voice information are determined, which can cause difficulty in practical application. It also considers only decoupled optimization between the feature extractor and the classifier to realize personalized federated learning, omitting other possible optimization approaches, so the privacy protection effect is not pronounced. In addition, the method does not account for the heterogeneity and reliability of client devices, such as computing capacity, network bandwidth and offline time, which affects the effectiveness and reliability of federated learning in practice.
Patent document CN201910491042.2, filed by Shanghai NIO Automobile Co., Ltd., discloses "a user-personalized offline speech recognition method and system for a vehicle-mounted system", implemented as follows: step one: collecting the online automatic voice recognition results of users of the vehicle-mounted system; step two: screening the online automatic voice recognition results to obtain the corpus of an online voice model; step three: training on the corpus to generate the online voice model; step four: fusing the online voice model with a known universal language model to obtain a new language model. Although after training and fusion the offline speech recognition model can approximate the online recognition result as closely as possible, a certain gap remains between the offline and online results owing to factors such as environment and speech quality. The method still has the following two defects:
First, to generate a user's personalized offline voice recognition model, the user's online voice recognition results must be collected, which may raise privacy concerns.
Second, generating the personalized offline speech recognition model requires multiple stages of data collection, screening, training and fusion, consuming considerable time and computing resources.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a safe and efficient vehicle-mounted voice recognition system and recognition method based on federated learning, so as to protect the privacy of the vehicle owner's voice instructions, reduce computation cost, lower communication delay, and improve the effectiveness and reliability of data protection.
The technical idea for realizing the purpose of the invention is as follows: the voice recognition accuracy problem is addressed with a deep convolutional neural network and connectionist temporal classification; privacy protection of voice data and low-latency communication are achieved together through an asymmetric-key design based on homomorphic encryption; and a distributed computing method improves the training efficiency of the speech recognition model while reducing computation overhead.
According to the above idea, the implementation steps of the invention include the following:
1. A safe and efficient vehicle-mounted voice recognition system based on federated learning, characterized by comprising:
a voice recognition module, which adopts a deep convolutional neural network with connectionist temporal classification as the acoustic model and a maximum-entropy hidden Markov model as the language model to realize voice recognition, and is used for recognizing the driver's voice signal recorded by the vehicle-mounted microphone as specific instruction data;
a data encryption module, which encrypts and decrypts the instruction data transmitted between the client and the central end with an encryption algorithm based on homomorphic encryption under the federated learning framework, ensuring the security of the data;
and a distributed computing module, which coordinates model training between the in-vehicle devices and the cloud server to obtain an updated voice recognition module.
Further, the acoustic model in the speech recognition module comprises a deep convolutional neural network and connectionist temporal classification;
the deep convolutional neural network comprises several convolution layers and pooling layers, the output feature map of each convolution layer corresponding to the output feature map of one pooling layer; the convolution layers extract the spatial features of the raw input data through convolution operations, and the pooling layers reduce the feature-map size through downsampling while retaining the important feature information, so as to extract the feature representation of the input voice;
the connectionist temporal classification fuses the temporal information with the feature representation, handles the misalignment between the time-series data and the labels, learns the feature information extracted by the convolution and pooling layers, and uses the learned features to classify or recognize the voice signal, producing the final voice recognition result.
Further, the data encryption module uses an encryption algorithm based on homomorphic encryption to encrypt and decrypt the instruction data transmitted between the client and the central terminal, and the realization is as follows:
the client encrypts instruction data to be transmitted by using a public key, and the encrypted data is transmitted to the central end through a secure communication channel;
after the central terminal receives the encrypted instruction data, the central terminal executes instruction identification operation on the encrypted instruction data, and then returns an encryption result to the client terminal through a safe communication channel;
and the client decrypts the received encryption result by using the private key to obtain a specific instruction result sent by the driver.
Further, the distributed computing module coordinates model training between the in-vehicle device and the cloud server, and is realized as follows:
selecting a group of local devices to participate in federated learning, dividing the local data set of each device into several mini-batches, loading the parameters of the global model on each local device, and obtaining the parameter gradient;
all the parameter gradients are sent to a central server to be aggregated into a global gradient, the parameters of the global model are updated according to the global gradient,
the updated global model parameters are sent back to each local device,
repeating the above process until the global model converges or reaches a preset training round number, and obtaining the trained global model parameters.
2. A safe and efficient vehicle-mounted voice recognition method based on federated learning, comprising the following steps:
1) Segmentation pre-processing of voice data:
2) Extracting power spectrum characteristics of the pre-processed voice signal by using a Mel Frequency Cepstrum Coefficient (MFCC) voice characteristic extraction method, and extracting frequency domain characteristics of the voice signal by simulating the characteristics of a human ear auditory system;
3) Voice data noise processing:
carrying out noise removal on the vehicle-mounted noise with a spectral-subtraction speech enhancement algorithm: decomposing the original signal, removing the noise component, and reconstructing the amplitude |Ŝ_k| of the enhanced speech signal:

|Ŝ_k| = √( |X_k|² − |N_k|² ),

where |Ŝ_k| is the amplitude of the enhanced speech signal, |X_k|² is the power spectrum of the noisy speech after the Fourier transform, and |N_k|² is that of the additive Gaussian noise after the Fourier transform; the amplitude |Ŝ_k| is then subjected to phase processing and an inverse Fourier transform in turn, the spectrum being reconstructed from the phase information, to obtain the noise-reduced speech signal;
4) Establishing the deep convolutional neural network CTC-CNN based on connectionist temporal classification;
4a) Establishing an input layer with a two-dimensional matrix of speech signals as input:
4b) Selecting the size and the number of convolution kernels according to the size and the feature number of input data of an input layer, performing one-dimensional convolution operation on the input data and the convolution kernels, and calculating the features in the input data learned by a unit neuron through each position of the convolution kernels when the convolution kernels slide on the input data to construct the convolution layer;
4c) Respectively establishing a global pooling layer and a full connection layer;
4d) Sequentially connecting an input layer, a convolution layer, a global pooling layer and a full connection layer to form a deep convolution neural network CNN;
4e) Establishing the loss function of connectionist temporal classification (CTC) and using it as the loss function L(S) of the CNN network, forming the connectionist-temporal-classification deep convolutional neural network CTC-CNN, where:

L(S) = −Σ_{(x,z)∈S} ln p(z|x),

where p(z|x) denotes the probability that, given the input x, z is the output sequence, and S is the data set;
5) Encrypting and decrypting the voice data:
5a) The encryption function E is defined as follows:

c_d = m_d + F_k(i||j||d) − F_k(i||(j+1)||d),

where i and j are the indices of the federated round and of the client associated with the encryption, d is the serial number of the data item, '||' denotes the concatenation operation, (i||j||d) denotes the concatenation of identifiers i, j, d and (i||(j+1)||d) that of identifiers i, (j+1), d, F_k maps an identifier into the data set Z_n, and m_d denotes the plaintext before encryption;
5b) Substituting the message m and the secret key k into the encryption function yields the triple E_k(m):

E_k(m) = (c, i, j),

where c in the triple denotes the encrypted data;
5c) The decryption function D is defined as follows:

m_d = c_d + Σ ( F_k(i||(j+1)||d) − F_k(i||j||d) ),

where c_d is the ciphertext data with serial number d after encryption;
5d) Substituting the encrypted voice data c into the decryption function D recovers the data m before encryption:

m = D_k((c, i, S));
6) Aggregating the local model to a central server:
6a) Randomly selecting a group of local devices to participate in federal learning, and dividing a local data set of each device into a plurality of small batch data;
initializing the parameters θ of the global model; for each local device i ∈ {1, 2, …, n}, first loading the parameters θ of the global model on the local device, then calculating the loss function with the mini-batch data on the local data set, and obtaining the parameter gradient g_i;
6b) Sending the parameter gradients g_i of all local devices to the central server; the central server aggregates the parameter gradients of all local devices into the global gradient g:

g = Σ_{i=1}^{n} (N_i / N) · g_i,

where N_i is the data-set size of local device i and N is the total data-set size of all local devices;
6c) The central server updates the parameters θ of the global model using the global gradient g:

θ ← θ − η · g,

where η is the learning rate;
6d) Sending the updated global model parameters θ back to each local device;
6e) Repeating steps 6 a) through 6 d) until the global model converges or a predetermined number of training rounds is reached.
7) And feeding back according to the identification result:
comparing the local model parameters with cloud server model parameters to determine whether model parameter updating is needed:
if the parameters are inconsistent, acquiring the latest model parameters from the cloud server by using an encrypted OTA technology to update the local model parameters;
otherwise, the result of the voice recognition is fed back to the driver.
Compared with the prior art, the invention has the following advantages:
according to the invention, the deep convolutional neural network and the connection time sequence classification method are used as acoustic models, and the maximum entropy hidden Markov model is used as a language model to realize voice recognition, so that the problem of voice recognition accuracy is fundamentally solved, noise interference can be removed according to different conditions, clean and undistorted target voice is achieved, and the accuracy of voice recognition is further improved.
And secondly, the invention only relates to modularized addition operation and random number in the encryption realization process due to the use of the homomorphic encryption-based asymmetric key design, thus obviously reducing the calculation overhead and the communication delay without weakening the privacy protection, overcoming the condition that the safety and the low delay in the prior art cannot be taken into account, and enabling the invention to meet the use requirement under the specific condition of the automobile and the automobile voice recognition.
Thirdly, the invention adopts the distributed computing idea of federal learning, namely, after model training is carried out by using local equipment, model parameters are sent to the central server, and the central server updates the parameters of the local model after parameter aggregation.
Fourth, the invention combines federal study and deep study, makes the best of the two, further improves the performance and efficiency of the model. The method is characterized in that a large amount of data are used for training through deep learning, and a complex model is fitted, so that high accuracy can be achieved; however, in practical situations, due to the influence of various factors, the training data may have problems of non-uniformity, noise and missing, which may lead to instability of the deep learning model; through federal learning, the model can be trained by utilizing the data on a plurality of devices, so that the deviation and noise of training data are effectively reduced, and the robustness of the model is improved.
Drawings
FIG. 1 is a system frame diagram of the present invention;
FIG. 2 is a schematic diagram of the speech recognition module in the system of the present invention;
FIG. 3 is a flow chart of an implementation of the method of the present invention;
FIG. 4 is a schematic representation of the aggregation of local models to the central server in the method of the present invention.
Detailed Description
Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, the federal learning-based safe and efficient vehicle-mounted voice recognition system comprises a voice recognition module 1, a data encryption module 2 and a distributed computing module 3. The voice recognition module 1 is respectively connected with the data encryption module 2 and the distributed computation module 3.
Referring to fig. 2, the voice recognition module 1 comprises an acoustic model 11 and a language model 12, and is used for recognizing the driver's voice signal recorded by the vehicle-mounted microphone as specific instruction data:
the acoustic model 11 is formed by connecting a deep convolutional neural network 111 and a connection time sequence classification unit 112, wherein:
the deep convolutional neural network 111 is formed by a plurality of convolutional layers and pooling layers alternately, an output characteristic diagram of each convolutional layer corresponds to an output characteristic diagram of one pooling layer, wherein the number and the size of convolutional kernels are dynamically determined through the size of an input two-dimensional matrix, the pooling layers perform nonlinear transformation by using a sigmoid activation function, the convolutional layers extract spatial characteristics of original input data, namely frequency spectrum information of a voice signal, through convolution operation, and the pooling layers reduce the size of the characteristic diagram through downsampling operation to reserve important characteristic information for extracting characteristic representation of input voice. In the processing process of the voice signal, the deep convolutional neural network can effectively extract the time domain and frequency domain characteristics of the voice signal, such as speech speed, intonation and pronunciation; meanwhile, the convolution layer and the pooling layer are used alternately, so that the size of the feature map can be gradually reduced, and the most important feature information is reserved, so that the representation capability and the robustness of the model to the voice signal are improved.
The connectionist temporal classification unit 112 produces the representation that serves as input to the subsequent voice-instruction recognition task, and fuses the temporal information with the feature representation. In processing voice signals, the same word may have different lengths and timing in different utterances because people speak at different rates and with different intonation. To address this, the connectionist temporal classification unit is introduced into the acoustic model: it handles the misalignment between the time-series data and the labels, learns the feature information extracted by the convolution and pooling layers, reduces the dimensionality of the extracted feature map while retaining the important feature information, and fuses the temporal information with the feature representation through a learned weight matrix to output the final voice recognition result. This effectively solves the misalignment between time-series data and labels and improves the generalization and robustness of the model.
The language model 12 adopts a maximum-entropy hidden Markov model. It uses feature functions to represent different language characteristics, captures the context information in a language sequence to represent the characteristics and distribution of the language accurately, and learns, under constraint conditions, a parameter distribution adapted to the specific task. The model follows the maximum-entropy principle: among the models satisfying the given conditions, the one with the most uniform probability distribution is selected. To this end, feature functions represent different language features, such as parts of speech, word shapes and syntactic relations. To capture context information better, the model considers the influence of the current state in the sequence on future states; training under constraint conditions yields a parameter distribution suited to the specific task, improving the performance and generalization ability of the model.
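For illustration, the following minimal Python sketch shows how a maximum-entropy model of this kind scores candidate next words with feature functions; the feature names, weights and normalization below are illustrative assumptions, not the patent's parameterization.

```python
import math

def maxent_score(features, weights):
    # Unnormalised maximum-entropy score exp(sum_i lambda_i * f_i) for one candidate;
    # binary features are named strings, weights is the lambda vector as a dict.
    return math.exp(sum(weights.get(f, 0.0) for f in features))

def next_word_probs(prev_word, candidates, weights, feat_fn):
    # MEMM-style local distribution p(w | prev) proportional to exp(sum lambda_i f_i(prev, w)).
    scores = {w: maxent_score(feat_fn(prev_word, w), weights) for w in candidates}
    z = sum(scores.values())                     # normaliser of the maximum-entropy model
    return {w: s / z for w, s in scores.items()}

# toy feature functions: bigram identity and candidate length (hypothetical names)
feat_fn = lambda prev, w: [f"bigram={prev}_{w}", f"len={len(w)}"]
weights = {"bigram=turn_on": 1.5, "len=2": 0.3}
print(next_word_probs("turn", ["on", "off", "around"], weights, feat_fn))
```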
The data encryption module 2 encrypts and decrypts the instruction data transmitted between the client and the central end with an encryption algorithm based on homomorphic encryption, ensuring the security of the data. Homomorphic encryption is a special encryption technique that can operate on ciphertext, without knowledge of the plaintext, and obtain an encrypted result; in this example it is applied to protect the instruction data during encryption and decryption. With this algorithm the instruction data transmitted between the client and the central server are strongly protected: even if an attacker obtains the data, no malicious operation or theft of sensitive information is possible. At the same time, the homomorphic-encryption-based algorithm improves data processing efficiency, reduces the pressure on network bandwidth, and improves the stability and reliability of the system.
The distributed computing module 3 is used for coordinating model training between in-vehicle equipment and a cloud server to obtain an updated voice recognition module. In the distributed computing module, a federal learning method is adopted to divide a local data set on a plurality of devices in a vehicle, a global model is used as an initial model, a loss function is computed on each local device, and a parameter gradient is obtained; then, the parameter gradients of all the local devices are sent to a central server, the central server aggregates the parameter gradients to generate a global gradient, and the global gradient is used for updating the parameters of the global model; finally, the updated global model parameters are sent back to each local device; the updated speech recognition module is obtained by repeating this process until the global model converges or a predetermined number of training rounds is reached.
Referring to fig. 3, the implementation steps of the safe and efficient vehicle-mounted voice recognition method based on federated learning in this example are as follows:
and step 1, pre-processing voice data.
The system records voice signals of a driver through the vehicle-mounted microphone, wherein the signals comprise sound, noise, echo and other interference;
the recorded speech signal is segmented using an endpoint detection algorithm, i.e. the speech signal is divided into individual speech units or frames by the following formula:
where s (N) represents the sample value of the original audio signal, N is the window length, E t The energy value at time t is indicated.
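A minimal sketch of such energy-based endpoint detection is given below; the window length, hop size and the 10% energy threshold are illustrative assumptions, not values from the patent.

```python
import numpy as np

def short_time_energy(signal, frame_len=400, hop=160):
    # E_t = sum over the length-N window of s(n)^2, computed every `hop` samples
    frames = range(0, len(signal) - frame_len + 1, hop)
    return np.array([np.sum(signal[t:t + frame_len].astype(np.float64) ** 2)
                     for t in frames])

def endpoint_detect(signal, frame_len=400, hop=160, ratio=0.1):
    # A frame counts as speech when its energy exceeds `ratio` of the peak energy;
    # contiguous speech frames are merged into (start_sample, end_sample) segments.
    energy = short_time_energy(signal, frame_len, hop)
    active = energy > ratio * energy.max()
    segments, start = [], None
    for idx, is_speech in enumerate(active):
        if is_speech and start is None:
            start = idx
        elif not is_speech and start is not None:
            segments.append((start * hop, idx * hop + frame_len))
            start = None
    if start is not None:
        segments.append((start * hop, len(signal)))
    return segments

# usage: 1 s of low-level noise with a 0.25 s tone burst standing in for speech
sig = np.random.randn(16000) * 0.01
sig[6000:10000] += np.sin(np.arange(4000) * 0.3)
print(endpoint_detect(sig))
```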
Step 2: extract features from the voice data.
This step applies the Mel-frequency cepstral coefficient (MFCC) feature extraction method to the pre-processed voice signal to extract its power-spectrum features, implemented as follows:
2.1) The conversion between the Mel frequency and the Hz frequency is performed with:

M(f) = 2595 · log₁₀(1 + f/700),

where M(f) is the perceived (Mel) frequency and f is the actual frequency;
2.2) In MFCC feature extraction, the voice signal is subjected to a short-time Fourier transform, converting it from the time domain to the frequency domain; the energy spectrum of the voice signal is then filtered with a Mel filter bank to obtain the energy value of each Mel band:

E(m) = Σ_k H_m(k) · |X(k)|²,

where the triangular response of the m-th filter is

H_m(k) = (k − f(m−1)) / (f(m) − f(m−1)) for f(m−1) ≤ k ≤ f(m),
H_m(k) = (f(m+1) − k) / (f(m+1) − f(m)) for f(m) ≤ k ≤ f(m+1),
H_m(k) = 0 otherwise,

m is the number of the Mel filter, k is a frequency bin in the frequency domain, |X(k)|² is the energy spectrum, and f(m) is the centre frequency of the m-th Mel filter, obtained by converting points equally spaced on the Mel scale back to Hz;
2.3) A discrete cosine transform is applied to the energy value of each Mel band to obtain the MFCC coefficients of each band; finally, the MFCC coefficients are windowed and normalized to obtain the final MFCC feature representation.
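The MFCC pipeline of steps 2.1) to 2.3) can be sketched as follows; the FFT size, filter-bank size and number of coefficients are illustrative assumptions.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)    # M(f) = 2595 log10(1 + f/700)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frames, sr=16000, n_fft=512, n_mels=26, n_ceps=13):
    # frames: (num_frames, frame_len) array of windowed speech frames
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2        # energy spectrum |X(k)|^2
    # centre frequencies f(m): points equally spaced on the Mel scale, mapped back to Hz
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):               # triangular responses H_m(k)
        lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)
        fbank[m - 1, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)
    band_energy = np.log(power @ fbank.T + 1e-10)            # E(m) per Mel band, logged
    return dct(band_energy, type=2, axis=1, norm="ortho")[:, :n_ceps]

feats = mfcc(np.random.randn(100, 400))          # 100 frames -> (100, 13) MFCCs
```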
Step 3: voice data noise processing.
In a special environment where a vehicle is traveling, the influence of noise on a voice signal is unavoidable, wherein the vehicle-mounted noise is a low-frequency signal whose frequency distribution is relatively stable. Based on the characteristics, the spectral subtraction becomes a very applicable voice enhancement algorithm, which can effectively improve the quality of voice signals, make the voice signals clearer and more natural and reduce noise interference. As a classical speech enhancement algorithm, spectral subtraction uses the characteristic that human ear hearing is insensitive to spectral phase and the uncorrelation of noise and speech in the frequency domain to perform noise suppression processing, and is specifically implemented as follows:
3.1) Let the noisy speech, the clean speech and the additive Gaussian noise signal be x(n), s(n) and n(n) respectively; they satisfy:

x(n) = s(n) + n(n),

where n is the sample index with 1 ≤ n ≤ K, K is the frame length of the signal, and the frame index is l = 1, …, L, with L the total number of frames;
3.2) From the noisy speech x(n), the clean speech signal s(n) and the noise n(n), derive the amplitude |Ŝ_k| of the enhanced speech signal:

Apply the Fourier transform to x(n), s(n) and n(n) to obtain the transformed noisy speech X_k, clean speech signal S_k and noise N_k, where:

X_k = |X_k| · exp(jθ_k), with j the imaginary unit and θ_k the phase angle of X_k,

S_k = |S_k| · exp(jα_k), with α_k the phase angle of S_k;

since the Fourier coefficients are uncorrelated: X_k = S_k + N_k.

Taking moduli, the power spectrum of the noisy speech is:

|X_k|² = |S_k|² + |N_k|² + S_k·N_k* + S_k*·N_k,

where N_k* and S_k* denote the complex conjugates of N_k and S_k.

Taking the mathematical expectation of |X_k|²:

E[|X_k|²] = E[|S_k|²] + E[|N_k|²].

The short-time stationary signal within any frame is expressed as:

|Ŝ_k|² = |X_k|² − λ_N(k),

where λ_N(k) is the statistical average of the noise power |N_k|² over non-speech segments.

Substituting the short-time stationary expression into the mathematical expectation and taking the square root gives the amplitude of the enhanced speech signal:

|Ŝ_k| = √( |X_k|² − λ_N(k) ),

where |X_k|² is the power spectrum of the noisy speech after the Fourier transform and λ_N(k) is the averaged noise power after the Fourier transform.

3.3) The amplitude |Ŝ_k| of the enhanced speech signal is then subjected to phase processing and an inverse Fourier transform in turn, the spectrum being reconstructed from the phase information, to obtain the noise-reduced speech signal.
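A compact sketch of the spectral subtraction of steps 3.1) to 3.3) follows; the frame length, hop and Hann window are illustrative assumptions, and the averaged noise power is assumed to be estimated from noise-only frames beforehand.

```python
import numpy as np

def spectral_subtraction(noisy, noise_power, n_fft=512, hop=256):
    # Per frame: |S_hat_k| = sqrt(max(|X_k|^2 - noise_power, 0)); the spectrum is
    # rebuilt with the noisy phase theta_k and inverse-transformed (overlap-add).
    # noise_power: averaged |N_k|^2 over non-speech frames, shape (n_fft//2 + 1,).
    out = np.zeros(len(noisy))
    window = np.hanning(n_fft)
    for start in range(0, len(noisy) - n_fft + 1, hop):
        frame = noisy[start:start + n_fft] * window
        X = np.fft.rfft(frame)
        mag = np.sqrt(np.maximum(np.abs(X) ** 2 - noise_power, 0.0))
        S_hat = mag * np.exp(1j * np.angle(X))   # keep the noisy phase
        out[start:start + n_fft] += np.fft.irfft(S_hat, n=n_fft)
    return out

# usage: estimate the noise power from a leading noise-only stretch of the signal
noisy = np.random.randn(16000) * 0.05
lead = np.stack([noisy[i:i + 512] * np.hanning(512) for i in range(0, 4096, 256)])
noise_power = np.mean(np.abs(np.fft.rfft(lead)) ** 2, axis=0)
clean = spectral_subtraction(noisy, noise_power)
```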
Step 4: establish the CTC-CNN model.
4.1 CNN model building:
the CNN model structure comprises a convolution layer, a pooling layer, a full connection layer and the like and is used for extracting characteristics of input data. In a convolution layer, filtering operation is carried out on input data through convolution operation, and local characteristics and spatial relations are captured, wherein the number and the size of convolution kernels are dynamically determined through the size of an input two-dimensional matrix; the pooling layer uses a Sigmoid activation function and a mean pooling method, so that the size of the feature map can be reduced, and the most important feature information can be extracted; the full connection layer connects the output of the pooling layer with all neurons of the previous layer, and is used for learning and processing the features extracted in the previous layer, and the connection weight relationships among the neurons are as follows:
Let W_{a(i)c(j)} denote the connection weight between the i-th neuron on the a-th feature plane of the input and the j-th neuron on the c-th feature plane of the output; weight sharing then gives:

W_{a(i)c(j)} = W_{a(i+1)c(j+1)} = W_{a(i+2)c(j+2)}.
4.2 Coupling CNN model with CTC unit
Null (blank) nodes are introduced so that the speech frames need not be strictly aligned; the likelihood between the input sequence and the output target sequence is then optimized by CTC with a maximum-likelihood criterion, where for a single training pair:

L(x, z) = −ln p(z|x),

and the CTC loss function is defined as:

L(S) = Σ_{(x,z)∈S} L(x, z) = −Σ_{(x,z)∈S} ln p(z|x),

where p(z|x) denotes the probability of outputting the sequence z given the input x, and S is the training set; with this criterion the CTC unit can find the output sequence with the highest probability for a given input.
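The CTC-CNN of step 4 can be sketched in PyTorch as below; channel widths, kernel sizes and the label inventory are illustrative assumptions, and the per-frame outputs required by the CTC loss are kept, with pooling applied only along time.

```python
import torch
import torch.nn as nn

class CTCCNN(nn.Module):
    # 1-D convolutions over MFCC frames with time-only pooling, then a per-frame
    # linear classifier producing log-probabilities for the CTC criterion.
    def __init__(self, n_feats=13, n_classes=29):          # 28 labels + 1 CTC blank
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_feats, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AvgPool1d(kernel_size=2),                   # halves the frame rate
        )
        self.fc = nn.Linear(128, n_classes)

    def forward(self, x):                                  # x: (batch, n_feats, time)
        h = self.conv(x).transpose(1, 2)                   # (batch, time', 128)
        return self.fc(h).log_softmax(dim=-1)              # log p(label | frame)

# L(S) = -sum ln p(z|x): nn.CTCLoss expects (time, batch, classes) log-probs
model, ctc_loss = CTCCNN(), nn.CTCLoss(blank=0)
x = torch.randn(4, 13, 200)                                # 4 utterances of MFCCs
log_probs = model(x).transpose(0, 1)                       # (100, 4, 29)
targets = torch.randint(1, 29, (4, 20))                    # label sequences z
loss = ctc_loss(log_probs, targets,
                torch.full((4,), log_probs.size(0), dtype=torch.long),
                torch.full((4,), 20, dtype=torch.long))
loss.backward()
```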
Step 5: design the voice-data encryption and decryption functions.
To increase security, an ordered tuple (i, j) is selected for each ciphertext created. Given a message m and a secret key k, the encryption function is defined as follows:

E_k(m) = (c, i, j),
c_d = m_d + F_k(i||j||d) − F_k(i||(j+1)||d),

where i and j are respectively the federated round and the client index associated with the encryption, c denotes the encrypted ciphertext, c_d the encrypted data with serial number d, 1 ≤ d ≤ D with D the maximum data serial number, '||' denotes the concatenation operation, (i||j||d) denotes the concatenation of identifiers i, j, d and (i||(j+1)||d) that of identifiers i, (j+1), d, and F_k maps an identifier into the data set Z_n;
the decryption function is defined as follows:
D k ((c,i,S))=m,
wherein ciphertext (c, i, S) is a parameter substituted into the decryption function, S is a range set of ciphertext c after encryption, m represents plaintext before encryption, m d Plaintext data representing a specified sequence number, expressed as:
m d =(c d +∑(F k (i||(j+1)||d)-F k (i||j||d)))。
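This masking construction lends itself to a short sketch; here the pseudo-random function F_k is instantiated with HMAC-SHA256 and the group Z_n with 64-bit integers, both assumptions for illustration rather than the patent's choices. Summing the ciphertexts of consecutive clients makes the masks telescope, so only the aggregate can be decrypted:

```python
import hmac, hashlib

N = 2 ** 64          # stands in for the group Z_n; the modulus size is an assumption

def F(key: bytes, i: int, j: int, d: int) -> int:
    # PRF F_k over the concatenated identifier i||j||d, reduced into Z_n;
    # HMAC-SHA256 is an assumed instantiation, not specified by the patent.
    msg = f"{i}||{j}||{d}".encode()
    return int.from_bytes(hmac.new(key, msg, hashlib.sha256).digest(), "big") % N

def encrypt(key, m_d, i, j, d):
    # c_d = m_d + F_k(i||j||d) - F_k(i||(j+1)||d)  (mod n)
    return (m_d + F(key, i, j, d) - F(key, i, j + 1, d)) % N

def decrypt_sum(key, c_sum, i, S, d):
    # For ciphertexts of consecutive clients j in S the masks telescope, so adding
    # the two boundary PRF values recovers the sum of the plaintexts.
    lo, hi = min(S), max(S) + 1
    return (c_sum + F(key, i, hi, d) - F(key, i, lo, d)) % N

key, i, d = b"shared-secret", 7, 0
plain = {1: 11, 2: 22, 3: 33}                    # client index j -> plaintext m_d
c_sum = sum(encrypt(key, m, i, j, d) for j, m in plain.items()) % N
assert decrypt_sum(key, c_sum, i, range(1, 4), d) == sum(plain.values()) % N
```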
Step 6: aggregate the local models to the central server.
Referring to fig. 4, this step is implemented as follows:
6.1) Select a group of local devices to participate in federated learning, divide the local data set of each device into several mini-batches, and load the parameters θ of the global model onto the local devices;
6.2) For each local device i ∈ {1, 2, …, n}, compute the loss function of the speech recognition model with the mini-batch data on the local data set and obtain its parameter gradient g_i;
6.3) Send the parameter gradients g_i of all local devices to the central server, which aggregates them into the global gradient g:

g = Σ_{i=1}^{n} (N_i / N) · g_i,

where N_i is the data-set size of local device i and N is the total data-set size of all local devices;
6.4) The central server updates the parameters θ of the global model using the global gradient g:

θ ← θ − η · g,

where η is the learning rate.
6.5) Repeat steps 6.1) to 6.4) until the global model converges or the preset number of training rounds is reached, obtaining the updated global model parameters θ′, which are sent back to each local device.
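Steps 6.1) to 6.5) amount to the following aggregation loop; the callback interface and the toy quadratic losses standing in for the local CTC losses are illustrative assumptions.

```python
import numpy as np

def aggregate(grads, sizes):
    # g = sum_i (N_i / N) * g_i
    total = sum(sizes)
    return sum((n / total) * g for g, n in zip(grads, sizes))

def federated_round(theta, grad_fn, devices, eta=0.1):
    # One round of steps 6.1)-6.4): every device returns (gradient at theta, N_i),
    # the server aggregates and applies theta <- theta - eta * g.
    grads, sizes = zip(*(grad_fn(dev, theta) for dev in devices))
    return theta - eta * aggregate(grads, sizes)

# toy stand-in: device dev holds a target vector and its data-set size N_i, with
# local loss 0.5 * ||theta - target||^2, whose gradient is theta - target
devices = [(np.array([1.0, 2.0]), 100), (np.array([3.0, -1.0]), 300)]
grad_fn = lambda dev, theta: (theta - dev[0], dev[1])
theta = np.zeros(2)
for _ in range(50):                              # step 6.5): repeat until convergence
    theta = federated_round(theta, grad_fn, devices)
print(theta)                                     # approaches the N_i-weighted mean
```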
Step 7: the local device gives feedback according to the recognition result.
7.1 The local device verifies if there are updated global model parameters:
if yes, the verification is passed, and step 7.2) is executed;
if not, the verification is not passed, and the step eight is executed;
7.2) Match the instruction or interaction result obtained from voice recognition against the preset items: for each preset item o_i, compute its weighted, normalized similarity to all processed keywords w_j, and select the preset item with the largest similarity sum as the matching result:

o* = argmax_{o_i ∈ O} Σ_j w_j · sim(o_i, w_j),

where O is the set of preset instructions or interaction results, w_j is the weight of the j-th keyword obtained after processing, and sim(o_i, w_j) is the similarity between o_i and w_j;
7.3 According to the matching result, generating corresponding feedback content and outputting the feedback content to a driver, wherein the feedback content comprises sound generated by a voice synthesis technology, and image or text information on a vehicle-mounted display screen.
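The weighted matching of step 7.2) can be sketched as follows; the character-overlap similarity is an illustrative stand-in for whatever similarity measure an implementation would actually use.

```python
def match(presets, keywords, weights, sim):
    # o* = argmax over preset items of the weighted, normalised similarity
    # sum_j w_j * sim(o, w_j) / sum_j w_j to the processed keywords.
    total_w = sum(weights)
    score = lambda o: sum(w * sim(o, kw) for kw, w in zip(keywords, weights)) / total_w
    return max(presets, key=score)

# character-overlap Jaccard similarity as an illustrative choice
overlap = lambda a, b: len(set(a) & set(b)) / len(set(a) | set(b))
cmd = match(["open window", "close window", "play music"],
            ["open", "window"], [0.6, 0.4], overlap)
print(cmd)                                       # -> "open window"
```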
Step 8: update the speech recognition model.
In order to ensure that each vehicle can use the latest voice recognition model, the encryption OTA technology is adopted to issue new model parameters to each vehicle so as to realize automatic updating.
In this example, the new model parameters are stored on the cloud server, and an update notification containing a download link for the new parameters and installation instructions is sent to each vehicle. When a vehicle receives the notification, it accesses the download link to obtain the latest version of the model parameters; once the download completes, the OTA program on the vehicle automatically installs the new parameters locally, making them available immediately and completing the update of the speech recognition model.
The above description is only one specific example of the invention and does not constitute any limitation of the invention. It will be apparent to those skilled in the art that various modifications and changes in form and detail may be made without departing from the principles and structure of the invention; such modifications and changes based on the idea of the invention remain within the scope of the claims of the invention.

Claims (10)

1. A safe and efficient vehicle-mounted voice recognition system based on federated learning, characterized by comprising:
a voice recognition module (1) for recognizing the driver's voice signal recorded by the vehicle-mounted microphone as specific instruction data, which adopts a deep convolutional neural network with connectionist temporal classification as the acoustic model and a maximum-entropy hidden Markov model as the language model to realize voice recognition;
a data encryption module (2) for encrypting and decrypting the instruction data transmitted between the client and the central end with an encryption algorithm based on homomorphic encryption under the federated learning framework, ensuring the security of the data;
and a distributed computing module (3) for coordinating model training between the in-vehicle devices and the cloud server to obtain updated voice recognition parameters.
2. The system according to claim 1, wherein the acoustic model in the speech recognition module (1) comprises a deep convolutional neural network and connectionist temporal classification;
the deep convolutional neural network comprises several convolution layers and pooling layers, the output feature map of each convolution layer corresponding to the output feature map of one pooling layer; the convolution layers extract the spatial features of the raw input data through convolution operations, and the pooling layers reduce the feature-map size through downsampling while retaining the important feature information, so as to extract the feature representation of the input voice;
the connectionist temporal classification fuses the temporal information with the feature representation, handles the misalignment between the time-series data and the labels, learns the feature information extracted by the convolution and pooling layers, and uses the learned features to classify or recognize the voice signal, producing the final voice recognition result.
3. The system according to claim 1, wherein: the speech recognition module (1) uses a maximum entropy hidden Markov model as a language model, uses a characteristic function to represent different language characteristics, captures context information in a language sequence to accurately represent the characteristics and the distribution of the language, and learns parameter distribution suitable for a specific task according to constraint conditions.
4. The system according to claim 1, wherein: the data encryption module (2) uses an encryption algorithm based on homomorphic encryption to encrypt and decrypt the instruction data transmitted between the client and the central terminal, and the realization is as follows:
the client encrypts instruction data to be transmitted by using a public key, and the encrypted data is transmitted to the central end through a secure communication channel;
after the central terminal receives the encrypted instruction data, the central terminal executes instruction identification operation on the encrypted instruction data, and then returns an encryption result to the client terminal through a safe communication channel;
and the client decrypts the received encryption result by using the private key to obtain a specific instruction result sent by the driver.
5. The system according to claim 1, wherein: the distributed computing module (3) coordinates model training between in-vehicle equipment and a cloud server, and is realized as follows:
selecting a group of local devices to participate in federated learning, dividing the local data set of each device into several mini-batches, loading the parameters of the global model on each local device, and obtaining the parameter gradient;
all the parameter gradients are sent to a central server to be aggregated into a global gradient, the parameters of the global model are updated according to the global gradient,
the updated global model parameters are sent back to each local device,
repeating the above process until the global model converges or reaches a preset training round number, and obtaining the trained global model parameters.
6. A safe and efficient vehicle-mounted voice recognition method based on federated learning, comprising the following steps:
1) Segmentation pre-processing of voice data:
2) Extracting power spectrum characteristics of the pre-processed voice signal by using a Mel Frequency Cepstrum Coefficient (MFCC) voice characteristic extraction method, and extracting frequency domain characteristics of the voice signal by simulating the characteristics of a human ear auditory system;
3) Voice data noise processing:
carrying out noise removal on the vehicle-mounted noise with a spectral-subtraction speech enhancement algorithm: decomposing the original signal, removing the noise component, and reconstructing the amplitude |Ŝ_k| of the enhanced speech signal:

|Ŝ_k| = √( |X_k|² − |N_k|² ),

where |Ŝ_k| is the amplitude of the enhanced speech signal, |X_k|² is the power spectrum of the noisy speech after the Fourier transform, and |N_k|² is that of the additive Gaussian noise after the Fourier transform; the amplitude |Ŝ_k| is then subjected to phase processing and an inverse Fourier transform in turn, the spectrum being reconstructed from the phase information, to obtain the noise-reduced speech signal;
4) Establishing the deep convolutional neural network CTC-CNN based on connectionist temporal classification;
4a) Establishing an input layer with a two-dimensional matrix of speech signals as input:
4b) Selecting the size and the number of convolution kernels according to the size and the feature number of input data of an input layer, performing one-dimensional convolution operation on the input data and the convolution kernels, and calculating the features in the input data learned by a unit neuron through each position of the convolution kernels when the convolution kernels slide on the input data to construct the convolution layer;
4c) Respectively establishing a global pooling layer and a full connection layer;
4d) Sequentially connecting an input layer, a convolution layer, a global pooling layer and a full connection layer to form a deep convolution neural network CNN;
4e) Establishing the loss function of connectionist temporal classification (CTC) and using it as the loss function L(S) of the CNN network, forming the connectionist-temporal-classification deep convolutional neural network CTC-CNN, where:

L(S) = −Σ_{(x,z)∈S} ln p(z|x),

where p(z|x) denotes the probability that, given the input x, z is the output sequence, and S is the data set;
5) Encrypting and decrypting the voice data:
5a) The encryption function E is defined as follows:

c_d = m_d + F_k(i||j||d) − F_k(i||(j+1)||d),

where i and j are the indices of the federated round and of the client associated with the encryption, d is the serial number of the data item, '||' denotes the concatenation operation, (i||j||d) denotes the concatenation of identifiers i, j, d and (i||(j+1)||d) that of identifiers i, (j+1), d, F_k maps an identifier into the data set Z_n, and m_d denotes the plaintext before encryption;
5b) Substituting the message m and the secret key k into the encryption function yields the triple E_k(m):

E_k(m) = (c, i, j),

where c in the triple denotes the encrypted data;
5c) The decryption function D is defined as follows:

m_d = c_d + Σ ( F_k(i||(j+1)||d) − F_k(i||j||d) ),

where c_d is the ciphertext data with serial number d after encryption;
5d) Substituting the encrypted voice data c into the decryption function D recovers the data m before encryption:

m = D_k((c, i, S));
6) Aggregating the local model to a central server:
6a) Randomly selecting a group of local devices to participate in federal learning, and dividing a local data set of each device into a plurality of small batch data;
initializing the parameters θ of the global model; for each local device i ∈ {1, 2, …, n}, first loading the parameters θ of the global model on the local device, then calculating the loss function with the mini-batch data on the local data set, and obtaining the parameter gradient g_i;
6b) Sending the parameter gradients g_i of all local devices to the central server; the central server aggregates the parameter gradients of all local devices into the global gradient g:

g = Σ_{i=1}^{n} (N_i / N) · g_i,

where N_i is the data-set size of local device i and N is the total data-set size of all local devices;
6c) The central server updates the parameters θ of the global model using the global gradient g:

θ ← θ − η · g,

where η is the learning rate;
6d) Sending the updated global model parameters θ back to each local device;
6e) Repeating steps 6 a) through 6 d) until the global model converges or a predetermined number of training rounds is reached.
7) And feeding back according to the identification result:
comparing the local model parameters with cloud server model parameters to determine whether model parameter updating is needed:
if the parameters are inconsistent, acquiring the latest model parameters from the cloud server by using an encrypted OTA technology to update the local model parameters;
otherwise, the result of the voice recognition is fed back to the driver.
7. The method of claim 6, wherein step 1) the segmentation pre-processing of the voice data is performed as follows:
the vehicle voice recognition system records a voice signal of a driver through a vehicle-mounted microphone, wherein the signal comprises sound, noise, echo and other interference;
the recorded speech signal is partitioned into individual speech units or frames using an endpoint detection algorithm.
8. The method of claim 6, wherein the step 2) of extracting the power spectrum features of the pre-processed voice signal by using Mel-frequency cepstrum coefficient MFCC voice feature extraction method is implemented as follows:
2a) The conversion between the Mel frequency and the Hz frequency is performed with:

M(f) = 2595 · log₁₀(1 + f/700),

where M(f) is the perceived (Mel) frequency and f is the actual frequency;
2b) Performing short-time Fourier transform on the converted voice signal, and converting the voice signal from a time domain to a frequency domain to obtain an energy spectrum of the voice signal;
2c) Filtering the energy spectrum with a Mel filter bank to obtain the energy value of each Mel band:

E(m) = Σ_k H_m(k) · |X(k)|²,

where m is the number of the Mel filter, k is a frequency bin in the frequency domain, H_m(k) is the response of the Mel filter at that bin, |X(k)|² is the energy spectrum, and f(m) is the centre frequency of the m-th Mel filter.
9. The method of claim 6, wherein step 4 c) establishes a global pooling layer and a full connection layer, respectively, as follows:
establishing a global pooling layer: and (3) averaging the characteristics of the convolution layer output by using an averaging pooling method to obtain a single scalar output, and finally outputting a one-dimensional vector as the input of the next layer.
Establishing a full connection layer: connecting all neurons of the previous layer with all neurons of the current layer, and mapping input features to output labels; after the weighted input is calculated, a sigmoid activation function is applied, and finally a vector is output to represent the probability distribution of each label.
10. The method of claim 6, wherein step 7) performs feedback according to the recognition result, implemented as follows:
firstly, according to the voice recognition result, matching the processed instruction or interaction result against the preset instructions or interaction results according to the following formula:

c* = argmax_{c_i ∈ C} Σ_j w_j · sim(c_i, w_j),

where C is the set of preset instructions or interaction results, w_j is the weight of the j-th keyword obtained after processing, and sim(c_i, w_j) is the similarity between c_i and w_j;
and then, according to the matching result, generating corresponding feedback and outputting the feedback to a driver.
CN202310632039.4A 2023-05-31 2023-05-31 Safe and efficient vehicle-mounted voice recognition method and system based on federal learning Pending CN116564339A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310632039.4A CN116564339A (en) 2023-05-31 2023-05-31 Safe and efficient vehicle-mounted voice recognition method and system based on federal learning

Publications (1)

Publication Number Publication Date
CN116564339A true CN116564339A (en) 2023-08-08

Family

ID=87500050

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310632039.4A Pending CN116564339A (en) 2023-05-31 2023-05-31 Safe and efficient vehicle-mounted voice recognition method and system based on federal learning

Country Status (1)

Country Link
CN (1) CN116564339A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117118592A (en) * 2023-10-25 2023-11-24 北京航空航天大学 Method and system for selecting Internet of vehicles client based on homomorphic encryption algorithm
CN117118592B (en) * 2023-10-25 2024-01-09 北京航空航天大学 Method and system for selecting Internet of vehicles client based on homomorphic encryption algorithm
CN117485265A (en) * 2024-01-02 2024-02-02 珠海格力电器股份有限公司 Vehicle-mounted terminal control method and device and vehicle-mounted terminal
CN117485265B (en) * 2024-01-02 2024-05-24 珠海格力电器股份有限公司 Vehicle-mounted terminal control method and device and vehicle-mounted terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination