CN114004996A - Abnormal sound detection method, abnormal sound detection device, electronic equipment and medium - Google Patents

Abnormal sound detection method, abnormal sound detection device, electronic equipment and medium Download PDF

Info

Publication number
CN114004996A
CN114004996A (application CN202111271257.7A)
Authority
CN
China
Prior art keywords
audio data
abnormal sound
detected
sound detection
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111271257.7A
Other languages
Chinese (zh)
Inventor
刘建林
解鑫
许铭
刘颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111271257.7A priority Critical patent/CN114004996A/en
Publication of CN114004996A publication Critical patent/CN114004996A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01H: MEASUREMENT OF MECHANICAL VIBRATIONS OR ULTRASONIC, SONIC OR INFRASONIC WAVES
    • G01H 17/00: Measuring mechanical vibrations or ultrasonic, sonic or infrasonic waves, not provided for in the preceding groups
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The present disclosure provides an abnormal sound detection method and apparatus, an electronic device and a medium, and relates in particular to artificial intelligence fields such as speech technology, deep learning, intelligent manufacturing and industrial quality inspection. The specific implementation scheme is as follows: converting the audio data to be detected into a time-frequency diagram; slicing the time-frequency diagram to obtain a graph sequence comprising a plurality of image blocks; extracting features of the graph sequence to obtain a feature vector corresponding to the graph sequence; and determining the abnormal sound detection result of the audio data to be detected by identifying abnormal sound features in the feature vector. The detection result obtained in this way is more objective.

Description

Abnormal sound detection method, abnormal sound detection device, electronic equipment and medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular to an abnormal sound detection method and apparatus, an electronic device, and a medium.
Background
At present, in the delivery stage of car seat motors, some motors may have quality problems and need quality inspection. Generally, the sound generated when a motor runs is collected, and a listener judges whether the collected sound contains abnormal noise to determine whether the quality of the seat motor is problematic.
Disclosure of Invention
The disclosure provides a method and a device for detecting abnormal sound, electronic equipment and a medium.
According to a first aspect of the present disclosure, there is provided an abnormal sound detection method, including:
converting the audio data to be detected into a time-frequency diagram;
slicing the time-frequency graph to obtain a graph sequence comprising a plurality of image blocks;
extracting features of the graph sequence to obtain a feature vector corresponding to the graph sequence;
and determining the abnormal sound detection result of the audio data to be detected by identifying the abnormal sound characteristics in the characteristic vector.
According to a second aspect of the present disclosure, there is provided an abnormal sound detection apparatus including:
the conversion module is used for converting the audio data to be detected into a time-frequency diagram;
the slicing module is used for slicing the time frequency graph to obtain a graph sequence comprising a plurality of image blocks;
the extraction module is used for extracting the features of the graph sequence to obtain a feature vector corresponding to the graph sequence;
and the determining module is used for determining the abnormal sound detection result of the audio data to be detected by identifying the abnormal sound characteristics in the characteristic vector.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the first aspect.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the first aspect.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is an exemplary diagram of an abnormal sound in audio data provided according to an embodiment of the present disclosure;
fig. 2 is a flowchart of an abnormal sound detection method provided in an embodiment of the present disclosure;
fig. 3 is a flowchart of another abnormal sound detection method provided by the embodiment of the present disclosure;
fig. 4 is a flowchart of another abnormal sound detection method provided by the embodiment of the present disclosure;
fig. 5 is an exemplary diagram of a time-frequency diagram provided by the embodiment of the disclosure;
fig. 6 is a flowchart of another abnormal sound detection method provided by the embodiment of the present disclosure;
fig. 7 is a flowchart of another abnormal sound detection method provided by the embodiment of the present disclosure;
fig. 8 is an exemplary diagram of a method for constructing a discriminant network according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of an abnormal sound detection apparatus provided in the embodiment of the present disclosure;
fig. 10 is a block diagram of an electronic device for implementing the abnormal sound detection method according to the embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of the personal information of the users involved all comply with relevant laws and regulations and do not violate public order and good morals.
In the related art, the quality of car seat motors needs to be inspected at the delivery stage. A motor with a quality problem usually produces an abnormal noise when running, so the current inspection approach is mainly to record the sound of the running motor and have an experienced listener in the factory listen to the recording; if the listener hears an abnormal noise, the seat motor is judged to have a quality problem. However, in actual inspection, the abnormal noise in the recording may exist only for a brief moment and be inconspicuous, for example the part inside the rectangular box in fig. 1. A listener is likely to miss such a brief, inconspicuous sound, leading to misjudgment, so that a motor with a quality problem leaves the factory and the production yield decreases.
In addition, different listeners apply different standards when judging whether an abnormal sound exists; sometimes different listeners reach different conclusions about the same sound, so the evaluation criteria for motor quality are not uniform and the yield tends to fluctuate. Moreover, a single production line of a large factory may produce thousands of motors in one day; if quality is judged by manual listening, a large amount of labor cost is consumed.
In order to avoid the influence of subjective factors on the evaluation of the motor sound, the embodiments of the present disclosure provide an abnormal sound detection method, an abnormal sound detection device, an electronic device, and a storage medium.
The abnormal sound detection method provided by the embodiment of the disclosure can be executed by an electronic device, and the electronic device can be a smart phone, a tablet computer, a desktop computer, a server and other devices.
The embodiment of the present disclosure provides an abnormal sound detection method, as shown in fig. 2, the method includes:
s201, converting the audio data to be detected into a time-frequency diagram.
The audio data to be detected in the embodiment of the present disclosure may be sound generated when the motor operates, or may also be sound generated when other devices besides the motor operate. The motor may be a car seat motor or other motor. In the embodiments of the present disclosure, a motor of a vehicle seat is taken as an example for explanation.
The time-frequency diagram represents the energy of the audio data to be detected in the time domain and the frequency domain.
S202, slicing the time-frequency map to obtain a map sequence comprising a plurality of image blocks.
The number of the image blocks can be set according to actual requirements. The time-frequency graph may be uniformly sliced by a preset number into a graph sequence including a plurality of tiles.
For example, a time-frequency graph may be uniformly divided into 8 tiles, and the 8 tiles are a graph sequence.
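For illustration only (not part of the claimed method), the uniform slicing step can be sketched in Python; the choice of 8 tiles and the split along the time axis are assumptions made for this example:

```python
import numpy as np

def slice_time_frequency_map(tf_map: np.ndarray, num_tiles: int = 8) -> list:
    """Uniformly slice a time-frequency map of shape [n_mels, n_frames]
    into `num_tiles` tiles along the time axis (the graph sequence)."""
    return np.array_split(tf_map, num_tiles, axis=1)

# Example: a 64-band map with 304 frames becomes a sequence of 8 tiles of 38 frames each
tiles = slice_time_frequency_map(np.random.rand(64, 304), num_tiles=8)
print(len(tiles), tiles[0].shape)  # 8 (64, 38)
```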
And S203, extracting the features of the graph sequence to obtain a feature vector corresponding to the graph sequence.
And S204, determining the abnormal sound detection result of the audio data to be detected by identifying the abnormal sound characteristics in the characteristic vector.
If the feature vector includes abnormal sound features, the abnormal sound detection result is abnormal, indicating that the audio to be detected contains an abnormal sound, that is, the motor corresponding to the audio to be detected has a fault.
If the feature vector does not include abnormal sound features, the abnormal sound detection result is normal, indicating that the audio to be detected does not contain an abnormal sound, that is, the motor corresponding to the audio to be detected has no fault.
By adopting this technical solution, the audio data to be detected can be processed into a feature vector, and the abnormal sound detection result of the audio data to be detected is determined by identifying abnormal sound features in that feature vector. Recognition of the audio data to be detected is thus automated, no experienced listener is required, and the abnormal sound detection result is objective.
In another embodiment of the present disclosure, before executing the process of fig. 2, the audio data to be detected is acquired. As shown in fig. 3, before S201, the method further includes:
s301, collecting audio data.
The electronic equipment can collect audio data generated when the equipment to be tested operates.
In one embodiment, the device to be tested is a car seat motor, then this step can be implemented as: and collecting audio data generated when the motor of the automobile seat operates.
In a scene of factory detection of the motor of the automobile seat, audio data generated when each produced motor is in test operation can be collected.
S302, processing the collected audio data into a preset length to obtain audio data to be detected.
When audio data is collected, the lengths of the audio clips collected for different motors are not exactly the same, owing to factors such as the production cycle of each motor and the test duration; therefore all collected audio data can be processed into a uniform preset length.
For example, the preset length may be 6 seconds.
For audio data having a duration greater than a preset length, a portion of the audio data may be deleted such that the remaining audio data is equal to the preset length.
A part of the audio data may be deleted at random, or a part of each of both ends of the audio data may be deleted.
For example, if the preset length is 6 seconds and the duration of the acquired audio data is 7 seconds, the first 0.5 seconds and the last 0.5 seconds of the audio data are deleted. Because the motor runs in complete revolutions, if the motor has a fault, the sound generated over one complete revolution contains the abnormal sound. In the collected audio data, the initial and final parts may not cover a complete revolution of the motor, so they are deleted preferentially to improve the detection effectiveness.
Audio data whose duration is less than the preset length can be padded to the preset length. Optionally, a portion may be copied from the audio data itself and filled into the padding position.
For example, if the preset length is 6 seconds and the duration of the acquired audio data is 5 seconds, one second of those 5 seconds may be copied and filled in before or after the acquired audio data, or the first 0.5 seconds of the copied second may be filled in before the acquired audio data and the last 0.5 seconds filled in after it.
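A minimal sketch of this length normalization, assuming a sampling rate of 12,800 Hz and a 6-second preset length (values mentioned later in this description); trimming both ends symmetrically and padding with a segment copied from the clip itself follow the strategies described above, but the exact split is an illustrative choice:

```python
import numpy as np

def normalize_length(audio: np.ndarray, sr: int = 12800, preset_seconds: float = 6.0) -> np.ndarray:
    """Trim or pad one-dimensional audio so its duration equals the preset length."""
    target = int(sr * preset_seconds)
    if len(audio) > target:
        # Delete equal portions from the beginning and the end, since those
        # parts may not cover a complete revolution of the motor.
        cut = (len(audio) - target) // 2
        return audio[cut:cut + target]
    if len(audio) < target:
        # Copy a segment from the clip itself and append it as padding
        # (assumes the shortfall is smaller than the clip).
        pad = target - len(audio)
        return np.concatenate([audio, audio[:pad]])
    return audio
```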
By adopting this method, the collected audio data is processed into a preset length, which facilitates subsequent computation on audio data of uniform length, makes the test standards of all the audio data to be detected more consistent, and avoids quality disputes.
When the audio data to be detected is collected from car seat motors, the test standards of the motors become more consistent, disputes about motor quality are avoided, and the yield of the produced motors is prevented from fluctuating.
In another embodiment of the present disclosure, as shown in fig. 4, on the basis of the above embodiment, S201 may specifically be implemented as:
and S2011, performing Fourier transform on the audio data to be detected.
In the embodiment of the present disclosure, the audio data to be detected may be processed through a short-time Fourier transform (STFT); optionally, a Hanning window may be used, with a window length of 512 and a step (hop) size of 256.
Through the Fourier transform, the audio data to be detected can be converted from an audio signal into a time-frequency diagram.
S2012, filtering the Fourier transform result to obtain a time-frequency diagram corresponding to the audio data to be detected.
In order to filter redundant information in the time-frequency diagram obtained by the fourier transform, the time-frequency diagram obtained by the fourier transform may be filtered by a filter.
Optionally, the filtering may be performed using a MEL filter bank, and the MEL filter bank may include 64 MEL filters.
As an example, a time-frequency diagram obtained by filtering through the MEL filter bank is shown in fig. 5; the lines in fig. 5 represent the sound features in the audio data to be detected.
By adopting this method, applying the Fourier transform to the audio data to be detected yields a time-frequency diagram convenient for subsequent feature extraction, and filtering the Fourier transform result removes the influence of redundant information on the subsequent detection result, improving the accuracy of abnormal sound detection.
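These two steps map directly onto standard audio tooling. A sketch using the librosa library is given below, with the Hanning window, 512-point window length, 256-sample hop and 64 MEL filters taken from the description; librosa itself and the final log scaling are assumptions, not requirements of the disclosure:

```python
import numpy as np
import librosa

def to_time_frequency_map(audio: np.ndarray, sr: int = 12800) -> np.ndarray:
    """Short-time Fourier transform followed by MEL filtering."""
    # STFT with a Hanning window, window length 512, step (hop) size 256
    power_spec = np.abs(librosa.stft(audio, n_fft=512, hop_length=256, window="hann")) ** 2
    # Filter the Fourier transform result with a bank of 64 MEL filters
    mel_fb = librosa.filters.mel(sr=sr, n_fft=512, n_mels=64)   # shape [64, 257]
    mel_spec = mel_fb @ power_spec                              # shape [64, n_frames]
    # Log scaling (assumed) to compress the dynamic range before feature extraction
    return librosa.power_to_db(mel_spec)
```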
In another embodiment of the present disclosure, as shown in fig. 6, on the basis of any one of the above embodiments, S203 may specifically be implemented as:
s2031, extracting the image block characteristics and the position characteristics of each image block included in the image sequence.
The electronic device may perform feature embedding on each tile included in the graph sequence to obtain the tile feature corresponding to each tile, and perform position embedding on each tile included in the graph sequence to obtain the position feature corresponding to each tile.
Feature embedding converts the data represented by each tile into a fixed-size numeric vector to facilitate subsequent computation. Position embedding extracts the position feature of each tile within the graph sequence, which is also a numeric vector.
The feature embedding and the position embedding can be performed through a feature embedding network and a position embedding network, and the output sizes of the feature embedding network and the position embedding network used in the embodiment of the disclosure are completely the same, that is, the dimensions of the tile feature and the position feature corresponding to each tile are completely the same.
S2032, splicing the image block characteristics and the position characteristics of all the image blocks included in the image sequence to obtain the characteristic vector corresponding to the image sequence.
In the embodiment of the disclosure, the feature vector obtained by splicing tile features and position features reflects both the audio feature represented by each tile and which segment of the audio to be detected each tile represents, so the position of the abnormal sound segment can be located by computing on the feature vector.
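A sketch of the feature-embedding and position-embedding step in PyTorch. The disclosure only requires that the two embeddings have identical dimensions and that tile features and position features are spliced into the feature vector; the linear projection, the learned position table and the embedding width of 128 are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class TileEmbedding(nn.Module):
    """Feature embedding + position embedding for a graph sequence of tiles."""
    def __init__(self, tile_dim: int, num_tiles: int = 8, embed_dim: int = 128):
        super().__init__()
        self.feature_embed = nn.Linear(tile_dim, embed_dim)       # feature embedding network
        self.position_embed = nn.Embedding(num_tiles, embed_dim)  # position embedding network
        self.num_tiles = num_tiles

    def forward(self, tiles: torch.Tensor) -> torch.Tensor:
        # tiles: [batch, num_tiles, tile_dim], each tile flattened to a vector
        feat = self.feature_embed(tiles)                          # [batch, num_tiles, embed_dim]
        positions = torch.arange(self.num_tiles, device=tiles.device)
        pos = self.position_embed(positions).unsqueeze(0).expand_as(feat)
        # Splice (concatenate) the tile feature and the position feature of each tile
        return torch.cat([feat, pos], dim=-1)                     # [batch, num_tiles, 2 * embed_dim]
```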
After obtaining the feature vector of the graph sequence, in another embodiment of the present disclosure, as shown in fig. 7, based on any of the above embodiments, S204 may be specifically implemented as:
s2041, inputting the feature vector into a pre-trained discrimination network, so that the discrimination network recognizes abnormal sound features in the feature vector to obtain an abnormal sound detection result.
The discrimination network comprises a plurality of coding modules and a result mapping module connected in sequence, and each coding module comprises a multi-head attention layer and a feedforward network layer. The discrimination network in the embodiments of the present disclosure is a neural network supporting serialized input and may include, as an example, 6 coding modules. The discrimination network can be a Transformer network, and the coding module can be an encoder module of the Transformer. Because the audio data to be detected is time-dependent and can be regarded as a time sequence, abnormal sound detection can be performed accurately through a Transformer network that supports serialized input.
The multi-head attention layer computes an attention vector based on the received feature vectors. Since the feature vector is obtained by splicing tile features and position features, it preserves the order of the tiles in the graph sequence. When computing the attention vector, the multi-head attention layer locates the positions in the audio data to be processed that receive high attention weights, and thereby captures the feature that most influences the judgment result, namely the abnormal sound feature. In other words, in the embodiment of the present disclosure, the electronic device can capture the features of the abnormal sound segment in the audio data to be processed through the multi-head attention layers in the multiple coding modules. The attention vector can reflect whether the feature vector of the audio to be processed includes abnormal sound features.
The attention vectors calculated by the multi-head attention layer are input into the feedforward network layer.
And the feedforward network layer is used for performing feature calculation based on the attention vector to obtain a calculation result, and the calculation result is also a feature vector.
The feature vector output by the feedforward network layer is input into the multi-head attention layer of the next coding module, and the calculation result output by the last coding module is input into the result mapping layer.
And the result mapping layer is used for converting the calculation result output by the last coding module into the abnormal sound detection result.
The calculation result output by the feedforward network layer is a high-dimensional vector, and the result mapping layer can map the high-dimensional vector into a [1 x 2] dimensional vector and output the vector. The vector of the [1 × 2] dimension is an abnormal sound detection result, and the abnormal sound detection result includes the probability that the audio data to be detected is normal and the probability of abnormality.
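A sketch of such a discrimination network in PyTorch: 6 coding modules, each containing a multi-head attention layer and a feedforward network layer, followed by a result mapping layer that outputs a [1 x 2] vector of normal/abnormal probabilities. The head count, model width, feedforward size and the mean pooling over the tile sequence are illustrative assumptions; the model width matches the 2 x 128 spliced embedding of the previous sketch:

```python
import torch
import torch.nn as nn

class DiscriminationNetwork(nn.Module):
    """Transformer-encoder based discrimination network (sketch)."""
    def __init__(self, d_model: int = 256, nhead: int = 4, num_modules: int = 6):
        super().__init__()
        coding_module = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead,
            dim_feedforward=4 * d_model,   # feedforward network layer width (assumed)
            batch_first=True)
        self.coding_modules = nn.TransformerEncoder(coding_module, num_layers=num_modules)
        self.result_mapping = nn.Linear(d_model, 2)   # maps to [normal, abnormal]

    def forward(self, feature_vectors: torch.Tensor) -> torch.Tensor:
        # feature_vectors: [batch, num_tiles, d_model] spliced tile/position features
        encoded = self.coding_modules(feature_vectors)
        pooled = encoded.mean(dim=1)                  # pool over the tile sequence (assumed)
        logits = self.result_mapping(pooled)
        return torch.softmax(logits, dim=-1)          # probabilities: [normal, abnormal]
```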
S2042, obtaining the abnormal sound detection result output by the judgment network.
By adopting this method, the feature vector is computed by the pre-trained discrimination network to obtain the abnormal sound detection result, so batches of audio data to be detected can be inspected without manual listening, greatly reducing labor cost.
In addition, motors that need factory inspection can be inspected with this method; judging through the discrimination network excludes subjective factors, unifies the inspection standards, avoids disputes, and keeps the production yield from fluctuating widely.
Furthermore, when a listening master listens with his own ears to the sound produced by a motor, a brief and barely perceptible abnormal noise is easily missed. The multi-head attention layers and feedforward network layers included in the discrimination network can accurately identify abnormal sound features and avoid misjudgment; the detection accuracy is high, faulty motors can be found in time, and the shipment of defective products is reduced.
Optionally, the discriminant Network of the embodiment of the present disclosure may also be a Long Short-Term Memory Network (LSTM) or a Recurrent Neural Network (RNN).
Before the method flow of the above embodiment is executed, the discriminant network needs to be trained.
First, a plurality of pieces of original audio data are collected. Because the data fed into the discrimination network must have a uniform size, each piece of original audio data needs to be processed to a preset length to obtain sample audio data, that is, the original audio data are all processed to the same length. For the specific processing method, refer to the description of S302 above.
At present, when a car seat motor is inspected, the length of the collected audio is about 6 seconds, so each piece of original audio data can be processed into 6 seconds; with an audio sampling rate of 12,800 Hz, the processed audio data has dimension [1 x 76800].
In addition, a listening master is asked to judge whether each piece of sample audio data contains an abnormal sound, so as to label each piece of sample audio data. If the sample audio data contains an abnormal sound, it is labeled abnormal, i.e. it is a negative sample. If the sample audio data does not contain an abnormal sound, it is labeled normal, i.e. it is a positive sample. For the subsequent training process, the label is converted into one-hot form.
After the sample audio data is obtained, feature processing needs to be performed on the sample audio data.
The characteristic processing process comprises the following steps: and carrying out short-time Fourier transform and filtering processing on each sample audio data to obtain a time-frequency graph corresponding to each sample audio data.
When the short-time Fourier transform is applied to the sample audio data, the sample audio data is first framed, which is equivalent to turning it into a string of continuous time-dependent features; frequency-domain analysis of the framed data then yields a time-frequency diagram containing high-dimensional features.
Because the labels of the sample audio data are annotated manually and are therefore influenced by the auditory characteristics of the human ear, some features in the sample audio data are redundant information that the human ear cannot perceive. Such redundant information also appears in the time-frequency diagram obtained by the Fourier transform; filtering that time-frequency diagram removes the redundant data and prevents it from interfering with the detection results of the subsequent discrimination network.
After the feature processing process is completed, training of the discrimination network may be performed.
Since the format of the sample audio data is time-dependent and can be regarded as a time sequence, the discrimination network in the embodiment of the present disclosure can be constructed using the encoder structure of the Transformer network from the natural language processing (NLP) field.
As shown in fig. 8, the filtered time-frequency diagram of each piece of sample audio data may be sliced; taking the slicing into 8 tiles in fig. 8 as an example, a graph sequence containing 8 tiles is obtained.
Then, feature embedding and position embedding are applied to each tile, and the results of the feature embedding and the position embedding are concatenated (spliced) to obtain the feature vector corresponding to the sample audio data.
The feature vector corresponding to the sample audio data is then input into the encoder module of the Transformer network, and result mapping is performed on the high-dimensional vector output by the encoder module to obtain a [1 x 2]-dimensional vector. This vector is the discrimination result and contains the normal probability and the abnormal probability of the sample audio data. For example, if the normal probability is 1% and the abnormal probability is 99%, the discrimination network has identified the sample audio data as abnormal audio.
Further, a loss function value may be calculated based on the discrimination result and the label corresponding to the sample audio data, and the parameters of the discrimination network may be adjusted based on the loss function value, thereby completing the training of the discrimination network.
Specifically, the loss function in the embodiment of the present disclosure may be the cross-entropy loss $L_{cross}(Y, P)$. $L_{cross}(Y, P)$ indicates how close the discrimination result output by the discrimination network is to the expected result (the label); the smaller the value of $L_{cross}(Y, P)$, the more accurate the output of the discrimination network. The neural network parameters of the discrimination network can be adjusted by minimizing $L_{cross}(Y, P)$ with stochastic gradient descent.
Wherein
$$L_{cross}(Y, P) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{2} y_{i,k}\,\log p_{i,k}$$
where $y_{i,k}$ denotes the one-hot label of sample audio data $i$, and $p_{i,k}$ denotes the discrimination result output by the discrimination network, i.e. the [1 x 2]-dimensional vector. $L_{cross}(Y, P)$ is computed from $y_{i,k}$ and $p_{i,k}$, and the parameters of the discrimination network are adjusted according to the value of $L_{cross}(Y, P)$ until $L_{cross}(Y, P)$ meets the training target; the network obtained by training is then used as the discrimination network for abnormal sound detection.
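A training-step sketch matching the loss above; the one-hot labels and stochastic gradient descent come from the description, while the batch assembly, learning rate and the small epsilon for numerical stability are assumptions:

```python
import torch
import torch.nn.functional as F

def training_step(network, optimizer, feature_vectors, labels):
    """One gradient descent step minimizing the cross-entropy loss L_cross(Y, P).

    feature_vectors: [batch, num_tiles, d_model] spliced tile/position features
    labels:          [batch] long tensor, 0 = normal (positive sample), 1 = abnormal (negative sample)
    """
    probs = network(feature_vectors)                      # p_{i,k}, shape [batch, 2]
    y_onehot = F.one_hot(labels, num_classes=2).float()   # y_{i,k}, one-hot labels
    loss = -(y_onehot * torch.log(probs + 1e-12)).sum(dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Assumed setup: optimizer = torch.optim.SGD(network.parameters(), lr=1e-3)
```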
After the trained discrimination network is obtained, in the application stage, audio data of a motor can be collected and processed to the preset length to obtain the audio data to be detected; the Fourier transform and MEL filtering are then applied to the audio to be detected to obtain the time-frequency diagram corresponding to the audio data to be detected.
Then, referring to fig. 8, the time-frequency diagram corresponding to the audio data to be detected is sliced into a graph sequence, feature embedding and position embedding are applied to each tile in the graph sequence, and the resulting tile features and position features are spliced to obtain the feature vector of the audio data to be detected.
Then, the feature vector of the audio data to be detected is input into the Transformer encoder module of the discrimination network. The trained Transformer encoder module can accurately identify the abnormal sound features in the feature vector and output the recognition result as a high-dimensional vector. The result mapping module of the discrimination network then performs result mapping on the vector output by the Transformer encoder module to obtain a [1 x 2]-dimensional discrimination result.
The discrimination result contains the normal probability and the abnormal probability of the audio data to be detected. If the normal probability is greater than the abnormal probability, it is determined that the audio data to be detected contains no abnormal sound and the quality of the corresponding motor meets the factory standard. If the abnormal probability is greater than the normal probability, it is determined that the audio data to be detected contains an abnormal sound and the quality of the corresponding motor does not meet the delivery standard.
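An end-to-end inference sketch chaining the helper sketches above (the embedder must be constructed with a tile_dim matching the flattened tile size); the even division of frames into 8 tiles and the final comparison rule follow the description, everything else is an assumption:

```python
import numpy as np
import torch

def detect_abnormal_sound(raw_audio: np.ndarray, embedder, network, sr: int = 12800) -> str:
    """Classify one recording as 'normal' or 'abnormal' using the sketches above."""
    audio = normalize_length(raw_audio, sr=sr)                    # preset length (6 s)
    tf_map = to_time_frequency_map(audio, sr=sr)                  # STFT + MEL filtering
    n_frames = tf_map.shape[1] - tf_map.shape[1] % 8              # make the 8 tiles equal-sized
    tiles = slice_time_frequency_map(tf_map[:, :n_frames], num_tiles=8)
    flat = torch.tensor(np.stack([t.reshape(-1) for t in tiles]), dtype=torch.float32)
    with torch.no_grad():
        features = embedder(flat.unsqueeze(0))                    # [1, 8, d_model]
        normal_p, abnormal_p = network(features)[0].tolist()
    return "abnormal" if abnormal_p > normal_p else "normal"
```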
Corresponding to the above method embodiment, as shown in fig. 9, the present disclosure further provides an abnormal sound detection apparatus, including:
a conversion module 901, configured to convert the audio data to be detected into a time-frequency diagram;
a slicing module 902, configured to slice the time-frequency graph to obtain a graph sequence including multiple tiles;
an extraction module 903, configured to perform feature extraction on the graph sequence to obtain a feature vector corresponding to the graph sequence;
the determining module 904 is configured to determine an abnormal sound detection result of the audio data to be detected by identifying abnormal sound features in the feature vector.
In another embodiment of the present disclosure, the converting module 901 is further configured to:
carrying out Fourier transform on audio data to be detected;
and filtering the Fourier transform result to obtain a time-frequency graph corresponding to the audio data to be detected.
In another embodiment of the present disclosure, the extracting module 903 is further configured to:
extracting the image block characteristics and the position characteristics of each image block included in the image sequence;
and splicing the image block characteristics and the position characteristics of all image blocks included in the image sequence to obtain the characteristic vector corresponding to the image sequence.
In another embodiment of the present disclosure, the determining module 904 is further configured to:
inputting the feature vector into a pre-trained discrimination network so that the discrimination network identifies abnormal sound features in the feature vector to obtain an abnormal sound detection result;
and obtaining the abnormal sound detection result output by the judgment network.
In another embodiment of the disclosure, the discrimination network comprises a plurality of coding modules and a result mapping module which are connected in sequence, and each coding module comprises a feedforward network layer and a multi-head attention layer;
a multi-head attention layer for calculating an attention vector based on the received feature vectors;
the feedforward network layer is used for carrying out feature calculation based on the attention vector to obtain a calculation result;
and the result mapping layer is used for converting the calculation result output by the last coding module into an abnormal sound detection result.
In another embodiment of the present disclosure, the discrimination network is a neural network that supports serialized inputs.
In another embodiment of the present disclosure, the apparatus further comprises:
the acquisition module is used for acquiring audio data;
and the processing module is used for processing the acquired audio data into preset length to obtain the audio data to be detected.
In another embodiment of the present disclosure, the acquisition module is further configured to:
and collecting audio data generated when the motor of the automobile seat operates.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 10 illustrates a schematic block diagram of an example electronic device 1000 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the device 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a random access memory (RAM) 1003. The RAM 1003 can also store various programs and data necessary for the operation of the device 1000. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
A number of components in device 1000 are connected to I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and a communication unit 1009 such as a network card, a modem, a wireless communication transceiver, or the like. The communication unit 1009 allows the device 1000 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 1001 may be any of various general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 1001 executes the respective methods and processes described above, such as the abnormal sound detection method. For example, in some embodiments, the abnormal sound detection method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the abnormal sound detection method described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the abnormal sound detection method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (19)

1. An abnormal sound detection method comprises the following steps:
converting the audio data to be detected into a time-frequency diagram;
slicing the time-frequency graph to obtain a graph sequence comprising a plurality of image blocks;
extracting features of the graph sequence to obtain a feature vector corresponding to the graph sequence;
and determining the abnormal sound detection result of the audio data to be detected by identifying the abnormal sound characteristics in the characteristic vector.
2. The method of claim 1, wherein the converting the audio data to be detected into a time-frequency diagram comprises:
carrying out Fourier transform on the audio data to be detected;
and filtering the Fourier transform result to obtain a time-frequency graph corresponding to the audio data to be detected.
3. The method of claim 1, wherein extracting features of the graph sequence to obtain a feature vector corresponding to the graph sequence comprises:
extracting the image block characteristics and the position characteristics of each image block included in the image sequence;
and splicing the image block characteristics and the position characteristics of all image blocks included in the image sequence to obtain the characteristic vector corresponding to the image sequence.
4. The method according to any one of claims 1 to 3, wherein determining the abnormal sound detection result of the audio data to be detected by identifying the abnormal sound feature in the feature vector comprises:
inputting the feature vector into a pre-trained discrimination network so that the discrimination network identifies abnormal sound features in the feature vector to obtain an abnormal sound detection result;
and obtaining the abnormal sound detection result output by the discrimination network.
5. The method of claim 4, wherein the discriminant network comprises a plurality of encoding modules and a result mapping module connected in sequence, each encoding module comprising a feedforward network layer and a multi-head attention layer;
the multi-head attention layer is used for calculating an attention vector based on the received feature vectors;
the feedforward network layer is used for performing feature calculation based on the attention vector to obtain a calculation result;
and the result mapping layer is used for converting the calculation result output by the last coding module into the abnormal sound detection result.
6. The method of claim 4, wherein the discriminative network is a neural network that supports serialized inputs.
7. The method according to any of claims 1-3, before converting the audio data to be detected into a time-frequency diagram, the method further comprising:
collecting audio data;
and processing the collected audio data into a preset length to obtain the audio data to be detected.
8. The method of claim 7, wherein the capturing audio data comprises:
and collecting audio data generated when the motor of the automobile seat operates.
9. An abnormal sound detection device comprising:
the conversion module is used for converting the audio data to be detected into a time-frequency diagram;
the slicing module is used for slicing the time frequency graph to obtain a graph sequence comprising a plurality of image blocks;
the extraction module is used for extracting the features of the graph sequence to obtain a feature vector corresponding to the graph sequence;
and the determining module is used for determining the abnormal sound detection result of the audio data to be detected by identifying the abnormal sound characteristics in the characteristic vector.
10. The apparatus of claim 9, wherein the conversion module is further configured to:
carrying out Fourier transform on the audio data to be detected;
and filtering the Fourier transform result to obtain a time-frequency graph corresponding to the audio data to be detected.
11. The apparatus of claim 9, wherein the extraction module is further configured to:
extracting the image block characteristics and the position characteristics of each image block included in the image sequence;
and splicing the image block characteristics and the position characteristics of all image blocks included in the image sequence to obtain the characteristic vector corresponding to the image sequence.
12. The apparatus of any of claims 9-11, wherein the means for determining is further configured to:
inputting the feature vector into a pre-trained discrimination network so that the discrimination network identifies abnormal sound features in the feature vector to obtain an abnormal sound detection result;
and obtaining the abnormal sound detection result output by the discrimination network.
13. The apparatus of claim 12, wherein the discriminant network comprises a plurality of encoding modules and a result mapping module connected in sequence, each encoding module comprising a feedforward network layer and a multi-head attention layer;
the multi-head attention layer is used for calculating an attention vector based on the received feature vectors;
the feedforward network layer is used for performing feature calculation based on the attention vector to obtain a calculation result;
and the result mapping layer is used for converting the calculation result output by the last coding module into the abnormal sound detection result.
14. The apparatus of claim 12, wherein the discriminative network is a neural network that supports serialized inputs.
15. The apparatus of any of claims 9-11, further comprising:
the acquisition module is used for acquiring audio data;
and the processing module is used for processing the acquired audio data into a preset length to obtain the audio data to be detected.
16. The apparatus of claim 15, wherein,
the acquisition module is also used for acquiring audio data generated when the motor of the automobile seat operates.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-8.
19. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-8.
CN202111271257.7A 2021-10-29 2021-10-29 Abnormal sound detection method, abnormal sound detection device, electronic equipment and medium Pending CN114004996A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111271257.7A CN114004996A (en) 2021-10-29 2021-10-29 Abnormal sound detection method, abnormal sound detection device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111271257.7A CN114004996A (en) 2021-10-29 2021-10-29 Abnormal sound detection method, abnormal sound detection device, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN114004996A true CN114004996A (en) 2022-02-01

Family

ID=79925085

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111271257.7A Pending CN114004996A (en) 2021-10-29 2021-10-29 Abnormal sound detection method, abnormal sound detection device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN114004996A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114486254A (en) * 2022-02-09 2022-05-13 青岛迈金智能科技股份有限公司 Bicycle bearing detection method based on time/frequency double-domain analysis
WO2023207219A1 (en) * 2022-04-26 2023-11-02 荣耀终端有限公司 Abnormal sound test method, electronic device and storage medium


Similar Documents

Publication Publication Date Title
CN114004996A (en) Abnormal sound detection method, abnormal sound detection device, electronic equipment and medium
CN108169639B (en) Method for identifying switch cabinet fault based on parallel long-time and short-time memory neural network
CN109087667B (en) Voice fluency recognition method and device, computer equipment and readable storage medium
CN114360581A (en) Method and device for identifying equipment fault and electronic equipment
CN112131382B (en) Method and device for identifying high-rise areas of civil problems and electronic equipment
CN115081473A (en) Multi-feature fusion brake noise classification and identification method
CN113077821A (en) Audio quality detection method and device, electronic equipment and storage medium
CN108538290A (en) A kind of intelligent home furnishing control method based on audio signal detection
CN114048787B (en) Method and system for intelligently diagnosing bearing fault in real time based on Attention CNN model
CN115376526A (en) Power equipment fault detection method and system based on voiceprint recognition
CN116705039A (en) AI-based power equipment voiceprint monitoring system and method
CN113658586A (en) Training method of voice recognition model, voice interaction method and device
CN113270110A (en) ZPW-2000A track circuit transmitter and receiver fault diagnosis method
CN106887226A (en) Speech recognition algorithm based on artificial intelligence recognition
CN114141236B (en) Language model updating method and device, electronic equipment and storage medium
CN109376224A (en) Corpus filter method and device
CN112579429A (en) Problem positioning method and device
CN113744756A (en) Equipment quality inspection and audio data expansion method and related device, equipment and medium
CN113111173B (en) Regular expression-based method and device for determining alarm receiving alarm condition category
CN114678040B (en) Voice consistency detection method, device, equipment and storage medium
CN114333865B (en) Model training and tone conversion method, device, equipment and medium
CN115273854B (en) Service quality determining method and device, electronic equipment and storage medium
CN113240057B (en) High-precision error detection method and system based on electric power data acquisition
CN115577258A (en) Vibration signal recognition model training method, motor fault detection method and device
CN115691509A (en) Interference identification method suitable for abnormal sound detection of industrial equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination