CN109087669A

CN109087669A - Audio similarity detection method, device, storage medium and computer equipment

Info

Publication number: CN109087669A
Application number: CN201811233515.0A
Authority: CN
Inventors: 陈均; 赵旭峰; 沈锦龙; 樊征
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2018-10-23
Filing date: 2018-10-23
Publication date: 2018-12-25
Anticipated expiration: 2038-10-23
Also published as: CN112863547A; CN112863547B; CN109087669B

Abstract

Embodiments of the invention disclose an audio similarity detection method, device, storage medium and computer equipment. An embodiment may obtain audio to be detected; select, from the audio to be detected, the audio that meets a preset condition, and obtain the characteristic sequence of the audio to be detected according to the selected audio; obtain the reference characteristic sequence of benchmark audio; obtain the similarity distance between the characteristic sequence of the audio to be detected and the reference characteristic sequence of the benchmark audio; and determine the similarity between the audio to be detected and the benchmark audio according to the similarity distance. The scheme can filter out interference audio in the audio to be detected and retain the required audio features, and can reduce the influence of multiple factors on the similarity detection result, which improves the accuracy of audio similarity detection.

Description

Audio similarity detection method, device, storage medium and computer equipment

Technical field

The present invention relates to the technical field of data processing, and in particular to an audio similarity detection method, device, storage medium and computer equipment.

Background art

With the development of science and technology, people's lives have become increasingly rich. For example, a user can not only enjoy audio such as music and film and television, but can also imitate that audio for entertainment. In that case, the audio imitated by the user needs to be compared with the original audio in order to evaluate the similarity of the imitation.

In the prior art, taking song imitation as an example, audio similarity is detected by first collecting the audio imitated by the user and the original singer's audio mixed with accompaniment audio, and then directly calculating the similarity between the two. However, because both the original singer's audio and the audio imitated by the user are affected by many factors, calculating the similarity directly produces a large error, so the resulting similarity has low accuracy.

Summary of the invention

Embodiments of the present invention provide an audio similarity detection method, device, storage medium and computer equipment, intended to improve the accuracy of audio similarity detection.

To solve the above technical problem, embodiments of the present invention provide the following technical solutions:

An audio similarity detection method, comprising:

Obtaining audio to be detected;

Selecting, from the audio to be detected, the audio that meets a preset condition, and obtaining the characteristic sequence of the audio to be detected according to the selected audio;

Obtaining the reference characteristic sequence of benchmark audio;

Obtaining the similarity distance between the characteristic sequence of the audio to be detected and the reference characteristic sequence of the benchmark audio;

Determining the similarity between the audio to be detected and the benchmark audio according to the similarity distance.

An audio similarity detection device, comprising:

An audio acquiring unit, for obtaining audio to be detected;

A screening unit, for selecting, from the audio to be detected, the audio that meets a preset condition, and obtaining the characteristic sequence of the audio to be detected according to the selected audio;

A feature acquiring unit, for obtaining the reference characteristic sequence of benchmark audio;

A distance acquiring unit, for obtaining the similarity distance between the characteristic sequence of the audio to be detected and the reference characteristic sequence of the benchmark audio;

A determination unit, for determining the similarity between the audio to be detected and the benchmark audio according to the similarity distance.

Optionally, the screening unit includes:

A processing subunit, for preprocessing the audio to be detected to obtain preprocessed audio;

An obtaining subunit, for obtaining the energy spectrum of the preprocessed audio;

A screening subunit, for selecting, according to the energy spectrum, the audio that meets the preset condition from the preprocessed audio, and setting the frequency sequence corresponding to the selected audio as the characteristic sequence of the audio to be detected.

Optionally, the processing subunit is specifically used for:

Sampling the audio to be detected according to a preset sampling strategy to obtain sampled audio;

Framing the sampled audio according to a preset framing strategy to obtain framed audio;

Windowing the framed audio to obtain preprocessed audio in the discrete time domain.

Optionally, the obtaining subunit is specifically used for:

Performing an integral transform on the preprocessed audio to obtain the frequency spectrum corresponding to the preprocessed audio;

Determining the energy spectrum of the preprocessed audio according to the frequency spectrum.

Optionally, the screening subunit includes:

An obtaining module, for obtaining the sound intensity of the audio to be detected according to the energy spectrum;

A screening module, for selecting, from the audio to be detected, the audio whose sound intensity is greater than a preset threshold, to obtain the audio whose sound intensity meets the preset condition.

Optionally, the screening module is specifically used for:

Normalizing the sound intensity of the audio to be detected into a preset sound intensity range to obtain intensity-standardized audio;

Selecting, from the intensity-standardized audio, the audio whose sound intensity is greater than the preset threshold, to obtain the audio whose sound intensity meets the preset condition.

Optionally, when the benchmark audio includes target benchmark audio and interference audio, the feature acquiring unit includes:

A mean obtaining subunit, for obtaining a first root-mean-square energy mean of the target benchmark audio and a second root-mean-square energy mean of the interference audio;

An energy spectrum obtaining subunit, for obtaining a first energy spectrum of the target benchmark audio and a second energy spectrum of the interference audio;

An optimizing subunit, for optimizing the benchmark audio according to the first energy spectrum, the first root-mean-square energy mean, the second root-mean-square energy mean and the second energy spectrum, to obtain optimized benchmark audio;

A feature obtaining subunit, for obtaining the reference characteristic sequence of the optimized benchmark audio.

Optionally, the mean obtaining subunit is specifically used for:

Determining a first root-mean-square energy of the target benchmark audio and a second root-mean-square energy of the interference audio;

Obtaining a first frame number and a first frame length of the target benchmark audio, and a second frame number and a second frame length of the interference audio;

Determining the first root-mean-square energy mean of the target benchmark audio according to the first root-mean-square energy, the first frame number and the first frame length, and determining the second root-mean-square energy mean of the interference audio according to the second root-mean-square energy, the second frame number and the second frame length.

Optionally, the distance acquiring unit includes:

A coding subunit, for encoding the characteristic sequence of the audio to be detected according to a preset coding strategy to obtain a first encoded characteristic sequence, and encoding the reference characteristic sequence of the benchmark audio according to the same preset coding strategy to obtain a second encoded characteristic sequence;

A first determining subunit, for determining the similarity distance between the first encoded characteristic sequence and the second encoded characteristic sequence.

Optionally, the coding subunit is specifically used for:

Comparing, according to the preset coding strategy, every two adjacent characteristic values in the characteristic sequence of the audio to be detected;

When the previous characteristic value of two adjacent characteristic values is less than the following characteristic value, encoding the characteristic sequence of the audio to be detected at that position as a first encoded value; and

When the previous characteristic value of two adjacent characteristic values is equal to the following characteristic value, encoding the characteristic sequence of the audio to be detected at that position as a second encoded value; and

When the previous characteristic value of two adjacent characteristic values is greater than the following characteristic value, encoding the characteristic sequence of the audio to be detected at that position as a third encoded value;

Generating the first encoded characteristic sequence based on the first encoded value, the second encoded value and/or the third encoded value.
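
The comparison-based coding described above can be sketched as follows. This is only an illustrative sketch: the function name `encode_trend` and the concrete encoded values 1, 0 and 2 are assumptions made for the example, since the text does not fix the three encoded values.

```python
def encode_trend(values, up=1, flat=0, down=2):
    """Encode every pair of adjacent values as rising, equal or falling.

    The code values (up=1, flat=0, down=2) are hypothetical placeholders
    for the first, second and third encoded values.
    """
    codes = []
    for prev, nxt in zip(values, values[1:]):
        if prev < nxt:
            codes.append(up)      # previous value less than the following one
        elif prev == nxt:
            codes.append(flat)    # previous value equal to the following one
        else:
            codes.append(down)    # previous value greater than the following one
    return codes


# Example: a short frequency sequence (values in Hz, chosen arbitrarily).
freqs = [220.0, 220.0, 247.0, 262.0, 247.0]
print(encode_trend(freqs))  # [0, 1, 1, 2]
```

A sequence of n characteristic values yields n - 1 encoded values, so the encoded sequence records only the direction of change between neighbours and not the absolute values.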

Optionally, the similarity distance includes at least an edit distance, a Euclidean distance and a Hamming distance, and the first determining subunit is specifically used for:

Determining at least the edit distance, the Euclidean distance and the Hamming distance between the first encoded characteristic sequence and the second encoded characteristic sequence;

Normalizing the edit distance, the Euclidean distance and the Hamming distance respectively to obtain the similarity distance.
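
A minimal sketch of the three distances named above, computed on two encoded characteristic sequences, is given below. The normalization shown (dividing by the sequence length, or by the largest possible Euclidean distance for codes in the range 0 to 2) is an assumption made for illustration, because the text does not specify how each distance is normalized. The sketch also assumes non-empty sequences of equal length.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance, computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]


def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalized_distances(a, b, max_code_gap=2):
    """Scale each distance into [0, 1]; the scaling rules are assumptions."""
    n = len(a)
    return {
        "edit": edit_distance(a, b) / max(len(a), len(b)),
        "hamming": hamming_distance(a, b) / n,
        "euclidean": euclidean_distance(a, b) / (max_code_gap * math.sqrt(n)),
    }


print(normalized_distances([0, 1, 1, 2], [0, 1, 2, 2]))
```

The edit distance tolerates insertions and deletions, so it is less sensitive to timing offsets than the position-by-position Hamming and Euclidean distances.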

Optionally, the determination unit includes:

A constructing subunit, for constructing, for each of the edit distance, the Euclidean distance and the Hamming distance, an affine function between that distance and a sub-similarity;

A second determining subunit, for determining the sub-similarity corresponding to each distance according to the affine function corresponding to that distance;

A third determining subunit, for determining the similarity between the audio to be detected and the benchmark audio according to the sub-similarities.

Optionally, the third determining subunit is specifically used for:

Setting a first weight value for the sub-similarity of the edit distance, and setting a second weight value for the sub-similarity of the Hamming distance;

Setting the sub-similarity of the Euclidean distance as a penalty term;

Determining the similarity between the audio to be detected and the benchmark audio according to the first weight value, the second weight value and the penalty term.

Optionally, the audio similarity detection device further includes:

A resource transfer unit, for executing a virtual resource transfer operation, and/or displaying information related to the similarity detection result of the audio to be detected, when the similarity between the audio to be detected and the benchmark audio is greater than a preset similarity threshold.

Optionally, the audio similarity detection device further includes:

An unlocking unit, for executing an operation of unlocking a sound lock when the similarity between the audio to be detected and the benchmark audio is greater than a preset similarity threshold.

A storage medium storing a plurality of instructions, the instructions being suitable for loading by a processor to execute any audio similarity detection method provided by the embodiments of the present invention.

A computer equipment, including a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to execute any audio similarity detection method provided by the embodiments of the present invention.

Embodiments of the present invention may obtain audio to be detected, select from it the audio that meets a preset condition, and obtain the characteristic sequence of the audio to be detected according to the selected audio, so that interference audio in the audio to be detected is filtered out and the required audio features are retained. The reference characteristic sequence of benchmark audio is also obtained. The similarity distance between the characteristic sequence of the audio to be detected and the reference characteristic sequence of the benchmark audio is then obtained, for example the edit distance, the Euclidean distance and the Hamming distance. The similarity distance can reduce the influence of multiple factors on the similarity detection result. The similarity between the audio to be detected and the benchmark audio can then be determined according to the similarity distance, which improves the accuracy of audio similarity detection.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of a scenario of the audio similarity detection method provided by an embodiment of the present invention;

Fig. 2 is a flow diagram of the audio similarity detection method provided by an embodiment of the present invention;

Fig. 3 is another flow diagram of the audio similarity detection method provided by an embodiment of the present invention;

Fig. 4 is another flow diagram of the audio similarity detection method provided by an embodiment of the present invention;

Fig. 5 is a schematic diagram of a terminal displaying a karaoke (K song) interface provided by an embodiment of the present invention;

Fig. 6 (a) to 6 (d) are original time-domain sampling diagrams provided by an embodiment of the present invention;

Fig. 7 (a) to 7 (d) are spectral feature diagrams provided by an embodiment of the present invention;

Fig. 8 is a flow diagram of obtaining a characteristic sequence provided by an embodiment of the present invention;

Fig. 9 is a flow diagram of screening a frequency sequence provided by an embodiment of the present invention;

Fig. 10 (a) to 10 (d) are spectral feature diagrams after feature filtering provided by an embodiment of the present invention;

Fig. 11 (a) to 11 (c) are schematic diagrams of the first-dimension characteristic sequence provided by an embodiment of the present invention;

Fig. 12 (a) to 12 (c) are schematic diagrams of the first encoded characteristic sequence provided by an embodiment of the present invention;

Fig. 13 is a schematic diagram of a terminal displaying a red packet amount and a song provided by an embodiment of the present invention;

Fig. 14 is a schematic diagram of a terminal displaying information prompting the user to sing again provided by an embodiment of the present invention;

Fig. 15 is a schematic diagram of a terminal displaying a voice message provided by an embodiment of the present invention;

Fig. 16 is a structural schematic diagram of the audio similarity detection device provided by an embodiment of the present invention;

Fig. 17 is another structural schematic diagram of the audio similarity detection device provided by an embodiment of the present invention;

Fig. 18 is another structural schematic diagram of the audio similarity detection device provided by an embodiment of the present invention;

Fig. 19 is another structural schematic diagram of the audio similarity detection device provided by an embodiment of the present invention;

Fig. 20 is another structural schematic diagram of the audio similarity detection device provided by an embodiment of the present invention;

Fig. 21 is a structural schematic diagram of the computer equipment provided by an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Embodiments of the present invention provide an audio similarity detection method, device, storage medium and computer equipment.

Referring to Fig. 1, Fig. 1 is a schematic diagram of a scenario of the audio similarity detection method provided by an embodiment of the present invention. The method can be applied to an audio similarity detection device, which can be integrated in a terminal such as a tablet computer, mobile phone or laptop that has a storage unit, is equipped with a microprocessor and has computing capability. For example, the terminal may obtain audio to be detected, which may be audio generated by a user's recording. The terminal may then select the audio that meets a preset condition from the audio to be detected and obtain the characteristic sequence of the audio to be detected according to the selected audio. For example, the audio to be detected may be preprocessed by sampling, framing and windowing to obtain preprocessed audio; an integral transform is applied to the preprocessed audio to obtain its frequency spectrum; the energy spectrum of the preprocessed audio is determined from the frequency spectrum; and the audio that meets the preset condition is selected from the preprocessed audio according to the energy spectrum, so that interference audio in the audio to be detected is filtered out and the required audio features are retained. The terminal also obtains the reference characteristic sequence of benchmark audio; the benchmark audio may be audio obtained from a server or audio obtained in another way. Having obtained the characteristic sequence of the audio to be detected and the reference characteristic sequence of the benchmark audio, the terminal applies extended Manchester encoding to the two characteristic sequences and determines the similarity distance between the two encoded characteristic sequences, for example the edit distance, the Euclidean distance and the Hamming distance. The similarity distance can reduce the influence of multiple factors on the similarity detection result. Finally, the similarity between the audio to be detected and the benchmark audio is determined according to the similarity distance, which improves the accuracy of audio similarity detection.

It should be noted that the scenario shown in Fig. 1 is only an example. The scenario is described in order to illustrate the technical solutions of the embodiments of the present invention more clearly and does not limit them. Those of ordinary skill in the art will appreciate that, as audio similarity detection methods evolve and new service scenarios appear, the technical solutions provided by the embodiments of the present invention are equally applicable to similar technical problems.

Each is described in detail below.

This embodiment is described from the perspective of the audio similarity detection device, which can be integrated in a terminal such as a tablet computer, mobile phone or laptop that has a storage unit, is equipped with a microprocessor and has computing capability.

An audio similarity detection method, comprising: obtaining audio to be detected; selecting, from the audio to be detected, the audio that meets a preset condition, and obtaining the characteristic sequence of the audio to be detected according to the selected audio; obtaining the reference characteristic sequence of benchmark audio; obtaining the similarity distance between the characteristic sequence of the audio to be detected and the reference characteristic sequence of the benchmark audio; and determining the similarity between the audio to be detected and the benchmark audio according to the similarity distance.

Referring to Fig. 2, Fig. 2 is a flow diagram of the audio similarity detection method provided by an embodiment of the present invention. The audio similarity detection method may include:

In step S101, audio to be detected is obtained.

The audio to be detected may be audio of a user singing a song or speaking a passage. For example, when the audio similarity detection method is applied to a song scoring scenario, the original singer's audio and the accompaniment audio of a song may be obtained as the benchmark audio, and the audio of the user singing that song may be recorded as the audio to be detected. The similarity between the benchmark audio and the audio to be detected can subsequently be determined, and when the similarity is greater than a preset similarity threshold the user may, for example, receive a red packet or gain experience points.

When the audio similarity detection method is applied to a sound lock scenario, a section of benchmark audio recorded in advance by the user may be obtained as the sound lock, and at unlock time the audio recorded by the user for unlocking is obtained as the audio to be detected. The similarity between the benchmark audio and the audio to be detected can subsequently be determined, and the sound lock is unlocked only when the similarity is greater than a preset similarity threshold (for example, close to one hundred percent).

It should be noted that the audio similarity detection method can also be applied to other fields of sound processing, for example pitch detection, loudness detection or sound quality detection.

For example, in the process of obtaining the audio to be detected, the audio of the user speaking or singing may be collected in an audio data format with a sample rate of 16 kHz or another sample rate. The resulting audio to be detected may be a continuous pulse code modulation (Pulse Code Modulation, PCM) signal with a code rate of 16 bit or another code rate.

In step S102, the audio that meets a preset condition is selected from the audio to be detected, and the characteristic sequence of the audio to be detected is obtained according to the selected audio.

After the audio to be detected is obtained, spectral feature extraction, feature filtering and screening can be performed on it to select the required characteristic sequence. The preset condition may concern the sound intensity of the audio and can be set flexibly according to actual needs. The characteristic sequence may include the frequency sequence selected from the audio to be detected.

In some embodiments, selecting the audio that meets the preset condition from the audio to be detected and obtaining the characteristic sequence of the audio to be detected according to the selected audio may include:

(1) preprocessing the audio to be detected to obtain preprocessed audio;

(2) obtaining the energy spectrum of the preprocessed audio;

(3) selecting, according to the energy spectrum, the audio that meets the preset condition from the preprocessed audio, and setting the frequency sequence corresponding to the selected audio as the characteristic sequence of the audio to be detected.

First, to make the audio to be detected easier to screen, it can be preprocessed. In some embodiments, preprocessing the audio to be detected to obtain preprocessed audio may include: sampling the audio to be detected according to a preset sampling strategy to obtain sampled audio; framing the sampled audio according to a preset framing strategy to obtain framed audio; and windowing the framed audio to obtain discrete preprocessed audio.

Specifically, the audio to be detected may be sampled, framed and windowed in turn. Framing divides the audio into individual frames; for example, one minute of audio divided at one frame per second yields 60 frames. Framing may cause the spectral energy of the audio to leak, so the framed audio can further be windowed. Windowing truncates the signal with a chosen truncation function (that is, a window function) so that the spectral energy of the audio is more concentrated and closer to the true spectrum. After sampling, framing and windowing, the audio is an audio signal consisting of a discrete amplitude sequence distributed along the time axis. For example, the audio to be detected may be sampled according to the preset sampling strategy at a sample frequency of 44100 Hz or another sample frequency to obtain sampled audio; the preset sampling strategy may be one that satisfies the Nyquist sampling theorem. Then, according to the preset framing strategy, for example a frame length of 512 or 1024 sampling points and a frame shift of one half or one third of the frame length, the sampled audio is framed to obtain framed audio. The framed audio can then be windowed with a Hanning window function, a rectangular window function, a Hamming window function or the like to obtain discrete preprocessed audio.

Here, the frame length is the length of one data frame of audio. For example, when a frame contains 512 sampling points and the sample frequency is 44100 Hz, the frame length is 1/44100 × 512, which is approximately 11.6 milliseconds. The frame shift may be the overlapping part of two adjacent frames; for example, when two adjacent frames overlap by half the frame length, the frame shift is half the frame length.
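
The framing and windowing steps described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function names are hypothetical, a frame length of 512 samples with a frame shift of half the frame length is chosen from the example values in the text, trailing samples that do not fill a whole frame are simply dropped, and the Hamming window is used as one of the window functions mentioned.

```python
import math


def frame_signal(samples, frame_len=512, frame_shift=256):
    """Split samples into frames whose start positions differ by frame_shift.

    With frame_shift = frame_len / 2, adjacent frames overlap by half a frame.
    Trailing samples that do not fill a whole frame are dropped (an assumption).
    """
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, frame_shift)]


def hamming_window(n):
    """Standard Hamming window: 0.54 - 0.46 * cos(2*pi*k / (n - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def preprocess(samples, frame_len=512, frame_shift=256):
    """Frame the signal, then multiply every frame by the window."""
    window = hamming_window(frame_len)
    return [[x * w for x, w in zip(frame, window)]
            for frame in frame_signal(samples, frame_len, frame_shift)]


sample_rate = 44100
frame_ms = 512 / sample_rate * 1000   # duration of one 512-sample frame
# One second of a 440 Hz sine wave as stand-in input.
one_second = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
frames = preprocess(one_second)
print(round(frame_ms, 1), len(frames), len(frames[0]))  # 11.6 171 512
```

The printed frame duration reproduces the 11.6 millisecond figure given in the text for 512 sampling points at 44100 Hz.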

Then, the energy spectrum of the preprocessed audio is obtained. In some embodiments, obtaining the energy spectrum of the preprocessed audio may include: performing an integral transform on the preprocessed audio to obtain the frequency spectrum corresponding to the preprocessed audio; and determining the energy spectrum of the preprocessed audio according to the frequency spectrum.

The integral transform may include the Fourier transform, the Laplace transform and so on; this embodiment is described in detail taking the Fourier transform as an example. For example, a 2048-point or 1024-point short-time Fourier transform may be applied to the preprocessed audio to obtain the frequency spectrum of each frame of the preprocessed audio. The squared modulus of the spectrum is then taken to obtain the energy spectrum of the preprocessed audio. The energy spectrum may be a matrix composed of the energy of each frame at each frequency.
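
As a rough sketch of this step, the energy spectrum can be computed frame by frame as the squared modulus of a discrete Fourier transform. The code below uses a naive O(n²) transform on a tiny 16-point frame purely for readability; a practical implementation would use a fast Fourier transform with 1024 or 2048 points. The function names are assumptions made for the example.

```python
import cmath
import math


def dft(frame, n_fft):
    """Naive DFT of a real frame, zero-padded to n_fft; returns bins 0..n_fft//2."""
    if len(frame) > n_fft:
        raise ValueError("frame is longer than n_fft")
    padded = list(frame) + [0.0] * (n_fft - len(frame))
    return [sum(x * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n, x in enumerate(padded))
            for k in range(n_fft // 2 + 1)]


def energy_spectrum(frames, n_fft):
    """Squared modulus of each frame's spectrum: a frames-by-bins matrix."""
    return [[abs(c) ** 2 for c in dft(frame, n_fft)] for frame in frames]


# A 16-sample cosine with exactly 2 cycles puts its energy into bin 2.
frame = [math.cos(2 * math.pi * 2 * n / 16) for n in range(16)]
S = energy_spectrum([frame], n_fft=16)
peak_bin = max(range(len(S[0])), key=S[0].__getitem__)
print(peak_bin)  # 2
```

Each row of `S` corresponds to one frame and each column to one frequency bin, which matches the matrix layout described above.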

It should be noted that, in addition to the spectral features extracted by the Fourier transform, the features extracted in the embodiments of the present invention may also include other audio processing parameters such as the short-time average zero-crossing rate, short-time energy, energy entropy, spectral centroid, spectral spread, spectral entropy, spectral flux, spectral roll-off point, chroma features and/or Mel-frequency cepstral coefficients. These different features are suitable for different application scenarios.

Next, in order to filter out interference audio with low sound intensity, the audio that meets the preset condition can be selected from the audio to be detected based on the energy spectrum of the preprocessed audio. In some embodiments, selecting the audio that meets the preset condition from the preprocessed audio according to the energy spectrum may include: obtaining the sound intensity of the audio to be detected according to the energy spectrum; and selecting, from the audio to be detected, the audio whose sound intensity is greater than a preset threshold, to obtain the audio whose sound intensity meets the preset condition.

For example, the energy spectrum S may be converted into a matrix P expressed in sound intensity. The formula for converting the energy spectrum into sound intensity may be as follows:

P = a · log10(S / ref) (1)

Where S denotes the energy spectrum matrix, P denotes the sound intensity matrix, and a and ref denote coefficients; for example, a may take 10 and ref may take 1 or another value, and P equals 0 when S equals ref. The sound intensity of the audio to be detected can be determined according to formula (1). The audio whose sound intensity is greater than the preset threshold can then be selected from the audio to be detected to obtain the audio whose sound intensity meets the preset condition, so that interference audio with low sound intensity is filtered out. The preset threshold can be set flexibly according to actual needs, and its specific value is not limited here.
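
The conversion from energy spectrum to sound intensity can be sketched as follows, assuming it takes the usual decibel form P = a · log10(S / ref), which is consistent with the stated property that P equals 0 when S equals ref. The function name and the small `floor` value that guards against taking the logarithm of zero are assumptions added for the example and are not taken from the text.

```python
import math


def power_to_intensity(S, a=10.0, ref=1.0, floor=1e-10):
    """Apply P = a * log10(S / ref) element-wise to an energy-spectrum matrix.

    `floor` is an added safeguard (an assumption) so that silent bins with
    zero energy do not cause a math domain error.
    """
    return [[a * math.log10(max(s, floor) / ref) for s in row] for row in S]


S = [[1.0, 10.0, 100.0]]
print(power_to_intensity(S))  # [[0.0, 10.0, 20.0]]
```

With a = 10 and ref = 1, every tenfold increase in energy adds 10 to the sound intensity value.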

In some embodiments, selecting the audio whose sound intensity is greater than the preset threshold from the audio to be detected, to obtain the audio whose sound intensity meets the preset condition, may include: normalizing the sound intensity of the audio to be detected into a preset sound intensity range to obtain intensity-standardized audio; and selecting, from the intensity-standardized audio, the audio whose sound intensity is greater than the preset threshold, to obtain the audio whose sound intensity meets the preset condition.

For example, the sound intensity P of the audio to be detected may be normalized into the range 0 to b decibels (dB), which matches the range of human auditory perception. The standardization formula is as follows:

S_P = max(P, max(P) - b) (2)

Where the preset sound intensity range can be set flexibly according to actual needs; for example, the sound intensity P of the audio to be detected may be normalized into 0 to 80 dB, that is, b may take 80. S_P denotes the sound intensity matrix of the intensity-standardized audio, and P denotes the sound intensity matrix before standardization.

A preset threshold of sound intensity can then be set. Sound intensities in the intensity-standardized audio that are lower than the preset threshold are set to zero, and those higher than the preset threshold are retained, giving the audio whose sound intensity meets the preset condition. Since the accompaniment, background sound and the like in the audio to be detected are all interference audio, setting the preset threshold allows the interference audio to be filtered reasonably.
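
Formula (2) and the subsequent thresholding can be sketched as follows. The function names are hypothetical, and the choice to keep values exactly equal to the threshold is an assumption, since the text only covers values lower and higher than the threshold.

```python
def standardize_intensity(P, b=80.0):
    """Formula (2): S_P = max(P, max(P) - b), applied element-wise.

    Values more than b below the global peak are raised to (peak - b).
    """
    peak = max(max(row) for row in P)
    return [[max(p, peak - b) for p in row] for row in P]


def apply_threshold(S_P, threshold):
    """Set intensities below the threshold to zero and keep the rest."""
    return [[p if p >= threshold else 0.0 for p in row] for row in S_P]


# Two frames, three frequency bins each (illustrative values).
P = [[5.0, 60.0, 100.0],
     [10.0, 95.0, 40.0]]
S_P = standardize_intensity(P, b=80.0)
print(S_P)                         # [[20.0, 60.0, 100.0], [20.0, 95.0, 40.0]]
print(apply_threshold(S_P, 50.0))  # [[0.0, 60.0, 100.0], [0.0, 95.0, 0.0]]
```

After this step only the strong components of each frame remain, which is what allows weak accompaniment and background sound to be discarded.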

After filtering out and meeting the audio of preset condition, the corresponding frequency sequence of the audio filtered out can be set to The characteristic sequence of detection audio is arranged for example, can be ranked up from big to small to the audio filtered out according to intensity of sound Audio after sequence；From the audio for extracting maximum acoustic intensity after sequence in audio, the corresponding frequency of the audio of maximum acoustic intensity Sequence is exactly the characteristic sequence of audio to be detected.

For example, the filtered sound intensity matrix S_P (the selected audio) may be sorted by sound intensity in descending order to obtain sorted audio. Then, the preset audio with the largest sound intensity (for example, the 6 dimensions with the largest sound intensity) is extracted from the sorted audio, and a frequency sequence of a preset dimension (for example, 6 dimensions) is extracted from the frequency matrix of the preset audio; for instance, for every frame the frequencies of the six largest sound intensities are extracted. This frequency sequence is the finally obtained characteristic sequence of the audio to be detected.
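The per-frame sorting and selection described above might look like the following sketch. The name `top_n_frequencies` and the `freqs` list (the center frequency of each bin) are hypothetical, introduced only for illustration.

```python
def top_n_frequencies(S_P, freqs, n=6):
    """For each frame, return the frequencies of the n bins with the largest
    sound intensity, ordered from strongest to weakest."""
    features = []
    for frame in S_P:
        # Bin indices sorted by sound intensity in descending order.
        order = sorted(range(len(frame)), key=lambda k: frame[k], reverse=True)
        features.append([freqs[k] for k in order[:n]])
    return features


S_P = [[1.0, 5.0, 3.0], [9.0, 2.0, 4.0]]
freqs = [100.0, 200.0, 300.0]
print(top_n_frequencies(S_P, freqs, n=2))
```

With n = 6, each frame contributes a six-dimensional frequency vector, matching the example dimension given in the text.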

The prior art does not perform sufficient feature engineering; for example, audio features are neither filtered nor selected. Yet the audio to be detected has pauses and variations in loudness, so its features differ in length and magnitude in the time domain and the frequency domain. The embodiment of the present invention performs sufficient feature engineering on the audio to be detected: the audio is pre-processed, the energy spectrum is obtained, the spectral features are filtered and sorted by energy, and the first n dimensions (for example n = 6) with the largest energy are selected, which reduces the error introduced when the similarity is subsequently determined.

It should be noted that when the audio to be detected contains interference audio such as accompaniment audio, for example when it includes user audio and accompaniment audio, the accompaniment audio may be weakened in order to improve the accuracy of the subsequent similarity determination. Optionally, in the process of obtaining the characteristic sequence of the audio to be detected: the root-mean-square (RMS) energy mean of the user audio and the RMS energy mean of the accompaniment audio are obtained; the energy spectrum of the user audio and the energy spectrum of the accompaniment audio are obtained; the audio to be detected is optimized according to the energy spectrum of the user audio, the RMS energy mean of the user audio, the RMS energy mean of the accompaniment audio and the energy spectrum of the accompaniment audio, to obtain optimized audio to be detected; and the characteristic sequence of the optimized audio to be detected is obtained.

Here, optimization means weakening or filtering the interference audio, such as accompaniment audio, contained in the audio. Its purpose is to weaken the influence of the interference audio, for example to reduce the influence of environmental noise on the similarity determination. Because the audio contains interference audio before optimization, the audio may be optimized to weaken the influence of the interference audio on the similarity detection result; in the resulting optimized audio, the interference audio has been weakened or filtered out.

Optionally, obtaining the root-mean-square (RMS) energy mean of the user audio and the RMS energy mean of the accompaniment audio may include: determining the RMS energy of the user audio and the RMS energy of the accompaniment audio; obtaining the frame number and frame length of the user audio, and the frame number and frame length of the accompaniment audio; and determining the RMS energy mean of the user audio from the RMS energy, frame number and frame length of the user audio, and the RMS energy mean of the accompaniment audio from the RMS energy, frame number and frame length of the accompaniment audio.

For example, the RMS energy of each frame of the user audio is determined first, and the frame number and frame length of the user audio are then obtained. The root-mean-square (RMS) energy mean of the user audio is determined from the RMS energy, the frame number and the frame length of the user audio according to formula (3).

In formula (3), M denotes the frame number, N denotes the frame length, and x_ij(n) denotes the amplitude of the j-th sampling point of the i-th frame. The RMS energy mean of the accompaniment audio may likewise be determined according to formula (3).

The ratio between the root-mean-square (RMS) energy mean of the user audio and the RMS energy mean of the accompaniment audio may then be determined; for example, the RMS energy mean of the user audio is divided by the RMS energy mean of the accompaniment audio:

ratio = RMS energy mean of user audio / RMS energy mean of accompaniment audio (4)

This ratio reflects the relative strength of the sound intensity of the user audio and the accompaniment audio.

Then, the energy spectrum of the user audio and the energy spectrum of the accompaniment audio may be obtained, and the audio to be detected is optimized according to the energy spectrum of the user audio, the energy spectrum of the accompaniment audio, and the ratio between the root-mean-square (RMS) energy mean of the user audio and the RMS energy mean of the accompaniment audio, to obtain the optimized audio to be detected. For example, the energy spectrum of the accompaniment audio, scaled by the ratio, is subtracted from the energy spectrum of the user audio:

difference matrix = user audio energy spectrum - accompaniment audio energy spectrum × ratio (5)

Wherein the difference matrix is the optimized audio to be detected. By weakening the accompaniment audio, the optimized audio to be detected enhances the feature matrix of the user audio (i.e. the vocal features). The characteristic sequence of the optimized audio to be detected can then be obtained.
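The accompaniment-weakening step can be sketched as below. Because the expression of formula (3) is not reproduced in this text, `rms_energy_mean` assumes one plausible form (the mean over M frames of the per-frame root-mean-square amplitude over N samples); the function names and the list-of-lists layout are likewise assumptions.

```python
import math


def rms_energy_mean(frames):
    """Assumed form of formula (3): mean over the M frames of the per-frame
    root-mean-square amplitude computed over the N samples of each frame."""
    M = len(frames)
    total = 0.0
    for frame in frames:
        N = len(frame)
        total += math.sqrt(sum(x * x for x in frame) / N)
    return total / M


def weaken_accompaniment(user_spec, acc_spec, user_frames, acc_frames):
    """Formula (5): user energy spectrum minus ratio-scaled accompaniment spectrum."""
    # Ratio of the two RMS energy means (user audio over accompaniment audio).
    ratio = rms_energy_mean(user_frames) / rms_energy_mean(acc_frames)
    return [[u - a * ratio for u, a in zip(u_row, a_row)]
            for u_row, a_row in zip(user_spec, acc_spec)]


user_frames = [[2.0, 2.0], [4.0, 4.0]]
acc_frames = [[1.0, 1.0], [2.0, 2.0]]
print(weaken_accompaniment([[10.0, 8.0]], [[1.0, 3.0]], user_frames, acc_frames))
```

The sketch does not clip negative differences; whether to do so is not stated in the text.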

The prior art does not consider the interference of the accompaniment audio, environmental noise and the like contained in the audio to be detected. For example, when the audio to be detected has undergone extensive mixing, the user audio and the accompaniment audio differ considerably, and determining the similarity directly leads to a large error. In the embodiment of the present invention, the accompaniment audio in the audio to be detected is weakened according to the relative strength of the user audio and the accompaniment audio, which enhances the user audio used for comparison. Therefore, whether or not the user sings a cappella, the similarity between the benchmark audio and the user audio can be detected accurately.

In step S103, the reference characteristic sequence of the benchmark audio is obtained.

The benchmark audio may be obtained from a server or recorded in advance. For example, in a song-scoring application scenario, the original-singer audio and accompaniment audio of a song may be downloaded from a server or pre-recorded as the benchmark audio; in a voice-lock application scenario, a segment of audio recorded by the user in advance may be obtained as the benchmark audio (i.e. the voice lock). The reference characteristic sequence of the benchmark audio may include a frequency sequence, selected from the benchmark audio, that meets the preset condition. The reference characteristic sequence may be determined in advance and stored locally, or may be obtained by performing feature extraction on the benchmark audio when the reference characteristic sequence is needed.

For example, in the process of obtaining the benchmark audio, the benchmark audio may be acquired in an audio data format with a sample rate of 16 kHz or another sample rate; the benchmark audio may be a continuous PCM signal of 16 bits or another bit depth.

Optionally, after the benchmark audio is obtained, target audio that meets the preset condition may be selected from the benchmark audio, and the reference characteristic sequence of the benchmark audio is obtained from the selected target audio.

Optionally, selecting from the benchmark audio the target audio that meets the preset condition, and obtaining the reference characteristic sequence of the benchmark audio from the selected target audio, may include: pre-processing the benchmark audio to obtain pre-processed benchmark audio; obtaining the energy spectrum of the pre-processed benchmark audio; and selecting from the benchmark audio, according to the energy spectrum, the target audio that meets the preset condition, and setting the frequency sequence corresponding to the selected target audio as the reference characteristic sequence of the benchmark audio.

To facilitate selection from the benchmark audio, the benchmark audio may be pre-processed. Optionally, pre-processing the benchmark audio to obtain the pre-processed benchmark audio may include: sampling the benchmark audio according to a preset sampling strategy to obtain sampled benchmark audio; framing the sampled benchmark audio according to a preset framing strategy to obtain framed benchmark audio; and windowing the framed benchmark audio to obtain pre-processed benchmark audio in the discrete time domain.

For example, according to the preset sampling strategy, the benchmark audio may be sampled at a sampling frequency of 44100 Hz or another sampling frequency to obtain the sampled benchmark audio; the preset sampling strategy may be one that satisfies the Nyquist sampling theorem. Then, according to the preset framing strategy, the sampled benchmark audio is framed with a frame length of 512 or 1024 sampling points or the like and a frame shift of one half or one third of the frame length or the like, to obtain the framed benchmark audio. The framed benchmark audio may then be windowed using a Hamming window, a rectangular window, a Hann window or the like, to obtain the pre-processed benchmark audio in the discrete time domain; that is, the pre-processed benchmark audio may be a discrete time-domain sequence of audio signal amplitudes.

Optionally, obtaining the energy spectrum of the pre-processed benchmark audio may include: performing an integral transform on the pre-processed benchmark audio to obtain the frequency spectrum corresponding to the pre-processed benchmark audio; and determining the energy spectrum of the pre-processed benchmark audio from the frequency spectrum.

The integral transform may include the Fourier transform, the Laplace transform and the like. Taking the Fourier transform as an example, a short-time Fourier transform of 2048 points, 1024 points or the like may be performed on the pre-processed benchmark audio to obtain the frequency spectrum corresponding to each frame of the pre-processed benchmark audio. The squared modulus of the frequency spectrum is then taken to obtain the energy spectrum corresponding to the pre-processed benchmark audio; the energy spectrum may be a matrix composed of the energy of each frame of the benchmark audio at each frequency.
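The framing, windowing and energy-spectrum steps can be illustrated with the following self-contained sketch. It is not the claimed implementation: the function names are hypothetical, the transform length is assumed equal to the frame length, and a naive discrete Fourier transform is used only to avoid external dependencies (in practice a fast Fourier transform routine would normally be used).

```python
import cmath
import math


def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames of frame_len samples, shifted by hop."""
    return [x[s:s + frame_len] for s in range(0, len(x) - frame_len + 1, hop)]


def hamming(N):
    """Hamming window of length N."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]


def energy_spectrum(x, frame_len=512, hop=256):
    """Return one row per frame holding |X(k)|^2 for the non-negative bins."""
    win = hamming(frame_len)
    spec = []
    for frame in frame_signal(x, frame_len, hop):
        windowed = [v * w for v, w in zip(frame, win)]
        row = []
        for k in range(frame_len // 2 + 1):
            # Naive DFT of the windowed frame at bin k.
            X_k = sum(v * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n, v in enumerate(windowed))
            row.append(abs(X_k) ** 2)  # squared modulus = energy
        spec.append(row)
    return spec


# A sine wave that falls exactly on bin 4 of a 16-point frame.
x = [math.sin(2 * math.pi * 4 * n / 16) for n in range(32)]
spec = energy_spectrum(x, frame_len=16, hop=8)
print(len(spec), max(range(len(spec[0])), key=lambda k: spec[0][k]))
```

The hop of half the frame length corresponds to the "one half of the frame length" frame shift mentioned in the text; the small frame length in the usage lines is chosen only to keep the example fast.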

Optionally, selecting from the benchmark audio, according to the energy spectrum, the target audio that meets the preset condition may include: obtaining the sound intensity of the benchmark audio from the energy spectrum; and selecting from the benchmark audio the audio whose sound intensity is greater than the preset threshold, to obtain the target audio whose sound intensity meets the preset condition.

For example, the energy spectrum of the benchmark audio may be converted into sound intensity according to the above formula (1). The audio whose sound intensity is greater than the preset threshold may then be selected from the benchmark audio to obtain the target audio whose sound intensity meets the preset condition, so that interference audio with lower sound intensity is filtered out. The preset threshold may be set flexibly according to actual needs, and its specific value is not limited here.

Optionally, selecting from the benchmark audio the audio whose sound intensity is greater than the preset threshold, to obtain the target audio whose sound intensity meets the preset condition, may include: normalizing the sound intensity of the benchmark audio into a preset sound intensity range to obtain sound-intensity-normalized benchmark audio; and selecting, from the sound-intensity-normalized benchmark audio, the audio whose sound intensity is greater than the preset threshold, to obtain the target audio whose sound intensity meets the preset condition.

For example, the sound intensity P of the benchmark audio may be normalized into 0 to 80 dB according to the above formula (2), which matches the range of human auditory perception. A preset threshold of sound intensity may then be set: sound intensities in the sound-intensity-normalized benchmark audio that are lower than the preset threshold may be set to zero, and those higher than the preset threshold are retained, giving the target audio whose sound intensity meets the preset condition. Because the accompaniment, background sound and the like in the benchmark audio are all interference audio, setting the preset threshold allows the interference audio to be filtered reasonably.

After the target audio that meets the preset condition has been selected, the frequency sequence corresponding to the selected target audio may be set as the reference characteristic sequence of the benchmark audio. For example, the selected target audio may be sorted by sound intensity in descending order to obtain sorted target audio; the audio with the maximum sound intensity is extracted from the sorted target audio, and the frequency sequence corresponding to that audio is the characteristic sequence of the benchmark audio. For instance, for every frame the frequencies of the six largest sound intensities are extracted, and this frequency sequence is the finally obtained characteristic sequence of the benchmark audio. Because the audio to be detected has pauses and variations in loudness, its features differ in length and magnitude in the time domain and the frequency domain. In view of these characteristics, the benchmark audio is pre-processed, the energy spectrum is obtained, the spectral features are filtered and sorted by energy, and the first n dimensions with the largest energy are selected, which reduces the error introduced when the similarity is subsequently determined.

In some embodiments, when the benchmark audio includes target benchmark audio and interference audio, obtaining the reference characteristic sequence of the benchmark audio may include: obtaining a first root-mean-square (RMS) energy mean of the target benchmark audio and a second RMS energy mean of the interference audio; obtaining a first energy spectrum of the target benchmark audio and a second energy spectrum of the interference audio; optimizing the benchmark audio according to the first energy spectrum, the first RMS energy mean, the second RMS energy mean and the second energy spectrum, to obtain optimized benchmark audio; and obtaining the reference characteristic sequence of the optimized benchmark audio.

In some embodiments, obtaining the first root-mean-square (RMS) energy mean of the target benchmark audio and the second RMS energy mean of the interference audio may include: determining a first RMS energy of the target benchmark audio and a second RMS energy of the interference audio; obtaining a first frame number and a first frame length of the target benchmark audio, and a second frame number and a second frame length of the interference audio; and determining the first RMS energy mean of the target benchmark audio from the first RMS energy, the first frame number and the first frame length, and the second RMS energy mean of the interference audio from the second RMS energy, the second frame number and the second frame length.

For example, the first root-mean-square (RMS) energy mean of the target benchmark audio and the second RMS energy mean of the interference audio may be determined according to the above formula (3). The ratio between the first RMS energy mean and the second RMS energy mean is then determined. Next, the energy spectrum of the interference audio, scaled by this ratio, is subtracted from the energy spectrum of the target benchmark audio, thereby optimizing the benchmark audio and obtaining the optimized benchmark audio. The optimized benchmark audio may be one in which the interference audio has been weakened and the feature matrix of the target benchmark audio has been enhanced. Finally, the reference characteristic sequence of the optimized benchmark audio is obtained. In this way, the interference audio (for example accompaniment audio) in the benchmark audio is weakened according to the relative strength of the target benchmark audio and the interference audio, and the target benchmark audio used for comparison (for example original-singer audio) is enhanced, so that the similarity between the benchmark audio and the audio to be detected can be detected accurately.

In step S104, the similarity distance between the characteristic sequence of the audio to be detected and the reference characteristic sequence of the benchmark audio is obtained.

The similarity distance may include at least an edit distance, a Euclidean distance and a Hamming distance. The edit distance may serve as the main component for measuring similarity; the Euclidean distance may be used to measure the difference between the coded sequences and to penalize the similarity result; the Hamming distance may be used to measure the absolute consistency of the coded sequences and to give positive feedback to the similarity result. These are described in more detail below.

In some embodiments, obtaining the similarity distance between the characteristic sequence of the audio to be detected and the reference characteristic sequence of the benchmark audio may include: encoding the characteristic sequence of the audio to be detected according to a preset coding strategy to obtain a first coded characteristic sequence, and encoding the reference characteristic sequence of the benchmark audio according to the preset coding strategy to obtain a second coded characteristic sequence; and determining the similarity distance between the first coded characteristic sequence and the second coded characteristic sequence.

To improve the accuracy and stability of the similarity determination, the characteristic sequence of the audio to be detected and the reference characteristic sequence of the benchmark audio may be encoded, and the similarity distance is determined from the coded characteristic sequences. The preset coding strategy may be set flexibly according to actual needs; for example, it may include differential Manchester encoding, non-return-to-zero inverted encoding (NRZI, Non-Return-to-Zero Inverted), Manchester encoding, extended Manchester encoding and the like.

In some embodiments, encoding the characteristic sequence of the audio to be detected according to the preset coding strategy to obtain the first coded characteristic sequence may include: comparing, according to the preset coding strategy, every two adjacent characteristic values in the characteristic sequence of the audio to be detected; when the former of two adjacent characteristic values is less than the latter, encoding the corresponding position of the characteristic sequence of the audio to be detected as a first encoded value; when the former is equal to the latter, encoding it as a second encoded value; when the former is greater than the latter, encoding it as a third encoded value; and generating the first coded characteristic sequence from the first encoded value, the second encoded value and/or the third encoded value.

Taking extended Manchester encoding as the preset coding strategy, the coding rule may be as follows: if two adjacent characteristic values in the characteristic sequence change from low to high, the feature of the audio to be detected is encoded as the first encoded value, for example "1"; if two adjacent characteristic values remain unchanged, the feature is encoded as the second encoded value, for example "0"; if two adjacent characteristic values change from high to low, the feature is encoded as the third encoded value, for example "-1".

For example, encoding may start from the first characteristic value in the characteristic sequence of the audio to be detected. The first characteristic value may first be encoded as 0 and then compared with the second characteristic value; alternatively, the first characteristic value may be left unencoded and compared directly with the second characteristic value. When the first characteristic value is less than the second, the code is "1"; when the first is equal to the second, the code is "0"; and when the first is greater than the second, the code is "-1". The second characteristic value is then compared with the third, and so on, until every two adjacent characteristic values in the characteristic sequence of the audio to be detected have been compared, giving the first coded characteristic sequence corresponding to the audio to be detected. The first coded characteristic sequence consists of the values -1, 0 and 1, and can be used to characterize how the frequency features of the audio to be detected rise and fall over time.
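The coding rule described above can be sketched in a few lines. The function name `encode_trend` and the `pad_first` flag (which selects between encoding the first value as 0 and leaving it unencoded) are hypothetical, and the sketch assumes one scalar characteristic value per position.

```python
def encode_trend(values, pad_first=True):
    """Encode each adjacent pair as 1 (rising), 0 (unchanged) or -1 (falling)."""
    # Optionally encode the first characteristic value as 0.
    codes = [0] if pad_first and values else []
    for prev, cur in zip(values, values[1:]):
        if prev < cur:
            codes.append(1)
        elif prev == cur:
            codes.append(0)
        else:
            codes.append(-1)
    return codes


print(encode_trend([220.0, 247.0, 247.0, 196.0]))         # [0, 1, 0, -1]
print(encode_trend([220.0, 247.0, 247.0, 196.0], False))  # [1, 0, -1]
```

Because only the direction of change is kept, transposing every value by a constant offset yields the same code, which is consistent with the stated aim of reducing the influence of individual pitch differences.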

Likewise, the reference characteristic sequence of the benchmark audio may be encoded according to the coding rule of the extended Manchester encoding. In some embodiments, encoding the reference characteristic sequence of the benchmark audio according to the preset coding strategy to obtain the second coded characteristic sequence may include: comparing, according to the preset coding strategy, every two adjacent characteristic values in the characteristic sequence of the benchmark audio; when the former of two adjacent characteristic values is less than the latter, encoding the corresponding position of the characteristic sequence of the benchmark audio as the first encoded value; when the former is equal to the latter, encoding it as the second encoded value; when the former is greater than the latter, encoding it as the third encoded value; and generating the second coded characteristic sequence from the first encoded value, the second encoded value and/or the third encoded value.

The audio to be detected and the benchmark audio are easily affected by individual and gender differences. For example, female voices have a higher frequency than male voices, and different people differ in fundamental frequency and in duration when pronouncing the same phoneme. Eliminating the influence of individual differences simply by setting thresholds and parameters is therefore easily affected by subjective factors and data scale, and is neither accurate nor stable enough. In the embodiment of the present invention, the characteristic sequence of the audio to be detected and the reference characteristic sequence of the benchmark audio are encoded using extended Manchester encoding, and the similarity between the audio to be detected and the benchmark audio is characterized by the similarity of the coded characteristic sequences, which eliminates the influence of interfering factors such as accompaniment audio and individual and gender differences on the accuracy of the similarity detection result.

In some embodiments, the similarity distance includes at least an edit distance, a Euclidean distance and a Hamming distance, and determining the similarity distance between the first coded characteristic sequence and the second coded characteristic sequence may include: determining at least the edit distance, the Euclidean distance and the Hamming distance between the first coded characteristic sequence and the second coded characteristic sequence; and normalizing the edit distance, the Euclidean distance and the Hamming distance respectively to obtain the similarity distance.

The edit distance refers to, for two coded characteristic sequences, the minimum number of edit operations required to convert one of them into the other. The larger the edit distance, the more the two coded characteristic sequences differ; conversely, the smaller the edit distance, the less they differ. The edit operations may include substituting one characteristic character for another, inserting a characteristic character, and deleting a characteristic character; a characteristic character may be the "1", "0" or "-1" obtained by encoding. Determining the edit distance between the first coded characteristic sequence and the second coded characteristic sequence means determining the minimum number of edit operations required to convert the first coded characteristic sequence into the second. The edit distance measures the overall similarity of the two coded characteristic sequences and handles well the alignment problem caused by differences in pronunciation duration and the like.
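A standard dynamic-programming computation of this edit distance (the Levenshtein distance, with unit cost for substitution, insertion and deletion) is sketched below. The unit costs and the function name are assumptions; the text does not state the cost of each operation.

```python
def edit_distance(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]


print(edit_distance([1, 0, -1], [1, -1]))  # 1
```

Keeping only two rows of the table limits memory use to the length of the shorter dimension, which matters little for short coded sequences but is a common idiom.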

The Euclidean distance refers to the straight-line distance in Euclidean space between the two points represented by the first coded characteristic sequence and the second coded characteristic sequence. In the embodiment of the present invention, the Euclidean distance is used to measure the degree of difference between the first coded characteristic sequence and the second coded characteristic sequence. For example, the second coded characteristic sequence corresponding to the benchmark audio (for example original-singer audio) may be denoted (x1, x2, ..., xn), and the first coded characteristic sequence corresponding to the audio to be detected (for example user audio) may be denoted (y1, y2, ..., yn), where n is the length of the longer of the two coded characteristic sequences. The value of n may be set flexibly according to actual needs; for example, a sequence shorter than n may be zero-padded. The Euclidean distance d₂ between the first coded characteristic sequence and the second coded characteristic sequence may be calculated as follows:

d₂ = sqrt((x1 - y1)² + (x2 - y2)² + ... + (xn - yn)²) (6)

The Hamming distance refers to the number of positions at which the first coded characteristic sequence and the second coded characteristic sequence have different characteristic characters, i.e. the number of substitutions required to transform the first coded characteristic sequence into the second. The Hamming distance can be used to measure the absolute consistency of the two sequences at corresponding positions.
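The Euclidean and Hamming distances between two coded sequences can be sketched as follows. Zero-padding the shorter sequence is taken from the text for the Euclidean distance; applying the same padding before the Hamming distance is an assumption, as are the function names.

```python
import math


def pad_to_same_length(x, y):
    """Zero-pad the shorter sequence so both have the length of the longer one."""
    n = max(len(x), len(y))
    return list(x) + [0] * (n - len(x)), list(y) + [0] * (n - len(y))


def euclidean_distance(x, y):
    """Straight-line distance between the two zero-padded sequences."""
    x, y = pad_to_same_length(x, y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def hamming_distance(x, y):
    """Number of positions at which the two zero-padded sequences differ."""
    x, y = pad_to_same_length(x, y)
    return sum(1 for a, b in zip(x, y) if a != b)


x = [1, 0, -1, 1]
y = [1, -1, -1]
print(euclidean_distance(x, y), hamming_distance(x, y))
```

For codes restricted to -1, 0 and 1, each differing position contributes 1 or 4 to the squared Euclidean distance, so the Euclidean distance weights a reversal of direction more heavily than the Hamming distance does.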

After the edit distance d₁, the Euclidean distance d₂ and the Hamming distance d₃ are obtained, they may each be normalized. Because d₁, d₂ and d₃ may be large, they are normalized to facilitate the subsequent determination of the audio similarity; normalization here means mapping the edit distance, the Euclidean distance, the Hamming distance and the like into the range 0 to 1. For example, according to formula (7), the edit distance d₁ is normalized to obtain the normalized edit distance D₁, the Euclidean distance d₂ is normalized to obtain the normalized Euclidean distance D₂, and the Hamming distance d₃ is normalized to obtain the normalized Hamming distance D₃. The normalized edit distance D₁, the normalized Euclidean distance D₂ and the normalized Hamming distance D₃ constitute the similarity distance.

In step S105, the similarity between the audio to be detected and the benchmark audio is determined according to the similarity distance.

In some embodiments, determining the similarity between the audio to be detected and the benchmark audio according to the similarity distance may include: constructing, for each of the edit distance, the Euclidean distance and the Hamming distance, an affine function between that distance and a sub-similarity; determining the sub-similarity corresponding to each distance according to the affine function corresponding to that distance; and determining the similarity between the audio to be detected and the benchmark audio according to the sub-similarities.

Here, establishing the affine function of the similarity with respect to the similarity distance means taking the normalized edit distance, Euclidean distance and Hamming distance as independent variables and the similarity as the dependent variable, and establishing the mapping between the independent variables and the dependent variable. Using the affine functions, sub-similarities normalized into the range 0 to 100 can be determined from the normalized edit distance, Euclidean distance and Hamming distance.

An affine function between each of the edit distance, the Euclidean distance and the Hamming distance and its sub-similarity is constructed. The first affine function between the sub-similarity and the edit distance D₁ is F(D₁), whose expression is given by formula (8); the second affine function between the sub-similarity and the Euclidean distance D₂ is F(D₂), whose expression is given by formula (10); the third affine function between the sub-similarity and the Hamming distance D₃ is F(D₃), whose expression is given by formula (12).

Here, the values of n₁ to n₈ and n₁₀ to n₄₄ in formula (8) can be set flexibly according to actual needs. For example, after n₁ to n₈ and n₁₀ to n₄₄ take their corresponding values, the first affine function F(D₁) is obtained as shown in formula (9).

Here, the values of c₁ to c₄ in formula (10) can be set flexibly according to actual needs. For example, after c₁ to c₄ take their corresponding values, the second affine function F(D₂) is obtained as shown in formula (11).

Here, the values of m₁ to m₆ and m₁₀ to m₃₆ in formula (12) can be set flexibly according to actual needs. For example, after m₁ to m₆ and m₁₀ to m₃₆ take their corresponding values, the third affine function F(D₃) is obtained as shown in formula (13).

After the first affine function F(D₁) corresponding to the edit distance D₁, the second affine function F(D₂) corresponding to the Euclidean distance D₂, and the third affine function F(D₃) corresponding to the Hamming distance D₃ are obtained, the first sub-similarity corresponding to the edit distance D₁ can be determined according to F(D₁), the second sub-similarity corresponding to the Euclidean distance D₂ can be determined according to F(D₂), and the third sub-similarity corresponding to the Hamming distance D₃ can be determined according to F(D₃). The similarity between the audio to be detected and the benchmark audio can then be determined according to the first, second, and third sub-similarities.
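
Formulas (8) to (13) are not reproduced in this text, so the sketch below only illustrates the general idea of mapping a normalized distance in 0 to 1 onto a sub-similarity in 0 to 100 with an affine function. The single-segment form, the default `slope` and `intercept`, and the name `affine_sub_similarity` are assumptions made for illustration; the coefficients n, c, and m of the patent's own functions are not known here.

```python
def affine_sub_similarity(distance, slope=-100.0, intercept=100.0):
    """Map a normalized distance in [0, 1] to a sub-similarity in [0, 100].

    ASSUMPTION: one affine segment F(D) = slope * D + intercept, clamped to
    [0, 100]. The patent's formulas (8)-(13) may use different coefficients.
    """
    score = slope * distance + intercept
    return min(100.0, max(0.0, score))
```

With the default coefficients, a distance of 0 gives a sub-similarity of 100 and a distance of 1 gives 0; a different slope and intercept could be chosen for each of the three distances.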

It should be noted that, when determining sequence similarity, in addition to the edit distance, the Euclidean distance, and the Hamming distance, alignment algorithms such as dynamic time warping or the longest common substring can also be used to determine the similarity between the audio to be detected and the benchmark audio.
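
As an illustration of the dynamic time warping alternative mentioned above, the following is a minimal textbook sketch. The function name `dtw_distance` and the use of the absolute difference as the local cost are assumptions; the patent does not specify how such an alignment would be configured.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping cost between two numeric sequences.

    ASSUMPTION: local cost is the absolute difference of the aligned values.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

Because one element may be aligned with several consecutive elements of the other sequence, a held note that lasts longer in one recording does not by itself increase the cost.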

In some embodiments, determining the similarity between the audio to be detected and the benchmark audio according to the sub-similarities may include: setting a first weight value for the sub-similarity of the edit distance, and setting a second weight value for the sub-similarity of the Hamming distance; setting the sub-similarity of the Euclidean distance as a penalty term; and determining the similarity between the audio to be detected and the benchmark audio according to the first weight value, the second weight value, and the penalty term.

For example, because the edit distance is robust to differences in pronunciation length, pauses, and the like, and has strong anti-interference capability, it can be used as the most important component in determining the similarity. Because the Hamming distance measures the absolute consistency of characteristic sequences, it can be used as an auxiliary component in determining the similarity. Because the Euclidean distance measures the geometric distance between characteristic sequences and highlights their differences, it can be used as the penalty term in determining the similarity. Accordingly, a first weight value can be set for the sub-similarity of the edit distance, a second weight value can be set for the sub-similarity of the Hamming distance, and the sub-similarity of the Euclidean distance can be set as the penalty term, where the first and second weight values can be set flexibly according to actual needs. The similarity between the audio to be detected and the benchmark audio is then determined according to the first weight value, the second weight value, and the penalty term, and the calculation formula can be as follows:

Here, SimilarityDegree denotes the similarity, and N denotes the number of feature dimensions contained in the characteristic sequence. For example, N can be 6, in which case the similarity corresponding to each of the 6 dimensions of encoded characteristic sequences is determined separately and the results are averaged to obtain the similarity detection result between the audio to be detected and the benchmark audio. R₁ denotes the first weight value and R₂ denotes the second weight value; their values can be set flexibly according to actual needs. For example, R₁ can be 0.7 and R₂ can be 0.3, in which case the calculation formula of the similarity can be as follows:

Similarity determination can refer to combining the edit distance, the Euclidean distance, and the Hamming distance in a single similarity formula and determining, from these three distance values, a similarity normalized to the range 0 to 100.
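
The similarity formula itself is not reproduced in this text. The sketch below only illustrates the structure described above: a weighted sum of the edit-distance and Hamming-distance sub-similarities, reduced by a penalty derived from the Euclidean-distance sub-similarity, then averaged over the N feature dimensions. The form of the penalty, the parameter `penalty_scale`, and the function name `combine_similarity` are assumptions and should not be read as the patent's formula.

```python
def combine_similarity(sub_edit, sub_euclid, sub_hamming,
                       r1=0.7, r2=0.3, penalty_scale=0.1):
    """Average per-dimension scores built from three sub-similarities (0-100).

    ASSUMPTION: score = r1 * F(D1) + r2 * F(D3) - penalty, where
    penalty = penalty_scale * (100 - F(D2)). The patent's formula may differ.
    """
    if not sub_edit or not (len(sub_edit) == len(sub_euclid) == len(sub_hamming)):
        raise ValueError("need equally many sub-similarities per distance")
    total = 0.0
    for f1, f2, f3 in zip(sub_edit, sub_euclid, sub_hamming):
        weighted = r1 * f1 + r2 * f3
        penalty = penalty_scale * (100.0 - f2)  # hypothetical penalty form
        total += min(100.0, max(0.0, weighted - penalty))
    return total / len(sub_edit)
```

Passing six values per list corresponds to the N = 6 case, with R₁ = 0.7 and R₂ = 0.3 as the default weights.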

In some embodiments, after the step of determining the similarity between the audio to be detected and the benchmark audio according to the similarity distance, the audio similarity detection method may further include: when the similarity between the audio to be detected and the benchmark audio is greater than a preset similarity threshold, performing a virtual resource transfer operation and/or displaying information related to the similarity detection result of the audio to be detected.

For example, taking a K-song red packet as an example, the song-scoring application scenario mainly involves playing the original singer's audio, the user singing, detecting the similarity between the original singer's audio and the user audio, grading the similarity, and receiving the red packet. Specifically, a user first selects a segment of original singer's audio as the carrier of the red packet. After a user clicks the red packet, the user can click an "audition" button to generate a play instruction; the audio similarity detection device plays the original singer's audio based on the play instruction, and the user can listen to it. Alternatively, the user can directly click a "start singing" button to generate an acquisition instruction; the user then sings along with the accompaniment in imitation of the original singer, and the audio similarity detection device collects the user audio based on the acquisition instruction. The collected user audio is then taken as the audio to be detected and the original singer's audio as the benchmark audio, and the user audio and the original singer's audio are each processed in turn by preprocessing, spectral feature extraction, weakening of the accompaniment audio in the original singer's audio and the user audio, feature filtering and screening, extended Manchester encoding, similarity distance measurement, establishment of affine functions of the similarity with respect to the distance measures, and similarity determination, to obtain the similarity between the original singer's audio and the user audio. When the similarity is greater than the preset similarity threshold, the user can receive the red packet, which triggers the audio similarity detection device to perform the virtual resource transfer operation (that is, the user receiving the red packet corresponds to the audio similarity detection device performing the virtual resource transfer operation), and the audio similarity detection device can display information related to the similarity detection result, such as the red packet amount and the user's song. When the similarity is less than or equal to the preset similarity threshold, the user cannot receive the red packet; in this case, the user can be shown information related to the similarity detection result, the red packet interface can be exited, and the user audio can be converted into a voice message with a rating, the content of which can be the audio that the user sang along with the accompaniment; and so on.

In some embodiments, after the step of determining the similarity between the audio to be detected and the benchmark audio according to the similarity distance, the audio similarity detection method may further include: when the similarity between the audio to be detected and the benchmark audio is greater than a preset similarity threshold, performing an operation of unlocking an audio lock.

For example, in the application scenario of a sound lock, pre-recorded benchmark audio can be obtained and used as the sound lock. When the audio similarity detection device is not in use, it is in a locked state. When it needs to be unlocked, the user can imitate the benchmark audio to generate the audio to be detected; the audio to be detected is then processed in turn by preprocessing, spectral feature extraction, feature filtering and screening, extended Manchester encoding, similarity distance measurement, establishment of affine functions of the similarity with respect to the distance measures, and similarity determination, to obtain the similarity between the audio to be detected and the benchmark audio. When the similarity is greater than the preset similarity threshold, the operation of unlocking the audio lock is performed. When the similarity is less than or equal to the preset similarity threshold, the lock is not released, and prompt information such as an unlock failure notice and the similarity between the audio to be detected and the benchmark audio can also be displayed.

For example, a terminal such as a mobile phone, smart watch, smart television, or computer (that is, the audio similarity detection device) is in a screen-locked state when not in use. When it needs to be unlocked, the user can imitate the benchmark audio in front of the terminal, and the terminal collects the audio to be detected; when the similarity between the audio to be detected and the benchmark audio is greater than the preset similarity threshold, the terminal can perform an unlock operation, open the terminal, and enter the display interface. Alternatively, when the terminal is already open and application A needs to be opened, the user can imitate the benchmark audio in front of the terminal, and the terminal collects the audio to be detected; when the similarity between the audio to be detected and the benchmark audio is greater than the preset similarity threshold, the terminal can perform the operation of opening application A. Alternatively, when the audio similarity detection device is an access control system and the access control needs to be unlocked, the user can imitate the benchmark audio in front of the access control, which collects the audio to be detected; when the similarity between the audio to be detected and the benchmark audio is greater than the preset similarity threshold, the access control can be opened; and so on.

In the embodiment of the present invention, the similarity between the audio to be detected and the benchmark audio can be detected stably and accurately. The similarity detection result is less affected by interfering factors such as accompaniment audio, environmental noise, and individual and gender differences; that is, the influence of accompaniment audio, environmental noise, and individual and gender differences on the similarity result is overcome. This solves the problem that a user could obtain a high similarity merely by playing the accompaniment or the original singer's recording. The method can be used to detect the similarity between the audio to be detected and the benchmark audio whether or not the singing is a cappella, with good stability and higher accuracy of the similarity detection result.

As can be seen from the above, the embodiment of the present invention can obtain the audio to be detected, screen out from it the audio that meets a preset condition, and obtain the characteristic sequence of the audio to be detected according to the screened-out audio, so that interfering audio in the audio to be detected is filtered out and the required audio features are retained. It also obtains the reference characteristic sequence of the benchmark audio. It then obtains the similarity distance between the characteristic sequence of the audio to be detected and the reference characteristic sequence of the benchmark audio, such as the edit distance, the Euclidean distance, and the Hamming distance; this similarity distance can reduce the influence of various factors on the similarity detection result. The similarity between the audio to be detected and the benchmark audio can then be determined according to the similarity distance, improving the accuracy of audio similarity detection.

The method described in the above embodiments is described in further detail below by way of example.

In this embodiment, the audio similarity detection device is a terminal. The terminal can obtain benchmark audio that includes the original singer's audio and accompaniment audio, and obtain the audio to be detected recorded by a user. It then performs the following on the benchmark audio and the audio to be detected in turn: S1 preprocessing, S2 spectral feature extraction, S3 weakening of the accompaniment audio in the original singer's audio and the user audio, S4 feature filtering and screening, S5 extended Manchester encoding, S6 similarity distance measurement, S7 establishment of affine functions of the similarity with respect to the distance measures, and S8 similarity calculation, to obtain the similarity between the benchmark audio and the audio to be detected, as shown in Fig. 3. Next, it judges whether the similarity is greater than the preset similarity threshold; when the similarity is greater than the preset similarity threshold, a virtual resource transfer operation can be performed and information related to the similarity detection result can be displayed.

Referring to Fig. 4, Fig. 4 is a flow diagram of the audio similarity detection method provided in an embodiment of the present invention. The method flow may include:

S201, the terminal obtains the audio to be detected and successively performs sampling, framing, and windowing preprocessing on it to obtain the preprocessed audio.

The terminal can obtain the audio of a song recorded by a user as the audio to be detected. For example, as shown in Fig. 5, user A selects a segment of original singer's audio as the carrier of a red packet, such as the K-song red packet of XXX. After a user clicks the red packet, the user can click the "audition" button to listen to the original singer's audio: activating the "audition" button generates a play instruction, and the terminal plays the original singer's audio based on that instruction, while the display interface can show the audition progress, the lyrics, and the like. Alternatively, the user can directly click the "start singing" button to generate an acquisition instruction; the user then sings along with the accompaniment in imitation of the original singer's audio, and the terminal collects the user audio based on the acquisition instruction to obtain the audio to be detected.

To facilitate screening of the audio to be detected, the audio to be detected can be preprocessed as follows. The audio to be detected is sampled using a sampling strategy that satisfies the Nyquist sampling theorem, with a sampling frequency of 44100 Hz or another sampling frequency, to obtain the sampled audio. The sampled audio is then divided into frames using a frame length of, for example, 512 or 1024 sampling points and a frame shift of, for example, one half or one third of the frame length, to obtain the framed audio. Finally, the framed audio is windowed using a Hamming window function, a rectangular window function, a Hanning window function, or the like, to obtain the preprocessed audio in the discrete time domain.
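
The framing and windowing steps above can be sketched as follows, using only the Python standard library. The helper names `hamming_window` and `frame_and_window` are illustrative, and dropping a trailing partial frame is an assumption of this sketch; sampling is assumed to have been done already.

```python
import math


def hamming_window(length):
    """Return the standard Hamming window coefficients for one frame."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]


def frame_and_window(samples, frame_length=1024, hop=None):
    """Split samples into overlapping frames and apply a Hamming window.

    ASSUMPTION: the frame shift defaults to half the frame length, and a
    trailing partial frame is dropped.
    """
    hop = frame_length // 2 if hop is None else hop
    window = hamming_window(frame_length)
    frames = []
    for start in range(0, len(samples) - frame_length + 1, hop):
        frame = samples[start:start + frame_length]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames
```

For example, `frame_and_window(samples, frame_length=512, hop=512 // 3)` would use a 512-point frame with a frame shift of about one third of the frame length.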

For example, as shown in Fig. 6(a) to Fig. 6(d), the benchmark audio may include the original singer's audio and accompaniment audio, and the audio to be detected can be male user audio or female user audio. Fig. 6(a) can be the initial time-domain sample plot obtained after preprocessing the original singer's audio, Fig. 6(b) can be the initial time-domain sample plot obtained after preprocessing the accompaniment audio, Fig. 6(c) can be the initial time-domain sample plot obtained after preprocessing the male user audio, and Fig. 6(d) can be the initial time-domain sample plot obtained after preprocessing the female user audio.

S202, the terminal performs a Fourier transform on the preprocessed audio to obtain a frequency spectrum, and determines the energy spectrum of the preprocessed audio according to the frequency spectrum.

The terminal can perform a short-time Fourier transform of, for example, 2048 points or 1024 points on the preprocessed audio to obtain the frequency spectrum corresponding to each frame of the preprocessed audio, from which a spectral feature map can be generated. It then takes the squared modulus of the frequency spectrum of the preprocessed audio to obtain the corresponding energy spectrum, which can be a matrix composed of the energy of each audio frame distributed over each frequency.
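
A minimal sketch of the energy spectrum computation is given below. For readability it uses a direct discrete Fourier transform over already windowed frames rather than a fast transform, so it is only practical for short frames; a real implementation would normally use an FFT routine. The function name `energy_spectrum` is illustrative.

```python
import cmath


def energy_spectrum(frames):
    """Return |DFT|^2 for the non-negative frequency bins of each frame.

    The result is a matrix: one row per frame, one column per frequency bin.
    ASSUMPTION: a direct O(n^2) DFT is used purely for illustration.
    """
    spectra = []
    for frame in frames:
        n = len(frame)
        row = []
        for k in range(n // 2 + 1):
            coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                        for t, x in enumerate(frame))
            row.append(abs(coeff) ** 2)  # squared modulus of the spectrum
        spectra.append(row)
    return spectra
```

Each row holds the energy of one frame over its frequency bins, which matches the frame-by-frequency matrix described above.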

For example, as shown in Fig. 7(a) to Fig. 7(d), the benchmark audio may include the original singer's audio and accompaniment audio, and the audio to be detected can be male user audio or female user audio. Fig. 7(a) can be the spectral feature map obtained after performing a Fourier transform on the original singer's audio, Fig. 7(b) can be the spectral feature map obtained after performing a Fourier transform on the accompaniment audio, Fig. 7(c) can be the spectral feature map obtained after performing a Fourier transform on the male user audio, and Fig. 7(d) can be the spectral feature map obtained after performing a Fourier transform on the female user audio.

For example, as shown in Fig. 8, taking the case where the preprocessed audio is the user audio, the terminal can apply a 2048-point short-time Fourier transform to the user audio and then extract the energy spectrum of the user audio, so that feature filtering and screening can subsequently be performed based on the energy spectrum.

S203, the terminal obtains the sound intensity of the audio to be detected according to the energy spectrum, screens out from the audio to be detected the audio whose sound intensity is greater than a preset threshold, and obtains the characteristic sequence of the audio to be detected according to the screened-out audio.

To filter out interfering audio with low sound intensity, the terminal can, based on the energy spectrum of the preprocessed audio, screen out from the audio to be detected the audio whose sound intensity meets the preset condition. For example, as shown in Fig. 9, the terminal can first standardize the feature matrix S of the energy spectrum into a sound intensity matrix P, and then judge whether each sound intensity in P is greater than a preset threshold. Sound intensities less than or equal to the preset threshold are set to zero, and sound intensities greater than the preset threshold are screened out (that is, extracted). Next, the sound intensities greater than the preset threshold are sorted from largest to smallest. Finally, the frequency sequences of the 6 largest sound intensities are screened out from the sorted sound intensity matrix to obtain the characteristic sequence of the audio to be detected.

Specifically, the terminal can convert the energy spectrum matrix S into the sound intensity matrix P according to the above formula (1). The audio whose sound intensity is greater than the preset threshold can then be screened out from the audio to be detected, so that interfering audio with low sound intensity is filtered out. The preset threshold can be set flexibly according to actual needs, and its specific value is not limited here.

Optionally, the terminal can standardize the sound intensity of the audio to be detected to a preset sound intensity range to obtain sound-intensity-standardized audio, and screen out from it the audio whose sound intensity is greater than the preset threshold, obtaining the audio whose sound intensity meets the preset condition.

For example, the terminal can standardize the sound intensity P of the audio to be detected to 0 to 80 dB according to the above formula (2), which matches the range of human auditory perception. A preset threshold of sound intensity can then be set: sound intensities in the standardized audio that are below the preset threshold are set to zero, and those above the preset threshold are screened out, giving the audio whose sound intensity meets the preset condition. Because the accompaniment, background sound, and the like in the audio to be detected are all interfering audio, setting this preset threshold allows interfering audio to be filtered reasonably.
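
Formulas (1) and (2) are not reproduced in this text. The sketch below assumes a common decibel conversion relative to the loudest bin, shifted so that the peak sits at 80 dB and anything more than 80 dB below the peak is clipped to 0. The function names `to_intensity_db` and `apply_threshold`, the `floor` guard, and this particular conversion are assumptions, not the patent's formulas.

```python
import math


def to_intensity_db(energy, top_db=80.0, floor=1e-10):
    """Convert an energy matrix to sound intensity in the range [0, top_db].

    ASSUMPTION: P = top_db + 10 * log10(S / max(S)), clipped below at 0 dB.
    """
    peak = max(max(max(row) for row in energy), floor)
    return [
        [max(0.0, top_db + 10.0 * math.log10(max(v, floor) / peak))
         for v in row]
        for row in energy
    ]


def apply_threshold(intensity, threshold):
    """Zero every sound intensity that is not above the preset threshold."""
    return [[v if v > threshold else 0.0 for v in row] for row in intensity]
```

With these two helpers, entries that remain non-zero after `apply_threshold` correspond to the audio whose sound intensity meets the preset condition.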

At this point, the terminal can sort the screened-out audio by sound intensity from largest to smallest to obtain the sorted audio, extract from the sorted audio the preset audio with the largest sound intensity, and extract frequency sequences of a preset number of dimensions from the frequency matrix of the preset audio, to obtain the characteristic sequence of the audio to be detected. For example, the frequency sequences corresponding to the six largest sound intensities in each audio frame are extracted; these frequency sequences are the finally obtained characteristic sequence of the audio to be detected.
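
The sorting and selection step can be sketched as follows. For each frame, the bins are ordered by sound intensity and the frequencies of the strongest k bins are kept; the result is then rearranged so that each of the k dimensions is a sequence over time. The parameter `bin_freqs` (the frequency of each bin), the name `top_k_frequency_sequences`, and the choice to emit 0.0 for bins that were zeroed by the threshold are assumptions of this sketch.

```python
def top_k_frequency_sequences(intensity, bin_freqs, k=6):
    """Return k frequency sequences, one per dimension, strongest first.

    intensity: one row per frame, one thresholded sound intensity per bin.
    bin_freqs: hypothetical list giving the frequency of each bin.
    """
    per_frame = []
    for row in intensity:
        # indices of the k strongest bins in this frame, largest first
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        # ASSUMPTION: a bin zeroed by the threshold contributes frequency 0.0
        per_frame.append([bin_freqs[i] if row[i] > 0.0 else 0.0 for i in order])
    # transpose: dimension d holds the d-th strongest frequency of every frame
    return [list(dim) for dim in zip(*per_frame)]
```

With k = 6, the first returned sequence tracks the strongest frequency in every frame, the second tracks the next strongest, and so on.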

Because the audio to be detected itself has pauses and variations in strength, its features in the time domain and frequency domain correspondingly differ in length or magnitude. In the embodiment of the present invention, by preprocessing the audio to be detected, the spectral features can be filtered and sorted by energy, and the 6 dimensions of features with the largest energy can be screened out, thereby reducing the error produced in the subsequent similarity determination.

S204, the terminal separately obtains the root-mean-square energy means of the original singer's audio and the accompaniment audio in the benchmark audio, as well as the energy spectra of the original singer's audio and the accompaniment audio.

The benchmark audio may include the original singer's audio and accompaniment audio, and can be obtained from a server or be a pre-recorded song. The terminal can separately obtain the energy spectra of the original singer's audio and the accompaniment audio in the benchmark audio. Optionally, the terminal can preprocess the benchmark audio to obtain the preprocessed benchmark audio, including: sampling the benchmark audio according to a preset sampling strategy to obtain the sampled benchmark audio; framing the sampled benchmark audio according to a preset framing strategy to obtain the framed benchmark audio; and windowing the framed benchmark audio to obtain the preprocessed benchmark audio in the discrete time domain. The terminal can then obtain the energy spectrum of the preprocessed benchmark audio, including: performing a Fourier transform on the preprocessed benchmark audio to obtain its corresponding frequency spectrum; and determining the energy spectrum of the preprocessed benchmark audio according to the frequency spectrum.

The terminal separately obtaining the root-mean-square energy means of the original singer's audio and the accompaniment audio in the benchmark audio may include: determining a first root-mean-square energy of the original singer's audio and a second root-mean-square energy of the accompaniment audio (for example, the first root-mean-square energy mean of the original singer's audio and the second root-mean-square energy mean of the accompaniment audio can be determined according to the above formula (3)); obtaining a first frame count and a first frame length of the original singer's audio, and a second frame count and a second frame length of the accompaniment audio; and determining the root-mean-square energy mean of the original singer's audio according to the first root-mean-square energy, the first frame count, and the first frame length, and determining the root-mean-square energy mean of the accompaniment audio according to the second root-mean-square energy, the second frame count, and the second frame length.
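
Formula (3) is not reproduced in this text. The following sketch assumes the usual definition: the root-mean-square energy of a frame is the square root of the mean of its squared samples (which uses the frame length), and the root-mean-square energy mean is the average of that value over all frames (which uses the frame count). The function name `rms_energy_mean` is illustrative.

```python
import math


def rms_energy_mean(frames):
    """Average the per-frame root-mean-square energy over all frames.

    ASSUMPTION: RMS of a frame = sqrt(mean(x^2)); the patent's formula (3)
    may be defined differently.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    per_frame = [math.sqrt(sum(x * x for x in frame) / len(frame))
                 for frame in frames]
    return sum(per_frame) / len(per_frame)
```

The same function would be applied once to the frames of the original singer's audio and once to the frames of the accompaniment audio.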

For example, as shown in Fig. 8, the benchmark audio includes the original singer's audio and accompaniment audio. The terminal can apply a 2048-point short-time Fourier transform to the original singer's audio and to the accompaniment audio, and then extract the energy spectrum of each. Next, it determines the root-mean-square energy means of the original singer's audio and the accompaniment audio respectively, and determines the ratio between the root-mean-square energy mean of the original singer's audio and that of the accompaniment audio. Finally, the energy spectrum of the accompaniment audio scaled by this ratio can be subtracted from the energy spectrum of the original singer's audio, giving the benchmark audio with weakened accompaniment, on which feature filtering and screening can subsequently be performed to obtain the characteristic sequence.

S205, the terminal weakens the accompaniment audio based on the root-mean-square energy means and energy spectra of the original singer's audio and the accompaniment audio to obtain the optimized benchmark audio, and obtains the reference characteristic sequence of the optimized benchmark audio.

After obtaining the root-mean-square energy means and energy spectra of the original singer's audio and the accompaniment audio, the terminal can determine the ratio between the root-mean-square energy mean of the original singer's audio and the root-mean-square energy mean of the accompaniment audio, and then subtract the energy spectrum of the accompaniment audio scaled by this ratio from the energy spectrum of the original singer's audio, so as to optimize the benchmark audio and obtain the optimized benchmark audio. The optimized benchmark audio can be a feature matrix in which the accompaniment audio is weakened and the original singer's audio is enhanced.
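
The accompaniment weakening described above can be sketched as a scaled spectral subtraction. The function name `weaken_accompaniment` is illustrative, the two energy spectra are assumed to have the same shape, and clamping negative results to zero is an assumption of this sketch that the text does not state.

```python
def weaken_accompaniment(orig_energy, acc_energy, rms_orig, rms_acc):
    """Subtract the ratio-scaled accompaniment energy from the original energy.

    ratio = rms_orig / rms_acc, as described in the text above.
    ASSUMPTION: negative differences are clamped to 0.0.
    """
    if rms_acc <= 0.0:
        raise ValueError("accompaniment RMS energy mean must be positive")
    ratio = rms_orig / rms_acc
    return [
        [max(0.0, o - ratio * a) for o, a in zip(orig_row, acc_row)]
        for orig_row, acc_row in zip(orig_energy, acc_energy)
    ]
```

The returned matrix plays the role of the optimized benchmark audio, from which the reference characteristic sequence is then screened.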

For example, as shown in Fig. 10(a) to Fig. 10(d), the benchmark audio may include the original singer's audio and accompaniment audio, and the audio to be detected can be male user audio or female user audio. Fig. 10(a) can be the spectral feature map obtained after the original singer's audio undergoes accompaniment weakening and feature filtering, Fig. 10(b) can be the spectral feature map obtained after feature filtering of the accompaniment audio, Fig. 10(c) can be the spectral feature map obtained after feature filtering of the male user audio, and Fig. 10(d) can be the spectral feature map obtained after feature filtering of the female user audio.

The reference characteristic sequence of the optimized benchmark audio can then be obtained. For example, target audio whose sound intensity is greater than the preset threshold can be screened out from the optimized benchmark audio. Optionally, the sound intensity of the optimized benchmark audio can be standardized to a preset sound intensity range to obtain sound-intensity-standardized benchmark audio, and the target audio whose sound intensity is greater than the preset threshold can be screened out from the sound-intensity-standardized benchmark audio. The reference characteristic sequence of the benchmark audio is obtained according to the screened-out target audio. For example, the screened-out target audio can be sorted by sound intensity from largest to smallest to obtain the sorted target audio; the preset audio with the largest sound intensity is extracted from the sorted target audio, and frequency sequences of a preset number of dimensions are extracted from the frequency matrix of the preset audio, to obtain the reference characteristic sequence of the benchmark audio. In this way, according to the relative strength of the original singer's audio and the accompaniment audio, the accompaniment audio in the benchmark audio is weakened and the original singer's audio is enhanced, so that the similarity between the benchmark audio and the audio to be detected can subsequently be detected accurately.

For example, as shown in Fig. 11(a) to Fig. 11(c), the obtained reference characteristic sequence of the benchmark audio may include 6 dimensions of characteristic sequences of the original singer's audio, and the obtained characteristic sequence of the audio to be detected may include 6 dimensions of characteristic sequences of the male user audio or of the female user audio. Fig. 11(a) can be the first-dimension characteristic sequence of the original singer's audio, with the other 5 dimensions not shown; Fig. 11(b) can be the first-dimension characteristic sequence of the male user audio, with the other 5 dimensions not shown; and Fig. 11(c) can be the first-dimension characteristic sequence of the female user audio, with the other 5 dimensions not shown.

S206, the terminal encodes the characteristic sequence of the audio to be detected and the reference characteristic sequence of the benchmark audio using extended Manchester encoding to obtain the encoded characteristic sequences.

To improve the accuracy and stability of similarity determination, the characteristic sequence of the audio to be detected and the reference characteristic sequence of the benchmark audio can be encoded, for example according to the coding rule of extended Manchester encoding: if two adjacent characteristic values in the characteristic sequence change from low to high, the code is "1"; if two adjacent characteristic values remain unchanged, the code is "0"; and if two adjacent characteristic values change from high to low, the code is "-1".

For example, encoding can start from the characteristic value in the first position of the characteristic sequence of the audio to be detected. The characteristic value in the first position can first be encoded as 0, and then the characteristic value in the first position is compared with the characteristic value in the second position. Alternatively, the characteristic value in the first position is not encoded, and it is compared directly with the characteristic value in the second position. When the first characteristic value is less than the second, the code is "1"; when the first characteristic value is equal to the second, the code is "0"; and when the first characteristic value is greater than the second, the code is "-1". Next, the characteristic value in the second position is compared with the characteristic value in the third position, and so on, until every pair of adjacent characteristic values in the characteristic sequence of the audio to be detected has been compared, giving the encoded characteristic sequence corresponding to the audio to be detected.

Likewise, the terminal can encode the reference characteristic sequence of the benchmark audio according to the coding rule of the extended Manchester encoding to obtain the encoded characteristic sequence corresponding to the benchmark audio.
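
The coding rule described above can be sketched as follows. The function name `extended_manchester_encode` and the `keep_first` flag, which selects between encoding the first value as 0 and leaving it unencoded, are illustrative names for the two variants described in the text.

```python
def extended_manchester_encode(values, keep_first=True):
    """Encode a characteristic sequence by the direction of change.

    Rising adjacent values -> 1, unchanged -> 0, falling -> -1.
    keep_first=True prepends a 0 for the first characteristic value.
    """
    codes = [0] if keep_first and values else []
    for prev, cur in zip(values, values[1:]):
        if cur > prev:
            codes.append(1)
        elif cur < prev:
            codes.append(-1)
        else:
            codes.append(0)
    return codes
```

For instance, the frequency sequence 220, 440, 440, 330 is encoded as 0, 1, 0, -1. Because only the direction of change is kept, the same melody sung in a higher or lower register yields the same code.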

For example, as shown in Fig. 12(a) to Fig. 12(c), the encoded characteristic sequence of the benchmark audio may include 6 dimensions of coded sequences of the original singer's audio, and the encoded characteristic sequence of the audio to be detected may include 6 dimensions of coded sequences of the male user audio or of the female user audio. Fig. 12(a) can be the first-dimension coded sequence of the original singer's audio, with the other 5 dimensions not shown; Fig. 12(b) can be the first-dimension coded sequence of the male user audio, with the other 5 dimensions not shown; and Fig. 12(c) can be the first-dimension coded sequence of the female user audio, with the other 5 dimensions not shown.

The audio to be detected and the benchmark audio are easily affected by individual and gender differences. For example, a female voice has a higher frequency than a male voice, and different people produce the same phoneme at different fundamental frequencies and with different pronunciation lengths. Therefore, extended Manchester encoding is used to encode the characteristic sequence of the audio to be detected and the reference characteristic sequence of the benchmark audio, and the similarity between the audio to be detected and the benchmark audio is characterized by the similarity of the encoded characteristic sequences. This eliminates the influence of disturbing factors such as accompaniment audio and individual and gender differences on the accuracy of the similarity detection result.

S207, the terminal determines the editing distance, Euclidean distance and Hamming distance between the encoded characteristic sequence of the audio to be detected and the encoded characteristic sequence of the benchmark audio.

Here, the editing distance refers to the minimum number of edit operations required to convert the encoded characteristic sequence of the audio to be detected into the encoded characteristic sequence of the benchmark audio. The larger the editing distance, the more the two encoded characteristic sequences differ; conversely, the smaller the editing distance, the less they differ. The edit operations may include substituting one characteristic character for another, inserting a characteristic character, and deleting a characteristic character, where a characteristic character can be the "1", "0" or "-1" obtained by encoding. Determining the editing distance between the two encoded characteristic sequences therefore means determining this minimum number of edit operations. The editing distance measures the overall similarity between the encoded characteristic sequence of the audio to be detected and that of the benchmark audio, and reduces the influence on the similarity of alignment problems caused by, for example, different pronunciation lengths.
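
The editing distance defined above corresponds to the standard Levenshtein distance with unit cost for substitution, insertion and deletion. A minimal dynamic-programming sketch follows; the function name `edit_distance` is hypothetical, and the unit costs are an assumption, since this passage does not state per-operation costs.

```python
from typing import Sequence


def edit_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty sequence takes i deletions
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x with y (or keep)
        prev = curr
    return prev[-1]


# One extra "0" (a held note) in the first sequence costs a single deletion
print(edit_distance([1, 0, -1], [1, -1]))  # 1
```

The example shows why this distance tolerates differences in pronunciation length: one additional code in a sequence adds one edit operation and does not misalign every later position.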

The Euclidean distance refers to the straight-line distance in Euclidean space between the two points represented by the encoded characteristic sequence of the audio to be detected and the encoded characteristic sequence of the benchmark audio. It can be used to measure the degree of difference between the two encoded characteristic sequences, and it can be determined according to the above formula (6).
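
As an illustration, the sketch below computes the textbook Euclidean distance between two encoded sequences. Formula (6) is not reproduced in this passage, so the code follows the standard definition; the function name `euclidean_distance` is hypothetical, and requiring equal lengths is an assumption, because the handling of sequences of different lengths is not stated here.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two equal-length sequences seen as points."""
    if len(a) != len(b):
        # Assumption: the caller aligns or truncates the sequences beforehand.
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(euclidean_distance([1, 0, -1], [-1, 0, -1]))  # 2.0
```

Unlike the Hamming distance, this measure also reflects how far apart two differing codes are: a "1" against a "-1" contributes more than a "1" against a "0".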

The Hamming distance refers to the number of positions at which the encoded characteristic sequence of the audio to be detected and the encoded characteristic sequence of the benchmark audio hold different characteristic characters, that is, the number of substitutions required to transform the former into the latter. It can be used to measure the absolute, position-by-position consistency between the two encoded characteristic sequences.
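
A minimal sketch of the Hamming distance as defined above. The function name `hamming_distance` is hypothetical, and the equal-length requirement is an assumption of the sketch, since the Hamming distance is defined only for sequences of the same length.

```python
from typing import Sequence


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        # Assumption: the caller aligns or truncates the sequences beforehand.
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


print(hamming_distance([1, 0, -1, 1], [1, 1, -1, -1]))  # 2
```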

After the editing distance, Euclidean distance and Hamming distance are obtained, each of them can be normalized according to formula (7).

S208, the terminal determines the sub-similarity corresponding to each distance according to the affine functions between the editing distance, Euclidean distance and Hamming distance and the sub-similarities, and determines the similarity between the audio to be detected and the benchmark audio according to the sub-similarities.

For example, the terminal can construct an affine function between each of the editing distance, Euclidean distance and Hamming distance and its sub-similarity, determine the sub-similarity corresponding to each distance according to the corresponding affine function, and determine the similarity between the audio to be detected and the benchmark audio according to the sub-similarities.

Here, establishing an affine function of the similarity with respect to the similarity distance means taking the normalized editing distance, Euclidean distance and Hamming distance as independent variables and the similarity as the dependent variable, and establishing the mapping between them. The affine functions map the normalized editing distance, Euclidean distance and Hamming distance to sub-similarities normalized into the range 0 to 100.

For example, the first affine function F(D₁) between the sub-similarity and the editing distance D₁ can be as shown in the above formula (8), the second affine function F(D₂) between the sub-similarity and the Euclidean distance D₂ can be as shown in the above formula (10), and the third affine function F(D₃) between the sub-similarity and the Hamming distance D₃ can be as shown in the above formula (12). After these three affine functions are obtained, the first sub-similarity corresponding to the editing distance D₁ can be determined according to F(D₁), the second sub-similarity corresponding to the Euclidean distance D₂ can be determined according to F(D₂), and the third sub-similarity corresponding to the Hamming distance D₃ can be determined according to F(D₃). The similarity between the audio to be detected and the benchmark audio can then be determined from the first, second and third sub-similarities according to the above formula (14).
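
To illustrate the idea of an affine mapping from a normalized distance to a sub-similarity in the range 0 to 100, the sketch below uses F(D) = slope × D + intercept. The coefficients of formulas (8), (10) and (12) are not reproduced in this passage, so the default slope of -100 and intercept of 100, the clamping, and the function name `affine_sub_similarity` are all assumptions made for illustration only.

```python
def affine_sub_similarity(distance: float,
                          slope: float = -100.0,
                          intercept: float = 100.0) -> float:
    """Map a normalized distance to a sub-similarity with F(D) = slope * D + intercept."""
    score = slope * distance + intercept
    # Clamp so that distances outside [0, 1] still give a score in 0..100 (assumption).
    return max(0.0, min(100.0, score))


print(affine_sub_similarity(0.0))   # 100.0
print(affine_sub_similarity(0.25))  # 75.0
print(affine_sub_similarity(1.0))   # 0.0
```

With these assumed coefficients a distance of 0 gives the highest sub-similarity and a distance of 1 the lowest; each of the three distances could use its own slope and intercept.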

For example, since the editing distance can handle differences in pronunciation length, pauses and the like, the editing distance can serve as the most important component in determining the similarity; since the Hamming distance measures the absolute consistency of the characteristic sequences, the Hamming distance can serve as an auxiliary component; and since the Euclidean distance measures the geometric distance and difference between the characteristic sequences, the Euclidean distance can serve as a penalty term. In this case, a first weight value can be set for the sub-similarity of the editing distance, a second weight value can be set for the sub-similarity of the Hamming distance, and the sub-similarity of the Euclidean distance can be set as the penalty term. The similarity between the audio to be detected and the benchmark audio is then determined according to the first weight value, the second weight value and the penalty term.
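
One possible way to combine the three sub-similarities is sketched below. Formula (14) is not reproduced in this passage, so the weighted sum, the form of the penalty, the default values 0.75, 0.25 and 0.25, and the function name `combine_similarity` are all hypothetical; the sketch only shows the roles described above (main component, auxiliary component, penalty term).

```python
def combine_similarity(s_edit: float,
                       s_hamming: float,
                       s_euclid: float,
                       w_edit: float = 0.75,
                       w_hamming: float = 0.25,
                       penalty_scale: float = 0.25) -> float:
    """Weighted sum of two sub-similarities minus a penalty from the third."""
    base = w_edit * s_edit + w_hamming * s_hamming
    # Assumed penalty: grows as the Euclidean sub-similarity falls below 100.
    penalty = penalty_scale * (100.0 - s_euclid)
    return max(0.0, min(100.0, base - penalty))


print(combine_similarity(80.0, 60.0, 100.0))  # 75.0
print(combine_similarity(80.0, 60.0, 60.0))   # 65.0
```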

S209, when the similarity is greater than the preset similarity threshold, the terminal executes a virtual resource transfer operation and displays relevant information on the similarity detection result of the audio to be detected.

After the similarity between the audio to be detected and the benchmark audio is obtained, it can be judged whether the similarity is greater than the preset similarity threshold. When the similarity is greater than the preset similarity threshold, the user can receive the red packet, that is, the terminal is triggered to execute the virtual resource transfer operation; for example, as shown in Figure 13, the terminal can display relevant information on the similarity detection result, such as the red packet amount and the song sung by the user. When the similarity is less than or equal to the preset similarity threshold, the user cannot receive the red packet, and relevant information on the similarity detection result that prompts the user to sing again, such as "Not your best, try again...", is displayed, for example as shown in Figure 14. At this point, the red packet interface can be exited, and the user's audio can be converted into a voice message with a score; the content of the voice message can be the audio of the user singing along with the accompaniment, for example as shown in Figure 15.

In the embodiment of the present invention, the audio to be detected can be sampled, framed, windowed and have its energy spectrum extracted; audio whose intensity of sound is greater than a preset threshold is then selected from the processed audio, and the characteristic sequence of the audio to be detected is obtained according to the selected audio. The root mean square energy means and energy spectra of the original singer's audio and the accompaniment audio in the benchmark audio are obtained to optimize the benchmark audio, and the reference characteristic sequence of the optimized benchmark audio is obtained. The characteristic sequence of the audio to be detected and the reference characteristic sequence of the benchmark audio are then encoded, and similarity distances such as the editing distance, Euclidean distance and Hamming distance between the encoded characteristic sequence of the audio to be detected and the encoded characteristic sequence of the benchmark audio are determined. The similarity between the audio to be detected and the benchmark audio can then be determined according to the similarity distances, so that the similarity can be detected stably and accurately. The similarity detection result is less affected by disturbing factors such as accompaniment audio, environmental noise, and individual and gender differences, which improves the accuracy of audio similarity detection.

To better implement the audio similarity detection method provided in the embodiment of the present invention, the embodiment of the present invention also provides a device based on the above audio similarity detection method. The terms have the same meanings as in the above audio similarity detection method, and for specific implementation details reference can be made to the description in the method embodiments.

Referring to Figure 16, Figure 16 is a structural schematic diagram of the audio similarity detection device provided in the embodiment of the present invention. The audio similarity detection device may include an audio acquiring unit 301, a screening unit 302, a feature acquiring unit 303, a distance acquiring unit 304, a determination unit 305 and the like.

Wherein, audio acquiring unit 301, for obtaining audio to be detected.

In a song-scoring scenario, the audio acquiring unit 301 can obtain a song sung by the user as the audio to be detected; in a voice-lock scenario, it can obtain a passage of speech recorded by the user as the audio to be detected. For example, the audio acquiring unit 301 can capture the audio of the user speaking or singing in an audio data format with a sample rate of 16 kHz or another sample rate, and the audio to be detected that is obtained can be a continuous PCM signal of 16 bits or another bit depth.

Screening unit 302, for filtering out the audio for meeting preset condition from audio to be detected, and according to filtering out Audio obtains the characteristic sequence of audio to be detected.

In some embodiments, as shown in figure 17, screening unit 302 may include:

A processing subelement 3021, for pre-processing the audio to be detected to obtain pre-processed audio；

An obtaining subelement 3022, for obtaining the energy spectrum of the pre-processed audio；

A screening subelement 3023, for selecting, according to the energy spectrum, the audio that meets the preset condition from the pre-processed audio, and setting the frequency sequence corresponding to the selected audio as the characteristic sequence of the audio to be detected.

First, to make the audio to be detected easier to screen, the processing subelement 3021 can pre-process the audio to be detected. In some embodiments, the processing subelement 3021 is specifically used for: sampling the audio to be detected according to a preset sampling policy to obtain sampled audio; framing the sampled audio according to a preset framing strategy to obtain framed audio; and windowing the framed audio to obtain discrete pre-processed audio.

Specifically, the processing subelement 3021 can successively sample, frame and window the audio to be detected. For example, the audio to be detected can be sampled according to a preset sampling policy at a sample frequency of 44100 Hz or another sample frequency to obtain the sampled audio; the preset sampling policy can be a sampling policy that satisfies the Nyquist sampling theorem. Then, the sampled audio is framed according to a preset framing strategy, for example with a frame length of 512 or 1024 sampling points and a frame shift of one half or one third of the frame length, to obtain the framed audio. The framed audio can then be windowed with a Hamming window function, a rectangular window function, a Hanning window function or the like to obtain the discrete pre-processed audio.
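
The framing and windowing steps can be sketched as follows, using a frame length of 512 sampling points, a frame shift of half the frame length, and a Hamming window. The function names `hamming_window` and `frame_and_window` are hypothetical, and dropping the incomplete tail frame is an assumption of the sketch; sampling itself is assumed to have happened already, so `samples` is a list of sample values.

```python
import math
from typing import List, Sequence


def hamming_window(n: int) -> List[float]:
    """Symmetric Hamming window of length n."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def frame_and_window(samples: Sequence[float],
                     frame_len: int = 512,
                     hop: int = 256) -> List[List[float]]:
    """Split samples into overlapping frames and apply a Hamming window to each."""
    window = hamming_window(frame_len)
    frames: List[List[float]] = []
    # Assumption: an incomplete frame at the end of the signal is discarded.
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames


frames = frame_and_window([1.0] * 1024)
print(len(frames), len(frames[0]))  # 3 512
```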

Then, the obtaining subelement 3022 obtains the energy spectrum of the pre-processed audio. In some embodiments, the obtaining subelement 3022 is specifically used for: performing an integral transform on the pre-processed audio to obtain the frequency spectrum corresponding to the pre-processed audio; and determining the energy spectrum of the pre-processed audio according to the frequency spectrum.

For example, the obtaining subelement 3022 can perform a 2048-point, 1024-point or similar short-time integral transform on the pre-processed audio to obtain the frequency spectrum corresponding to each audio frame in the pre-processed audio, and then take the squared modulus of the frequency spectrum to obtain the energy spectrum corresponding to the pre-processed audio. The energy spectrum can be a matrix composed of the energy of each audio frame at each frequency.
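
The "spectrum, then squared modulus" step can be sketched for a single frame as below. The sketch assumes that the integral transform is a discrete Fourier transform and evaluates it directly, which is slow but needs no external library; the function name `energy_spectrum` is hypothetical, and a 16-point frame is used instead of 1024 or 2048 points to keep the example small.

```python
import cmath
import math
from typing import List, Sequence


def energy_spectrum(frame: Sequence[float]) -> List[float]:
    """Squared modulus of the DFT of one frame, for bins 0 .. n // 2."""
    n = len(frame)
    energies: List[float] = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(frame))
        energies.append(abs(coeff) ** 2)
    return energies


# A pure tone that completes 2 cycles in a 16-sample frame peaks at bin 2
tone = [math.cos(2 * math.pi * 2 * t / 16) for t in range(16)]
spec = energy_spectrum(tone)
print(spec.index(max(spec)))  # 2
```

Applying the function to every frame and stacking the results row by row gives the frame-by-frequency energy matrix described above.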

Next, to remove interference tones with a low intensity of sound, the audio whose intensity of sound meets the preset condition can be selected from the audio to be detected based on the energy spectrum of the pre-processed audio. In some embodiments, the screening subelement 3023 may include:

An obtaining module, for obtaining the intensity of sound of the audio to be detected according to the energy spectrum；

Screening module, the audio for being greater than preset threshold for filtering out intensity of sound from audio to be detected, obtains sound Intensity meets the audio of preset condition.

For example, the obtaining module can determine the intensity of sound of the audio to be detected according to the above formula (1). The screening module can then select from the audio to be detected the audio whose intensity of sound is greater than the preset threshold, obtaining the audio whose intensity of sound meets the preset condition, so that interference tones with a low intensity of sound are removed. The preset threshold can be set flexibly according to actual needs, and its specific value is not limited here.

In some embodiments, the screening module is specifically used for: normalizing the intensity of sound of the audio to be detected into a preset intensity of sound range to obtain intensity-normalized audio; and selecting from the intensity-normalized audio the audio whose intensity of sound is greater than the preset threshold, obtaining the audio whose intensity of sound meets the preset condition.

For example, the screening module can normalize the intensity of sound P of the audio to be detected into 0 to b decibels (dB) according to the above formula (2), which matches the range of human auditory perception. Intensities of sound in the intensity-normalized audio that are lower than the preset threshold can then be set to zero, and the audio above the preset threshold is selected, obtaining the audio whose intensity of sound meets the preset condition. Because the accompaniment, background sound and the like in the audio to be detected are all interference tones, setting the preset threshold allows the interference tones to be filtered reasonably.
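
The normalize-then-zero step can be sketched as below. Formulas (1) and (2) are not reproduced in this passage, so the linear min-max scaling into 0 to b, the values b = 80 and threshold = 40, and the function name `normalize_and_gate` are assumptions chosen only to make the behaviour concrete.

```python
from typing import List, Sequence


def normalize_and_gate(intensity: Sequence[float],
                       b: float = 80.0,
                       threshold: float = 40.0) -> List[float]:
    """Scale intensities into 0..b, then zero every value not above the threshold."""
    if not intensity:
        return []
    lo, hi = min(intensity), max(intensity)
    span = hi - lo
    if span == 0:
        scaled = [0.0] * len(intensity)  # a flat input carries no contrast
    else:
        scaled = [b * (p - lo) / span for p in intensity]
    return [v if v > threshold else 0.0 for v in scaled]


print(normalize_and_gate([0.0, 5.0, 10.0]))  # [0.0, 0.0, 80.0]
```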

In some embodiments, after selecting the audio that meets the preset condition, the screening subelement 3023 can set the frequency sequence corresponding to the selected audio as the characteristic sequence of the audio to be detected. For example, the selected audio can be sorted by intensity of sound from largest to smallest to obtain the sorted audio, and the audio with the largest intensity of sound is extracted from the sorted audio; the frequency sequence corresponding to the audio with the largest intensity of sound is the characteristic sequence of the audio to be detected.

For example, a sorting subelement can sort the selected audio by intensity of sound from largest to smallest to obtain the sorted audio. An extracting subelement then extracts from the sorted audio the preset audio with the largest intensity of sound (for example, the audio of the first 6 dimensions with the largest intensity of sound) and extracts frequency sequences of a preset number of dimensions (for example, 6 dimensions) from the frequency matrix of the preset audio, for example the frequency sequences of the six largest intensities of sound in each audio frame. These frequency sequences are the finally obtained characteristic sequence of the audio to be detected.
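
One reading of this extraction step is sketched below: for every frame, the frequency bins are ranked by energy, and the k-th strongest frequency of each frame is appended to the k-th dimension of the characteristic sequence. This interpretation of "dimension", the conversion of a bin index to a frequency as bin × sample rate / transform length, and the function name `top_frequency_sequences` are assumptions; each frame is assumed to have at least `dims` bins.

```python
from typing import List, Sequence


def top_frequency_sequences(energy: Sequence[Sequence[float]],
                            sample_rate: float,
                            n_fft: int,
                            dims: int = 6) -> List[List[float]]:
    """Per frame, keep the frequencies of the `dims` strongest bins, strongest first."""
    sequences: List[List[float]] = [[] for _ in range(dims)]
    for frame in energy:
        # Rank bin indices by energy, largest first, and keep the top `dims`
        ranked = sorted(range(len(frame)), key=lambda k: frame[k], reverse=True)[:dims]
        for d, k in enumerate(ranked):
            sequences[d].append(k * sample_rate / n_fft)
    return sequences


# Two frames with four bins each; bin index equals frequency when sample_rate == n_fft
demo = [[0.1, 5.0, 2.0, 0.3],
        [4.0, 0.2, 0.1, 6.0]]
print(top_frequency_sequences(demo, sample_rate=8, n_fft=8, dims=2))
# [[1.0, 3.0], [2.0, 0.0]]
```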

It should be noted that when the audio to be detected contains interference tones such as accompaniment audio, for example when the audio to be detected includes user audio and accompaniment audio, the screening unit 302 can weaken the accompaniment audio in order to improve the accuracy of the subsequent similarity determination. Optionally, in the process of obtaining the characteristic sequence of the audio to be detected, the screening unit 302 can obtain the root mean square energy mean of the user audio and the root mean square energy mean of the accompaniment audio; obtain the energy spectrum of the user audio and the energy spectrum of the accompaniment audio; optimize the audio to be detected according to the energy spectrum of the user audio, the root mean square energy mean of the user audio, the root mean square energy mean of the accompaniment audio and the energy spectrum of the accompaniment audio, obtaining the optimized audio to be detected; and obtain the characteristic sequence of the optimized audio to be detected.

Optionally, the screening unit 302 can also determine the root mean square energy of the user audio and the root mean square energy of the accompaniment audio; obtain the frame number and frame length of the user audio and the frame number and frame length of the accompaniment audio; determine the root mean square energy mean of the user audio according to the root mean square energy, frame number and frame length of the user audio; and determine the root mean square energy mean of the accompaniment audio according to the root mean square energy, frame number and frame length of the accompaniment audio. In the embodiment of the present invention, the accompaniment audio in the audio to be detected is weakened and the user audio is enhanced according to the relative strength of the user audio and the accompaniment audio, so the similarity between the benchmark audio and the user audio can be detected accurately.
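
As an illustration of a root mean square energy mean computed from frames, the sketch below takes the root mean square of each frame over its frame length and averages the results over the frame number. The exact formula is not given in this passage, so this definition and the function name `rms_energy_mean` are assumptions; how the two means are then used to weaken the accompaniment is not sketched because it is not specified here.

```python
import math
from typing import Sequence


def rms_energy_mean(frames: Sequence[Sequence[float]]) -> float:
    """Average, over all frames, of the root mean square of each non-empty frame."""
    if not frames:
        return 0.0
    total = 0.0
    for frame in frames:
        total += math.sqrt(sum(x * x for x in frame) / len(frame))
    return total / len(frames)


# One frame with RMS 1.0 and one silent frame
print(rms_energy_mean([[1.0, -1.0, 1.0, -1.0], [0.0, 0.0, 0.0, 0.0]]))  # 0.5
```

Comparing this value for the user audio with the value for the accompaniment audio gives one measure of their relative strength.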

Feature acquiring unit 303, for obtaining the reference characteristic sequence of benchmark audio.

Optionally, after obtaining the benchmark audio, the feature acquiring unit 303 can select from the benchmark audio the target audio that meets the preset condition, and obtain the reference characteristic sequence of the benchmark audio according to the selected target audio. Optionally, the feature acquiring unit 303 can pre-process the benchmark audio to obtain the pre-processed benchmark audio; obtain the energy spectrum of the pre-processed benchmark audio; and, according to the energy spectrum, select from the benchmark audio the target audio that meets the preset condition and set the frequency sequence corresponding to the selected target audio as the reference characteristic sequence of the benchmark audio.

To make the benchmark audio easier to screen, the benchmark audio can be pre-processed. Optionally, the feature acquiring unit 303 can sample the benchmark audio according to the preset sampling policy to obtain the sampled benchmark audio; frame the sampled benchmark audio according to the preset framing strategy to obtain the framed benchmark audio; and window the framed benchmark audio to obtain the discrete time-domain pre-processed benchmark audio. Optionally, the feature acquiring unit 303 can perform an integral transform on the pre-processed benchmark audio to obtain its corresponding frequency spectrum, and determine the energy spectrum of the pre-processed benchmark audio according to the frequency spectrum. Optionally, the feature acquiring unit 303 can obtain the intensity of sound of the benchmark audio according to the energy spectrum, and select from the benchmark audio the audio whose intensity of sound is greater than the preset threshold, obtaining the target audio whose intensity of sound meets the preset condition. Optionally, the feature acquiring unit 303 can normalize the intensity of sound of the benchmark audio into the preset intensity of sound range to obtain the intensity-normalized benchmark audio, and select from the intensity-normalized benchmark audio the audio whose intensity of sound is greater than the preset threshold, obtaining the target audio whose intensity of sound meets the preset condition.

Optionally, the feature acquiring unit 303 can sort the selected target audio by intensity of sound from largest to smallest to obtain the sorted target audio, and extract from the sorted target audio the audio with the largest intensity of sound; the frequency sequence corresponding to the audio with the largest intensity of sound is the characteristic sequence of the benchmark audio. For example, the frequency sequences of the six largest intensities of sound in each audio frame are extracted, and these frequency sequences are the finally obtained characteristic sequence of the benchmark audio. Because the audio to be detected itself contains pauses and variations in strength, and its features correspondingly vary in length or magnitude in the time domain and frequency domain, the benchmark audio is, in view of these characteristics of the audio to be detected, pre-processed, its energy spectrum is obtained, and its spectral features are filtered and sorted by energy so that the n dimensions with the largest energy are selected, which reduces the error in the subsequent similarity determination.

In some embodiments, as shown in figure 18, when in benchmark audio including target fiducials audio and interference tones, Feature acquiring unit 303 may include:

A mean value obtaining subelement 3031, for obtaining the first root mean square energy mean of the target benchmark audio, and obtaining the second root mean square energy mean of the interference tones；

An energy spectrum obtaining subelement 3032, for obtaining the first energy spectrum of the target benchmark audio, and obtaining the second energy spectrum of the interference tones；

An optimizing subelement 3033, for optimizing the benchmark audio according to the first energy spectrum, the first root mean square energy mean, the second root mean square energy mean and the second energy spectrum, obtaining the optimized benchmark audio；

A feature obtaining subelement 3034, for obtaining the reference characteristic sequence of the optimized benchmark audio.

In some embodiments, the mean value obtaining subelement 3031 is specifically used for: determining the first root mean square energy of the target benchmark audio and the second root mean square energy of the interference tones; obtaining the first frame number and first frame length of the target benchmark audio and the second frame number and second frame length of the interference tones; determining the first root mean square energy mean of the target benchmark audio according to the first root mean square energy, the first frame number and the first frame length; and determining the second root mean square energy mean of the interference tones according to the second root mean square energy, the second frame number and the second frame length. In this way, the interference tones in the benchmark audio (such as the accompaniment audio) are weakened and the target benchmark audio used for comparison (such as the original singer's audio) is enhanced according to the relative strength of the target benchmark audio and the interference tones, so the similarity between the benchmark audio and the audio to be detected can be detected accurately.

Distance acquiring unit 304, for obtaining the similarity distance between the characteristic sequence of the audio to be detected and the reference characteristic sequence of the benchmark audio.

In some embodiments, as shown in figure 19, distance acquiring unit 304 includes:

A coding subelement 3041, for encoding the characteristic sequence of the audio to be detected according to a preset encoding strategy to obtain the characteristic sequence after the first coding, and encoding the reference characteristic sequence of the benchmark audio according to the preset encoding strategy to obtain the characteristic sequence after the second coding；

A first determining subelement 3042, for determining the similarity distance between the characteristic sequence after the first coding and the characteristic sequence after the second coding.

In some embodiments, the coding subelement 3041 is specifically used for: comparing, according to the preset encoding strategy, the sizes of every two adjacent characteristic values in the characteristic sequence of the audio to be detected; when the former of two adjacent characteristic values is less than the latter, encoding that position of the characteristic sequence of the audio to be detected as a first encoded value; when the former of two adjacent characteristic values is equal to the latter, encoding that position as a second encoded value; when the former of two adjacent characteristic values is greater than the latter, encoding that position as a third encoded value; and generating the characteristic sequence after the first coding based on the first encoded value, the second encoded value and/or the third encoded value.

Taking extended Manchester encoding as the preset encoding strategy, the coding rule of the extended Manchester encoding can be as follows: if two adjacent characteristic values in the characteristic sequence change from low to high, the pair is encoded as the first encoded value, for example "1"; if two adjacent characteristic values in the characteristic sequence remain unchanged, the pair is encoded as the second encoded value, for example "0"; and if two adjacent characteristic values in the characteristic sequence change from high to low, the pair is encoded as the third encoded value, for example "-1".

Likewise, the reference characteristic sequence of the benchmark audio can also be encoded according to the coding rule of the extended Manchester encoding. In some embodiments, the coding subelement 3041 is specifically used for: comparing, according to the preset encoding strategy, the sizes of every two adjacent characteristic values in the characteristic sequence of the benchmark audio; when the former of two adjacent characteristic values is less than the latter, encoding that position of the characteristic sequence of the benchmark audio as the first encoded value; when the former of two adjacent characteristic values is equal to the latter, encoding that position as the second encoded value; when the former of two adjacent characteristic values is greater than the latter, encoding that position as the third encoded value; and generating the characteristic sequence after the second coding based on the first encoded value, the second encoded value and/or the third encoded value.

In some embodiments, the similarity distance includes at least the editing distance, Euclidean distance and Hamming distance, and the first determining subelement 3042 is specifically used for: determining at least the editing distance, Euclidean distance and Hamming distance between the characteristic sequence after the first coding and the characteristic sequence after the second coding; and normalizing the editing distance, Euclidean distance and Hamming distance respectively to obtain the similarity distance.

Here, the editing distance refers to the minimum number of edit operations required to convert one of two encoded characteristic sequences into the other. The larger the editing distance, the more the two encoded characteristic sequences differ; conversely, the smaller the editing distance, the less they differ. The edit operations may include substituting one characteristic character for another, inserting a characteristic character, and deleting a characteristic character, where a characteristic character can be the "1", "0" or "-1" obtained by encoding. When the first determining subelement 3042 determines the editing distance between the characteristic sequence after the first coding and the characteristic sequence after the second coding, it determines the minimum number of edit operations required to convert the former into the latter. The editing distance measures the overall similarity between these two characteristic sequences and better handles alignment problems caused by, for example, different pronunciation lengths.

The Euclidean distance refers to the straight-line distance in Euclidean space between the two points represented by the characteristic sequence after the first coding and the characteristic sequence after the second coding. In the embodiment of the present invention, the Euclidean distance is used to measure the degree of difference between these two characteristic sequences. For example, the first determining subelement 3042 can determine the Euclidean distance d₂ between the characteristic sequence after the first coding and the characteristic sequence after the second coding according to the above formula (6).

After the editing distance d₁, the Euclidean distance d₂ and the Hamming distance d₃ are obtained, the first determining subunit 3042 may normalize the editing distance, the Euclidean distance and the Hamming distance respectively to obtain the similarity distance.
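As a rough illustration of this step, the sketch below computes a Euclidean and a Hamming distance between two coded sequences and rescales each into [0, 1]. The normalization formulas of the embodiment are given earlier in the description and are not reproduced here, so dividing by the largest possible value is only an assumption, as are the function names and the choice to compare sequences position by position up to the shorter length.

```python
import math


def euclidean_distance(a, b):
    # Straight-line distance; zip() compares positions up to the shorter length.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming_distance(a, b):
    # Number of positions at which the two sequences differ.
    return sum(1 for x, y in zip(a, b) if x != y)


def normalize(value, max_value):
    # Assumed normalization: divide by the largest possible value, cap at 1.
    return 0.0 if max_value == 0 else min(value / max_value, 1.0)


a = [1, 0, -1, 1]
b = [1, -1, -1, 0]
n = min(len(a), len(b))

d2 = euclidean_distance(a, b)
d3 = hamming_distance(a, b)

# With characters in {-1, 0, 1}, each position differs by at most 2,
# so the Euclidean distance is at most 2 * sqrt(n).
D2 = normalize(d2, 2 * math.sqrt(n))
D3 = normalize(d3, n)
print(D2, D3)
```

Under the same assumption, an editing distance could be normalized by the length of the longer sequence.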

Determination unit 305, for determining the similarity between audio to be detected and benchmark audio according to similarity distance.

In some embodiments, as shown in figure 20, determination unit 305 includes:

A construction subunit 3051, configured to construct an affine function between each of the editing distance, the Euclidean distance and the Hamming distance and a sub-similarity;

A second determining subunit 3052, configured to determine the sub-similarity corresponding to each distance according to the affine function corresponding to that distance;

A third determining subunit 3053, configured to determine the similarity between the audio to be detected and the benchmark audio according to the sub-similarities.

The construction subunit 3051 establishing the affine function of the similarity with respect to the similarity distance may refer to taking the normalized editing distance, Euclidean distance and Hamming distance as independent variables and the similarity as the dependent variable, and establishing the mapping relationship between the independent variables and the dependent variable. The affine functions can then be used to determine, from the normalized editing distance, Euclidean distance and Hamming distance, sub-similarities normalized into the range 0 to 100.

For example, the construction subunit 3051 may establish a first affine function F(D₁) between the sub-similarity and the editing distance D₁, whose expression is shown in the above formula (8); a second affine function F(D₂) between the sub-similarity and the Euclidean distance D₂, whose expression is shown in the above formula (10); and a third affine function F(D₃) between the sub-similarity and the Hamming distance D₃, whose expression is shown in the above formula (12).
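An affine function of a normalized distance has the form F(D) = k·D + c. The sketch below shows one way such a mapping into the range 0 to 100 could look. The coefficients of formulas (8), (10) and (12) are not reproduced in this passage, so the slope of -100 and intercept of 100 used here are purely hypothetical, as is the helper name `make_affine`.

```python
def make_affine(slope, intercept):
    """Return F(D) = slope * D + intercept, clamped to the range 0..100."""
    def f(d):
        score = slope * d + intercept
        return max(0.0, min(100.0, score))
    return f


# Hypothetical coefficients: distance 0 maps to 100, distance 1 maps to 0.
f_edit = make_affine(-100.0, 100.0)
f_euclid = make_affine(-100.0, 100.0)
f_hamming = make_affine(-100.0, 100.0)

print(f_edit(0.25))    # 75.0
print(f_hamming(1.0))  # 0.0
```

Building each function separately leaves room for a different slope and intercept per distance, which matches the text's use of three distinct affine functions.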

After the first affine function F(D₁) corresponding to the editing distance D₁, the second affine function F(D₂) corresponding to the Euclidean distance D₂ and the third affine function F(D₃) corresponding to the Hamming distance D₃ are obtained, the second determining subunit 3052 may determine the first sub-similarity corresponding to the editing distance D₁ according to the first affine function F(D₁), the second sub-similarity corresponding to the Euclidean distance D₂ according to the second affine function F(D₂), and the third sub-similarity corresponding to the Hamming distance D₃ according to the third affine function F(D₃). The third determining subunit 3053 may then determine the similarity between the audio to be detected and the benchmark audio according to the first sub-similarity, the second sub-similarity and the third sub-similarity.

In some embodiments, the third determining subunit 3053 is specifically configured to: set a first weighted value for the sub-similarity of the editing distance, and set a second weighted value for the sub-similarity of the Hamming distance; set the sub-similarity of the Euclidean distance as a penalty term; and determine the similarity between the audio to be detected and the benchmark audio according to the first weighted value, the second weighted value and the penalty term.

For example, since the editing distance overcomes differences in pronunciation length, pauses and the like, and is strongly resistant to interference, the editing distance can be used as the most important component for determining the similarity. Since the Hamming distance measures the absolute consistency of the characteristic sequences, the Hamming distance can be used as an auxiliary component for determining the similarity. Since the Euclidean distance measures the geometric distance between the characteristic sequences and highlights their differences, the Euclidean distance can be used as a penalty term for determining the similarity. In this case, the third determining subunit 3053 may set a first weighted value for the sub-similarity of the editing distance, set a second weighted value for the sub-similarity of the Hamming distance, and set the sub-similarity of the Euclidean distance as the penalty term, where the first weighted value and the second weighted value can be set flexibly according to actual needs. The third determining subunit 3053 then determines the similarity between the audio to be detected and the benchmark audio according to the first weighted value, the second weighted value and the penalty term; the calculation formula can be as shown in the above formula (14).
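The weighting scheme can be sketched as follows. Formula (14) is not reproduced in this passage, so the way the penalty enters the result here (a subtraction proportional to the Euclidean shortfall from 100) is an assumption made only for illustration, and the function name `combine`, the weights and `penalty_scale` are all hypothetical.

```python
def combine(s_edit, s_hamming, s_euclid, w1=0.75, w2=0.25, penalty_scale=0.5):
    """Combine three sub-similarities (each in 0..100) into one similarity.

    Assumed form: weighted sum of the editing-distance and Hamming-distance
    sub-similarities, minus a penalty that grows as the Euclidean
    sub-similarity falls below 100.
    """
    base = w1 * s_edit + w2 * s_hamming
    penalty = penalty_scale * (100.0 - s_euclid)
    return max(0.0, min(100.0, base - penalty))


# Editing distance dominates, Hamming distance assists, Euclidean penalizes.
print(combine(80.0, 60.0, 90.0))  # 70.0
```

Giving the editing-distance sub-similarity the larger weight reflects its role in the text as the main component, while the actual weight values would be tuned according to actual needs.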

In some embodiments, the audio similarity detection device may further include: a resource transfer unit, configured to execute a virtual resource transfer operation and/or display information related to the similarity detection result of the audio to be detected when the similarity between the audio to be detected and the benchmark audio is greater than a preset similarity threshold.

In some embodiments, the audio similarity detection device may further include: an unlocking unit, configured to execute an audio unlocking operation when the similarity between the audio to be detected and the benchmark audio is greater than a preset similarity threshold.

As can be seen from the above, in the embodiment of the present invention the audio acquiring unit 301 obtains the audio to be detected, and the screening unit 302 filters out the audio meeting the preset condition from the audio to be detected and obtains the characteristic sequence of the audio to be detected according to the audio filtered out, so that interference tones in the audio to be detected are removed and the required audio features are retained. The feature acquiring unit 303 obtains the reference characteristic sequence of the benchmark audio. The distance acquiring unit 304 then obtains the similarity distance between the characteristic sequence of the audio to be detected and the reference characteristic sequence of the benchmark audio, for example the editing distance, the Euclidean distance and the Hamming distance; this similarity distance reduces the influence of many factors on the similarity detection result. The determination unit 305 can then determine the similarity between the audio to be detected and the benchmark audio according to the similarity distance, which improves the accuracy of audio similarity detection.

Correspondingly, the embodiment of the present invention also provides a computer equipment, which may include terminals such as a tablet computer, a mobile phone and a laptop. As shown in Figure 21, the computer equipment may include a radio frequency (RF, Radio Frequency) circuit 601, a memory 602 including one or more computer-readable storage media, an input unit 603, a display unit 604, a sensor 605, a voicefrequency circuit 606, a Wireless Fidelity (WiFi, Wireless Fidelity) module 607, a processor 608 including one or more processing cores, a power supply 609 and other components. Those skilled in the art will appreciate that the computer equipment structure shown in Figure 21 does not constitute a limitation on the computer equipment, which may include more or fewer components than illustrated, combine certain components, or use a different component layout. Wherein:

The RF circuit 601 can be used for receiving and sending signals during messaging or a call. In particular, after downlink information from a base station is received, it is handed over to one or more processors 608 for processing; in addition, uplink data is sent to the base station. In general, the RF circuit 601 includes but is not limited to an antenna, at least one amplifier, a tuner, one or more oscillators, a subscriber identity module (SIM, Subscriber Identity Module) card, a transceiver, a coupler, a low-noise amplifier (LNA, Low Noise Amplifier), a duplexer and so on. The RF circuit 601 can also communicate with networks and other equipment by wireless communication. The wireless communication can use any communication standard or protocol, including but not limited to Global System of Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), Email, Short Messaging Service (SMS) and so on.

The memory 602 can be used for storing software programs and modules, and the processor 608 executes various function applications and data processing by running the software programs and modules stored in the memory 602. The memory 602 may mainly include a program storage area and a data storage area, where the program storage area can store the operating system, the application programs required by at least one function (such as a sound playing function or an image playing function) and so on, and the data storage area can store data created according to the use of the computer equipment (such as audio data or a phone directory) and so on. In addition, the memory 602 may include a high-speed random access memory, and may also include a non-volatile memory, for example at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device. Correspondingly, the memory 602 may also include a memory controller to provide the processor 608 and the input unit 603 with access to the memory 602.

The input unit 603 can be used for receiving input numeric or character information, and for generating keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control. Specifically, in a specific embodiment, the input unit 603 may include a touch sensitive surface and other input equipment. The touch sensitive surface, also referred to as a touch display screen or touchpad, collects touch operations by the user on or near it (such as operations performed by the user on or near the touch sensitive surface with a finger, a stylus or any other suitable object or attachment), and drives the corresponding connection device according to a preset program. Optionally, the touch sensitive surface may include two parts: a touch detecting apparatus and a touch controller. The touch detecting apparatus detects the touch orientation of the user, detects the signal brought by the touch operation, and transmits the signal to the touch controller. The touch controller receives touch information from the touch detecting apparatus, converts it into contact coordinates, sends them to the processor 608, and can receive and execute commands sent by the processor 608. The touch sensitive surface can be realized in multiple types, such as resistive, capacitive, infrared and surface acoustic wave. In addition to the touch sensitive surface, the input unit 603 can also include other input equipment. Specifically, the other input equipment can include but is not limited to one or more of a physical keyboard, function keys (such as a volume control button or a switch key), a trackball, a mouse, a joystick and so on.

The display unit 604 can be used for displaying information input by the user, information provided to the user, and the various graphical user interfaces of the computer equipment, which can be composed of figures, text, icons, video and any combination thereof. The display unit 604 may include a display panel, which can optionally be configured in the form of a liquid crystal display (LCD, Liquid Crystal Display), an Organic Light Emitting Diode (OLED, Organic Light-Emitting Diode) and so on. Further, the touch sensitive surface can cover the display panel. After the touch sensitive surface detects a touch operation on or near it, it sends the operation to the processor 608 to determine the type of touch event, and the processor 608 then provides the corresponding visual output on the display panel according to the type of touch event. Although in Figure 21 the touch sensitive surface and the display panel realize the input and output functions as two independent components, in some embodiments the touch sensitive surface and the display panel can be integrated to realize the input and output functions.

The computer equipment may also include at least one sensor 605, such as an optical sensor, a motion sensor and other sensors. Specifically, the optical sensor may include an ambient light sensor and a proximity sensor, where the ambient light sensor can adjust the brightness of the display panel according to the brightness of the ambient light, and the proximity sensor can turn off the display panel and/or the backlight when the computer equipment is moved to the ear. As a kind of motion sensor, a gravity accelerometer can detect the magnitude of acceleration in all directions (generally three axes), and can detect the magnitude and direction of gravity when static. It can be used for applications that identify the posture of the computer equipment (such as horizontal/vertical screen switching, related games and magnetometer pose calibration) and for vibration-identification functions (such as a pedometer or tapping). The computer equipment can also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer and an infrared sensor, which are not described in detail here.

The voicefrequency circuit 606, a loudspeaker and a microphone can provide the audio interface between the user and the computer equipment. The voicefrequency circuit 606 can convert received audio data into an electric signal and transmit it to the loudspeaker, which converts it into a voice signal for output. Conversely, the microphone converts a collected voice signal into an electric signal, which the voicefrequency circuit 606 receives and converts into audio data. The audio data is then output to the processor 608 for processing and sent, for example, to another computer equipment through the RF circuit 601, or output to the memory 602 for further processing. The voicefrequency circuit 606 may also include an earphone jack to provide communication between a peripheral earphone and the computer equipment.

WiFi is a short-range wireless transmission technology. Through the WiFi module 607, the computer equipment can help the user send and receive Email, browse webpages, access streaming video and so on, providing the user with wireless broadband internet access. Although Figure 21 shows the WiFi module 607, it can be understood that it is not an essential component of the computer equipment and can be omitted as needed without changing the essence of the invention.

The processor 608 is the control centre of the computer equipment. It connects the various parts of the entire computer equipment using various interfaces and lines, and executes the various functions of the computer equipment and processes data by running or executing the software programs and/or modules stored in the memory 602 and calling the data stored in the memory 602, thereby monitoring the computer equipment as a whole. Optionally, the processor 608 may include one or more processing cores. Preferably, the processor 608 can integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs and so on, and the modem processor mainly handles wireless communication. It can be understood that the modem processor need not be integrated into the processor 608.

The computer equipment further includes a power supply 609 (such as a battery) that powers all the components. Preferably, the power supply can be logically connected to the processor 608 through a power-supply management system, so that functions such as charging management, discharging management and power consumption management are realized through the power-supply management system. The power supply 609 can also include any components such as one or more direct current or alternating current power sources, a recharging system, a power failure detection circuit, a power adapter or inverter, and a power supply status indicator.

Although not shown, the computer equipment can also include a camera, a Bluetooth module and so on, which are not described in detail here. Specifically, in the present embodiment, the processor 608 in the computer equipment can, according to the following instructions, load the executable files corresponding to the processes of one or more application programs into the memory 602, and run the application programs stored in the memory 602 to realize various functions:

Obtain audio to be detected；Filter out the audio for meeting preset condition from audio to be detected, and according to filtering out Audio obtains the characteristic sequence of audio to be detected；Obtain the reference characteristic sequence of benchmark audio；Obtain the feature of audio to be detected Similarity distance between sequence, and the reference characteristic sequence of benchmark audio；Audio to be detected and benchmark are determined according to similarity distance Similarity between audio.

Optionally, processor 608 runs storage application program in the memory 602, can also realize following functions: Audio to be detected is pre-processed, pretreated audio is obtained；Obtain the energy spectrum of pretreated audio；According to energy Spectrum, filters out the audio for meeting preset condition from pretreated audio, and by the corresponding frequency sequence of the audio filtered out It is set as the characteristic sequence of audio to be detected.
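The screening step can be illustrated with the sketch below. It assumes that pre-processing and the energy spectrum have already produced, for each frame, a representative frequency and a sound intensity; how those values are derived is described elsewhere and is not modeled here. The function name `select_feature_sequence`, the rescaling to the range 0 to 1 and the threshold of 0.5 are hypothetical choices standing in for the preset sound intensity range and preset threshold.

```python
def select_feature_sequence(frames, threshold=0.5):
    """Keep the frequencies of frames whose normalized intensity exceeds threshold.

    frames: list of (frequency_hz, intensity) pairs, one per audio frame.
    """
    if not frames:
        return []
    intensities = [intensity for _, intensity in frames]
    lowest, highest = min(intensities), max(intensities)
    span = highest - lowest
    selected = []
    for frequency, intensity in frames:
        # Rescale intensity into 0..1; if all frames are equally loud,
        # nothing stands out and every frame is treated as 0.
        normalized = 0.0 if span == 0 else (intensity - lowest) / span
        if normalized > threshold:
            selected.append(frequency)
    return selected


# Hypothetical frames: quiet frames stand in for interference tones.
frames = [(220.0, 0.02), (440.0, 0.9), (445.0, 0.8), (50.0, 0.05), (430.0, 0.7)]
print(select_feature_sequence(frames))  # [440.0, 445.0, 430.0]
```

The resulting frequency sequence plays the role of the characteristic sequence of the audio to be detected.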

Optionally, processor 608 runs storage application program in the memory 602, can also realize following functions: The first root mean square average energy value of target fiducials audio is obtained, and obtains the second root mean square average energy value of interference tones；It obtains The first energy spectrum of target fiducials audio is taken, and obtains the second energy spectrum of interference tones；According to the first energy spectrum, first Root average energy value, the second root mean square average energy value and the second energy spectrum, optimize benchmark audio, the base after being optimized Quasi- audio；The reference characteristic sequence of benchmark audio after obtaining optimization.

Optionally, processor 608 runs storage application program in the memory 602, can also realize following functions: It is encoded according to characteristic sequence of the pre-arranged code strategy to audio to be detected, the characteristic sequence after obtaining the first coding, and It is encoded according to reference characteristic sequence of the pre-arranged code strategy to benchmark audio, the characteristic sequence after obtaining the second coding；Really The similarity distance between the characteristic sequence after characteristic sequence and the second coding after fixed first coding.
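One coding strategy consistent with the description compares each pair of adjacent characteristic values and emits one of three code values. In the sketch below the mapping of "less than", "equal" and "greater than" to 1, 0 and -1 is an assumption (the text names these three characters but this passage does not fix which comparison yields which), and the function name `encode_trend` is hypothetical.

```python
def encode_trend(features, rising=1, flat=0, falling=-1):
    """Encode a characteristic sequence by comparing each adjacent pair of values."""
    coded = []
    for previous, following in zip(features, features[1:]):
        if previous < following:
            coded.append(rising)    # assumed first code value
        elif previous == following:
            coded.append(flat)      # assumed second code value
        else:
            coded.append(falling)   # assumed third code value
    return coded


# The same strategy is applied to both sequences before comparing them.
detected = encode_trend([440.0, 445.0, 445.0, 430.0])
reference = encode_trend([220.0, 223.0, 223.0, 215.0])
print(detected)   # [1, 0, -1]
print(reference)  # [1, 0, -1]
```

Coding the trend rather than the raw values means two renditions at different pitches but with the same contour produce the same coded sequence, as the example shows.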

Optionally, processor 608 runs storage application program in the memory 602, can also realize following functions: The editing distance between the characteristic sequence after characteristic sequence and the second coding after at least determining the first coding, Euclidean distance And Hamming distance；Editing distance, Euclidean distance and Hamming distance are normalized respectively, obtain similarity distance.

Optionally, the processor 608 runs the application program stored in the memory 602 and can also realize the following functions: constructing an affine function between each of the editing distance, the Euclidean distance and the Hamming distance and a sub-similarity; determining the sub-similarity corresponding to each distance according to the affine function corresponding to that distance; and determining the similarity between the audio to be detected and the benchmark audio according to the sub-similarities.

In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in a given embodiment, refer to the detailed description of the audio similarity detection method above, which is not repeated here.

Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments can be completed by instructions, or by instructions controlling relevant hardware. The instructions can be stored in a computer-readable storage medium, and loaded and executed by a processor.

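To this end, the embodiment of the present invention provides a storage medium storing a plurality of instructions, which can be loaded by a processor to execute the steps in any audio similarity detection method provided by the embodiment of the present invention. For example, the instructions can execute the following steps:

Obtain audio to be detected; filter out the audio meeting the preset condition from the audio to be detected, and obtain the characteristic sequence of the audio to be detected according to the audio filtered out; obtain the reference characteristic sequence of the benchmark audio; obtain the similarity distance between the characteristic sequence of the audio to be detected and the reference characteristic sequence of the benchmark audio; determine the similarity between the audio to be detected and the benchmark audio according to the similarity distance.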

Optionally, the instructions can also execute the following steps: pre-processing the audio to be detected to obtain pretreated audio; obtaining the energy spectrum of the pretreated audio; and, according to the energy spectrum, filtering out the audio meeting the preset condition from the pretreated audio, and setting the frequency sequence corresponding to the audio filtered out as the characteristic sequence of the audio to be detected.

Optionally, the instructions can also execute the following steps: obtaining the first root mean square average energy value of the target benchmark audio, and obtaining the second root mean square average energy value of the interference tones; obtaining the first energy spectrum of the target benchmark audio, and obtaining the second energy spectrum of the interference tones; optimizing the benchmark audio according to the first energy spectrum, the first root mean square average energy value, the second root mean square average energy value and the second energy spectrum, to obtain the optimized benchmark audio; and obtaining the reference characteristic sequence of the optimized benchmark audio.

Optionally, the instructions can also execute the following steps: encoding the characteristic sequence of the audio to be detected according to the pre-arranged code strategy to obtain the characteristic sequence after the first coding, and encoding the reference characteristic sequence of the benchmark audio according to the pre-arranged code strategy to obtain the characteristic sequence after the second coding; and determining the similarity distance between the characteristic sequence after the first coding and the characteristic sequence after the second coding.

Optionally, the instructions can also execute the following steps: determining at least the editing distance, the Euclidean distance and the Hamming distance between the characteristic sequence after the first coding and the characteristic sequence after the second coding; and normalizing the editing distance, the Euclidean distance and the Hamming distance respectively to obtain the similarity distance.

Optionally, the instructions can also execute the following steps: constructing an affine function between each of the editing distance, the Euclidean distance and the Hamming distance and a sub-similarity; determining the sub-similarity corresponding to each distance according to the affine function corresponding to that distance; and determining the similarity between the audio to be detected and the benchmark audio according to the sub-similarities.

The specific implementation of above each operation can be found in the embodiment of front, and details are not described herein.

The storage medium may include: a read-only memory (ROM, Read Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, an optical disc and so on.

By the instruction stored in the storage medium, any audio phase provided by the embodiment of the present invention can be executed Like the step in degree detection method, it is thereby achieved that any audio similarity detection side provided by the embodiment of the present invention Beneficial effect achieved by method is detailed in the embodiment of front, and details are not described herein.

It is provided for the embodiments of the invention a kind of audio similarity detection method, device, storage medium and calculating above Machine equipment is described in detail, and used herein a specific example illustrates the principle and implementation of the invention, The above description of the embodiment is only used to help understand the method for the present invention and its core ideas；Meanwhile for the skill of this field Art personnel, according to the thought of the present invention, there will be changes in the specific implementation manner and application range, in conclusion this Description should not be construed as limiting the invention.

Claims

1. An audio similarity detection method, characterized by comprising:

Obtain audio to be detected；

Filtering out the audio meeting a preset condition from the audio to be detected, and obtaining the characteristic sequence of the audio to be detected according to the audio filtered out；

Obtain the reference characteristic sequence of benchmark audio；

Similarity distance between the characteristic sequence for obtaining the audio to be detected, and the reference characteristic sequence of the benchmark audio；

2. audio similarity detection method according to claim 1, which is characterized in that described from the audio to be detected The audio for meeting preset condition is filtered out, and obtains the characteristic sequence of the audio to be detected according to the audio filtered out, comprising:

The audio to be detected is pre-processed, pretreated audio is obtained；

Obtain the energy spectrum of the pretreated audio；

According to the energy spectrum, the audio for meeting preset condition is filtered out from the pretreated audio, and will be filtered out The corresponding frequency sequence of audio be set as the characteristic sequence of the audio to be detected.

3. audio similarity detection method according to claim 2, which is characterized in that it is described to the audio to be detected into Row pretreatment, obtains pretreated audio, comprising:

4. audio similarity detection method according to claim 2, which is characterized in that the acquisition is described pretreated The energy spectrum of audio, comprising:

5. audio similarity detection method according to claim 2, which is characterized in that it is described according to the energy spectrum, from The audio for meeting preset condition is filtered out in the pretreated audio, comprising:

The intensity of sound of the audio to be detected is obtained according to the energy spectrum；

Filtering out, from the audio to be detected, the audio whose intensity of sound is greater than a preset threshold, to obtain the audio whose intensity of sound meets the preset condition.

6. audio similarity detection method according to claim 5, which is characterized in that described from the audio to be detected The audio that intensity of sound is greater than preset threshold is filtered out, the audio that intensity of sound meets the preset condition is obtained, comprising:

Normalizing the intensity of sound of the audio to be detected into a preset sound intensity range, to obtain intensity-standardized audio；

Filtering out, from the intensity-standardized audio, the audio whose intensity of sound is greater than the preset threshold, to obtain the audio whose intensity of sound meets the preset condition.

7. audio similarity detection method according to claim 1, which is characterized in that when in the benchmark audio include mesh When marking benchmark audio and interference tones, the reference characteristic sequence for obtaining benchmark audio, comprising:

The first root mean square average energy value of the target fiducials audio is obtained, and obtains the second root mean square of the interference tones Average energy value；

The first energy spectrum of the target fiducials audio is obtained, and obtains the second energy spectrum of the interference tones；

According to first energy spectrum, the first root mean square average energy value, the second root mean square average energy value and the second energy spectrum, to institute It states benchmark audio to optimize, the benchmark audio after being optimized；

The reference characteristic sequence of benchmark audio after obtaining the optimization.

8. audio similarity detection method according to claim 7, which is characterized in that described to obtain the target fiducials sound First root mean square average energy value of frequency, and obtain the second root mean square average energy value of the interference tones, comprising:

It determines the first root mean square energy of the target fiducials audio, and determines the second root mean square energy of the interference tones Amount；

Obtain the first frame number and the first frame length of the target fiducials audio, and obtain the interference tones the second frame number and Second frame length；

The first root mean square of the target fiducials audio is determined according to the first root mean square energy, the first frame number and the first frame length Average energy value, and the second of the interference tones are determined according to the second root mean square energy, the second frame number and the second frame length Root mean square average energy value.

9. audio similarity detection method according to any one of claims 1 to 8, which is characterized in that described in the acquisition Similarity distance between the characteristic sequence of audio to be detected, and the reference characteristic sequence of the benchmark audio, comprising:

It is encoded according to characteristic sequence of the pre-arranged code strategy to the audio to be detected, the feature sequence after obtaining the first coding Column, and encoded according to reference characteristic sequence of the pre-arranged code strategy to the benchmark audio, obtain the second coding Characteristic sequence afterwards；

The similarity distance between the characteristic sequence after characteristic sequence and the second coding after determining first coding.

10. audio similarity detection method according to claim 9, which is characterized in that described according to pre-arranged code strategy The characteristic sequence of the audio to be detected is encoded, the characteristic sequence after obtaining the first coding, comprising:

Comparing, according to the pre-arranged code strategy, the sizes of every two adjacent characteristic values in the characteristic sequence of the audio to be detected；

When the previous characteristic value of two adjacent characteristic values is less than the following characteristic value, encoding the characteristic sequence of the audio to be detected as a first code value; and,

When the previous characteristic value of two adjacent characteristic values is equal to the following characteristic value, encoding the characteristic sequence of the audio to be detected as a second code value; and

When the previous characteristic value of two adjacent characteristic values is greater than the following characteristic value, encoding the characteristic sequence of the audio to be detected as a third code value；

Generating the characteristic sequence after the first coding based on the first code value, the second code value and/or the third code value.

11. The audio similarity detection method according to claim 9, characterized in that the similarity distance includes at least an edit distance, a Euclidean distance and a Hamming distance, and the determining of the similarity distance between the first encoded characteristic sequence and the second encoded characteristic sequence comprises:

Determining at least the edit distance, the Euclidean distance and the Hamming distance between the first encoded characteristic sequence and the second encoded characteristic sequence.
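The three distances named in claim 11 can be computed as in the following Python sketch, given for illustration only. The function names are hypothetical. The edit distance is the standard Levenshtein distance; comparing the Euclidean and Hamming distances over the common length of the two sequences is an assumption of this sketch, because the claim does not say how sequences of unequal length are handled.

```python
import math
from typing import Sequence


def edit_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance over the common length (assumption: truncate)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count of differing positions over the common length (assumption)."""
    return sum(1 for x, y in zip(a, b) if x != y)
```

The edit distance tolerates insertions and deletions, so it remains meaningful when the two encoded sequences differ in length, whereas the other two distances compare position by position.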

12. The audio similarity detection method according to claim 11, characterized in that the determining of the similarity between the audio to be detected and the benchmark audio according to the similarity distance comprises:

Constructing an affine function between each of the edit distance, the Euclidean distance and the Hamming distance and a sub-similarity;

Determining the sub-similarity corresponding to each distance according to the affine function corresponding to that distance;

Determining the similarity between the audio to be detected and the benchmark audio according to the sub-similarities.
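One possible form of the affine function in claim 12 is sketched below in Python. The claim does not fix the coefficients, so the slope `-1 / max_distance`, the intercept 1, the clipping to the range [0, 1] and the name `make_affine_map` are all assumptions made for illustration.

```python
from typing import Callable


def make_affine_map(max_distance: float) -> Callable[[float], float]:
    """Build s(d) = slope * d + intercept, mapping distance to sub-similarity.

    Assumption: distance 0 maps to 1.0 and max_distance maps to 0.0.
    """
    if max_distance <= 0:
        raise ValueError("max_distance must be positive")
    slope = -1.0 / max_distance
    intercept = 1.0

    def to_sub_similarity(distance: float) -> float:
        value = slope * distance + intercept
        # Clipping is an extra assumption so the result stays in [0, 1].
        return min(1.0, max(0.0, value))

    return to_sub_similarity
```

Under this sketch a separate map would be built for each of the three distances, for example `make_affine_map(10.0)` for an edit distance that is not expected to exceed 10.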

13. The audio similarity detection method according to claim 12, characterized in that the determining of the similarity between the audio to be detected and the benchmark audio according to the sub-similarities comprises:

Setting a first weight value for the sub-similarity of the edit distance, and setting a second weight value for the sub-similarity of the Hamming distance;

Setting the sub-similarity of the Euclidean distance as a penalty term;

Determining the similarity between the audio to be detected and the benchmark audio according to the first weight value, the second weight value and the penalty term.
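The combination step of claim 13 can be sketched in Python as follows. The claim does not state how the penalty term enters the final score; applying the Euclidean sub-similarity as a multiplicative factor on the weighted sum is an assumption of this sketch, as are the function name `combine_similarity` and the default weights.

```python
def combine_similarity(sub_edit: float,
                       sub_hamming: float,
                       sub_euclidean: float,
                       w_edit: float = 0.5,
                       w_hamming: float = 0.5) -> float:
    """Weighted sum of two sub-similarities, scaled by a penalty term.

    Assumption: the Euclidean sub-similarity (in [0, 1]) multiplies the
    weighted sum, so a low Euclidean sub-similarity lowers the final score.
    """
    weighted = w_edit * sub_edit + w_hamming * sub_hamming
    return weighted * sub_euclidean
```

With sub-similarities in [0, 1] and weights that sum to 1, the result of this sketch also stays in [0, 1], which makes it directly comparable with a preset similarity threshold.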

14. The audio similarity detection method according to any one of claims 1 to 9, characterized in that, after the determining of the similarity between the audio to be detected and the benchmark audio according to the similarity distance, the method comprises:

When the similarity between the audio to be detected and the benchmark audio is greater than a preset similarity threshold, performing a virtual resource transfer operation, and/or displaying information related to the similarity detection result of the audio to be detected.

15. The audio similarity detection method according to any one of claims 1 to 9, characterized in that, after the determining of the similarity between the audio to be detected and the benchmark audio according to the similarity distance, the method comprises:

When the similarity between the audio to be detected and the benchmark audio is greater than a preset similarity threshold, performing an operation of unlocking an audio lock.

16. An audio similarity detection device, characterized by comprising:

An audio acquiring unit, for obtaining audio to be detected;

A screening unit, for filtering out audio that meets a preset condition from the audio to be detected, and obtaining the characteristic sequence of the audio to be detected according to the audio filtered out;

A distance acquiring unit, for obtaining the similarity distance between the characteristic sequence of the audio to be detected and the reference characteristic sequence of the benchmark audio;

A determination unit, for determining the similarity between the audio to be detected and the benchmark audio according to the similarity distance.

17. A storage medium, characterized in that the storage medium stores a plurality of instructions, the instructions being suitable for being loaded by a processor to perform the audio similarity detection method according to any one of claims 1 to 16.

18. A computer equipment, including a memory and a processor, characterized in that the memory stores a computer program which, when executed by the processor, causes the processor to perform the audio similarity detection method according to any one of claims 1 to 16.