CN109977255A

CN109977255A - Model generating method, audio-frequency processing method, device, terminal and storage medium

Info

Publication number: CN109977255A
Application number: CN201910134014.5A
Authority: CN
Inventors: 贾少勇
Original assignee: Beijing QIYI Century Science and Technology Co Ltd
Current assignee: Beijing QIYI Century Science and Technology Co Ltd
Priority date: 2019-02-22
Filing date: 2019-02-22
Publication date: 2019-07-05

Abstract

The embodiment of the invention provides a kind of model generating method, audio-frequency processing method, device, terminal and computer readable storage mediums, the model generating method includes: to be labeled according to preset musical genre labels to sample audio data, generates annotated audio sample；The annotated audio sample is cut into multiple annotated audio data segments of preset length；Each annotated audio data segment is handled into the mark sample audio section feature vector for multiple default dimensions, using as mark sample set；The preset musical genre labels of the mark sample audio section feature vector each in the mark sample set are updated, mark sample audio training set is obtained；The mark sample audio training set is trained using deep learning method, obtains first music style marking model.It realizes and target audio data is inputted into first music style marking model, obtain the purpose of music style label.

Description

Model generating method, audio-frequency processing method, device, terminal and storage medium

Technical field

The present invention relates to network technique fields, more particularly to model generating method, audio-frequency processing method, terminal and calculating Machine readable storage medium storing program for executing.

Background technique

With the universal and development of video or audio network, many videos and audio website are emerged, user is facilitated to regard Interested video or audio are searched on frequency or audio website, greatly enriches the life of user.

Currently, for a large amount of audio, video datas by user's self-control or official's production stored on video or audio website For users to use, wherein recommend the function of audio-video to have great demand to user for the music style of audio, video data. However, in the prior art, marked often through the artificial music style for carrying out audio-video website, low efficiency and at high cost.

Therefore, how the mark for carrying out music style to the audio, video data stored on audio-video website of efficiently and accurately is There is technical problem to be solved at present.

Summary of the invention

The technical problem to be solved is that provide a kind of model generating method, audio-frequency processing method, dress for the embodiment of the present invention It sets, terminal and computer readable storage medium, to solve to the music associated video data or audio number stored in video website The technical issues of according to the mark for carrying out music style.

To solve the above-mentioned problems, the present invention is achieved through the following technical solutions:

First aspect provides a kind of model generating method, which comprises

Sample audio data is labeled according to preset musical genre labels, generates annotated audio sample；

The annotated audio sample is cut into multiple annotated audio data segments of preset length；

Each annotated audio data segment is handled into the mark sample audio section feature vector for multiple default dimensions, to make To mark sample set；

By the preset musical genre labels of the mark sample audio section feature vector each in the mark sample set It is updated, obtains mark sample audio training set；

The mark sample audio training set is trained using deep learning method, obtains first music style mark Model.

Second aspect provides a kind of audio-frequency processing method, which comprises

The mark for carrying out music style to target audio data is received to request；

It is requested according to the label, using music style marking model, marks the music style of the target audio data.

The third aspect provides a kind of model generating means, and described device includes:

Annotated audio sample generation module, for being labeled according to preset musical genre labels to sample audio data, Generate annotated audio sample；

Annotated audio data segment obtains module, for the annotated audio sample to be cut into multiple marks of preset length Audio data section；

Sample set determining module is marked, for handling each annotated audio data segment for the mark of multiple default dimensions Sample audio section feature vector, using as mark sample set；

Sample audio training set generation module is marked, is used for the mark sample audio section each in the mark sample set The preset musical genre labels of feature vector are updated, and obtain mark sample audio training set；

First music style marking model training module, for being instructed using deep learning method to the mark sample audio Practice collection to be trained, obtains first music style marking model.

Fourth aspect provides a kind of apparatus for processing audio, and described device includes:

Music style marks request receiving module, asks for receiving the mark for carrying out music style to target audio data It asks；

Music style labeling module, using music style marking model, marks the mesh for requesting according to the label Mark the music style of audio data.

5th aspect provides a kind of terminal, comprising: memory, processor and is stored on the memory and can be described The computer program run on processor realizes such as above-mentioned model generation side when the computer program is executed by the processor Step in method, or such as the step of above-mentioned audio-frequency processing method.

6th aspect provides a kind of computer readable storage medium, and calculating is stored on the computer readable storage medium Machine program realizes the step in such as above-mentioned model generating method when the computer program is executed by processor, or such as above-mentioned Audio-frequency processing method in step.

Compared with prior art, the embodiment of the present invention includes following advantages:

In the embodiment of the present invention, for the audio data in audio-video website, marked using preset musical genre labels After note, by pretreatment, if audio data is cut into pieces post-processing as the feature vector of default dimension, then music is carried out Mark sample audio training set is obtained after the update of genre labels, using deep learning method to mark sample audio training Collection is trained, and obtains first music style marking model.Then, target audio data are inputted into above-mentioned first music style mark Injection molding type obtains the music style of first music style marking model output.Wherein, above-mentioned music style is preset, such as Pop music, hip-hop music, rock music, rhythm and blues etc..In this way, being marked by all music styles to realize audio-video Data carry out the purpose of music style label, realize the mesh that watching focus type mark is carried out for various video data precise and high efficiencies , have the beneficial effect that efficiently and accurately realizes the music style label of audio, video data.

The above description is only an overview of the technical scheme of the present invention, in order to better understand the technical means of the present invention, And it can be implemented in accordance with the contents of the specification, and in order to allow above and other objects of the present invention, feature and advantage can It is clearer and more comprehensible, the followings are specific embodiments of the present invention.

It should be understood that above general description and following detailed description be only it is exemplary and explanatory, not The application can be limited.

Detailed description of the invention

Fig. 1 is a kind of flow chart of model generating method provided in an embodiment of the present invention；

Figure 1A is a kind of audio signal schematic diagram provided in an embodiment of the present invention；

Figure 1B is a kind of audio data windowing process schematic diagram provided in an embodiment of the present invention；

Fig. 2 is a kind of audio-frequency processing method flow chart provided in an embodiment of the present invention；

Fig. 3 is a kind of structural schematic diagram of model generating means provided in an embodiment of the present invention；

Fig. 4 is a kind of structural schematic diagram of apparatus for processing audio provided in an embodiment of the present invention.

Specific embodiment

In order to make the foregoing objectives, features and advantages of the present invention clearer and more comprehensible, with reference to the accompanying drawing and specific real Applying mode, the present invention is described in further detail.

Referring to Fig. 1, being a kind of flow chart of model generating method provided in an embodiment of the present invention, specifically include:

Step 101 is labeled sample audio data according to preset musical genre labels, generates annotated audio sample；

In the embodiment of the present invention, sample audio data is the audio, video data concentration extraction in the storage of audio-video website backstage Out, wherein the storage mode of audio, video data collection generally can be with the form storage that the time marks, such as 1 year first quarter User freely upload audio, video data and official production upload audio, video data gather, in these audio, video data collection It is middle extraction wherein audio data as audio sample.

For example, extracting audio data in video data as audio sample, or by audio data directly as audio The audio data extracted in video data and the audio data set stored naturally can also be synthesized audio sample by sample.

Wherein, the specific method for extracting video data sound intermediate frequency data is described as follows:

It is read by the video data in real-time messages transport protocol (RTMP, Real Time Messaging Protocol) The method for taking packet RTMP_ReadPacket to obtain video and corresponding audio data are as follows:

1, the audio sync packet in video data is obtained；

2, the audio head decoding data AACDecoderSpecificInfo and audio data parsed in audio sync packet matches Confidence ceases AudioSpecificConfig.Wherein, audio data configuration information AudioSpecificConfig is for generating ADST (including sample rate, channel number, the frame length data in audio data).

3, other audio packs in video data are obtained, and parse original audio data (i.e. ES).

4, the ES stream of AAC is packaged by audio data head AAC decoder the format of ADTS, wherein be in AAC ES The header file ADTSheader of 7 bytes of addition before stream, to parse audio data content.

As above, i.e., by the packets of audio data extracted in parsing video data, and then the specific interior of audio data is parsed Hold, that is, has extracted the audio content in video data.

It is to be appreciated that the method that the audio data extracting mode in video data is not limited to foregoing description, the present invention is real It is without restriction to the extracting mode of audio data to apply example.

After obtaining audio sample by the above method, by predefining 25, music style label, for example, popular sound Pleasure, hip-hop music, rock music, rhythm and blues, soul music, Reggae, country music, Funk, folk rhyme, Middle East music, Disco, classical music, electronic music, Latin music, Blues, child music, new generation music, vocal music, African music, Christmas Music, Asian music, Ska, separate music, traditional music etc. manually mark above-mentioned audio sample, obtain annotated audio Sample.

Step 102, multiple annotated audio data segments that the annotated audio sample is cut into preset length；

In practical applications, the length disunity of annotated audio sample will cause data mistake when carrying out batch processing Difference finally obtains the training sample for meeting preset standard so needing to cut audio data, such as sample total is total to 140825, average every class 5633, when each sample, is about 10 seconds.

Wherein, annotated audio sample is split, obtains N number of annotated audio data segment of default size.It can will be upper It states annotated audio sample and imports preset audio cutter and cut, cutting audio data can be pre-set in cutting The duration of section, audio cutter may be implemented to carry out batch cutting according to the duration.

Certainly, the embodiment of the present invention is without restriction to the type of audio cutter.

It is to be appreciated that needing the preset requirement to training sample different on different models, therefore the embodiment of the present invention It is without restriction to the specific length of audio section.

Each annotated audio data segment is handled the mark sample audio section feature for multiple default dimensions by step 103 Vector, using as mark sample set；

Preferably, step 103, further comprise:

Each annotated audio data segment is carried out sub-frame processing respectively by sub-step 1031, obtains each mark sound Multiple framing annotated audio data segments of frequency data segment；

Specifically, as shown in Figure 1A, voice signal is being macroscopically jiggly, is smoothly, to have short on microcosmic When stationarity (voice signal chosen in box in such as figure, in 10---30ms it is considered that voice signal approximation it is constant), this A voice signal to be divided into segment to be handled, each segment is known as a frame (CHUNK), each certain piece The duration of section is not limited to the 10---30ms of foregoing description, and the embodiment of the present invention is without restriction to the duration of frame.

Therefore, annotated audio data segment is further divided into the smaller mark framing audio data as unit of frame.

Each framing annotated audio data segment is multiplied by sub-step 1032 with windowed function respectively, obtains each described The mark adding window audio data section of framing annotated audio data segment；

Specifically, when framing, each frame can repeat to intercept a part, i.e., the tail portion of previous true frame and present frame After head respectively takes a part to be overlapped, then windowing process is carried out, so overall situation voice signal will not make one because of windowing process The both ends part of frame signal, which is weakened, obtains the audio data of excessive noise reduction, so realizing the weight between frame and frame in framing It is folded, so that the audio signal after windowing process is more continuous.

Wherein, mark framing audio data obtained above is subjected to windowing process, i.e., is as schemed by original audio signal In 1B shown in left-hand component, by being multiplied with the intermediate windowed function as shown in the middle section Figure 1B, obtain on the right of Figure 1B Logarithmic spectrum of every frame audio data on frequency domain shown in part, so that originally without periodic voice signal (such as mark point Frame audio data) Partial Feature that shows periodic function, it is determined as the mark adding window audio data of above-mentioned framing audio data Section.

Each mark adding window audio data section is carried out Meier transformation respectively by sub-step 1033, obtains each mark Infuse the mark Meier frequency spectrum data of audio data section.

Further, in order to enable sound characteristic is intuitive in the mark adding window audio data obtained after framing and windowing process It shows, needs to carry out Meier transformation, audio data is converted into mark Meier frequency spectrum data, wherein the unit of frequency is hertz (Hz), the frequency range that human ear can be heard is 20-20000Hz, but human ear is not that linear perception is closed to this scale unit of Hz System.For example, if pitch frequency is increased to 2000Hz, that ear can only be perceived if people have adapted to the tone of 1000Hz A little is improved to frequency, frequency is detectable at all and is doubled.If converting Meier for common frequency scaling Frequency scaling, then human ear is to the perceptibility of frequency just at linear relationship.That is, under Meier scale, if two sections of languages The mel-frequency of sound differs twice, then the tone that human ear can perceive probably also differs twice.Having audio data realization can Depending on the beneficial effect changed.

Each mark Meier frequency spectrum data is converted to the feature vector of default dimension respectively, obtained by sub-step 1034 To the mark sample audio section feature vector of each mark Meier frequency spectrum data.

In this step, above-mentioned mark Meier spectral image data are converted into the feature vector that machine can identify, wherein Image data, which is converted to the common model of machine readable feature vector, BVLCGoogLeNet model, certainly, in practical application In, it is not limited to the conversion regime of foregoing description, the embodiments of the present invention are not limited thereto.

Preferably, step 1034, further comprise:

Sub-step 10341, by it is described mark Meier frequency spectrum data in the corresponding Meier spectrum number of each frame audio data According to being determined as sample framing Meier frequency spectrum data；

It in this step, extracts every frame audio in the annotated audio data segment of above-mentioned acquisition and corresponds to Meier frequency spectrum data figure, make For framing Meier frequency spectrum data, it is determined as marking framing Meier frequency spectrum data.

The sample framing Meier frequency spectrum data is converted to sample framing audio feature vector by sub-step 10342；

In this step, each mark framing Meier spectrogram data is subjected to feature vector conversion.

Specifically, above-mentioned mark framing Meier spectral image data are converted to point by image feature vector transformation model Frame audio feature vector, wherein known common image feature vector transformation model has BVLC GoogLeNet model, it is one A 22 layers of deep convolutional network can detect the feature vector of 1000 kinds of different image types.

Certainly, foregoing description is not limited to for characteristics of image conversion method, the embodiments of the present invention are not limited thereto.

Sub-step 10343 splices the sample framing audio feature vector of default frame number, obtains default dimension Mark sample audio section feature vector；

In this step, for the sample framing audio frequency characteristics for marking framing Meier frequency spectrum data obtained in step 10342 After vector, by multiple sample framing audio feature vectors merge into the mark sample audio section feature of a default dimension to Amount is directed to the audio data of one second frame for example, framing audio feature vector is the feature vector of 128 dimensions, and for sound The processing of frequency evidence, the information that one second frame is included are not enough to characterize the concrete type of audio data, so by the framing audio The context-sensitive framing audio feature vector of feature vector merges, i.e. the corresponding feature vector of 3 seconds audio datas, i.e., 3 framing audio feature vectors are spliced into the feature vector of 128*3=384 dimension.

Certainly, default dimension is not necessarily 384 dimensions mentioned above, is also possible to the group of five framing audio feature vectors The audio feature vector for the default dimension of conjunction or ten framing audio feature vectors being composed, so, preset dimension Whether what setting depended primarily on audio data includes enough information in case subsequent processing, therefore, the embodiment of the present invention is to pre- If the specific value of dimension is not limited.

Each mark sample audio section set of eigenvectors is combined into mark sample set by sub-step 1035.

In this step, above-mentioned all mark sample audio section feature vectors are stored as a set, as mark sample Collection.

Step 104 marks each preset musical for marking sample audio section feature vector in sample set for described Genre labels are updated, and obtain mark sample audio training set；

In this step, the audio data typically directly downloaded there are strong noise (respectively data noise and label noise), If directly training music style marking model, accuracy rate are lower.So further to above-mentioned mark sample audio training set It carries out data cleansing, that is, with marking sample audio training set music style model, then is carried out with sample of the model to every class The cleaning of label noise, the final music style data set for obtaining high quality, specific steps are described as follows:

Preferably, step 104, further comprise:

Sub-step 1041, according to preset ratio, from the mark sample set extract the mark sample audio section feature to Amount, using as training sample feature set；

In this step, if marking sample set total amount totally 140825 mark sample audio section feature vectors, average every class The mark sample audio section feature vector of music style has 5633, and when each sample is about 10 seconds, by default ratio therein A part (such as 20%) in example (such as 50%) is extracted as training sample feature set, and is extracted and wherein 20% be used as training Sample characteristics.

The training sample feature set is trained by sub-step 1042 by predetermined deep learning method, obtains second Music style marking model；

In this step, training sample feature is trained by predetermined deep learning algorithm, obtains the second music style Marking model, wherein predetermined deep learning algorithm can be Softmax classifier, certainly, in practical applications not It is limited to Softmax classifier, the embodiment of the present invention is without restriction to specific deep learning method.

Sub-step 1043, using the mark sample audio section feature vector remaining in the mark sample set as test Sample characteristics collection, and the test sample feature set is inputted into the second music style marking model, so that second sound Happy style marking model exports the music style of each mark sample audio section feature vector in the test sample feature set Label generates and updates mark sample set；

Wherein, remaining 30% in 50% in above-mentioned totally 140825 sample totals is extracted and is divided into conduct three times Test set is cleaned, above-mentioned trained second music style marking model is inputted.

Specifically, music style label for labelling is carried out, the test set that will have been marked each time is again added to training set In, it trains again, generates the second music style marking model of update, then extract 10% test set progress genre labels mark Note, mark are put into training set, second of the second music style marking model updated of training, so until all tests after the completion Collection all returns in training set, then the data in training set are to complete the sample data of cleaning, that is, update mark sample Collection.

Sub-step 1044 merges update mark sample set with the training sample feature set, obtains mark sample Audio training set.

In this step, the update mark sample set of above-mentioned completion cleaning merges with training sample feature set, as mark sample This audio training set.

It unlabelled sample data is input to the second music style marking model marks it is to be appreciated that repeated multiple times Note, mark are completed to gather training sample update training pattern again, can effectively improve mark accuracy rate, and training sample is got over Huge, the accuracy rate that training pattern is used to mark is higher, the second music style marking model obtained by repetition training, finally The music segmentation tag of all test sets marked out marks sample set in conjunction with above-mentioned update, and what is obtained is mark sample sound Frequency training set.

Step 105 is trained the mark sample audio training set using deep learning method, obtains first music Style marking model.

In this step, mark sample audio training set obtained above is carried out again by predetermined deep learning method Training, finally obtains first music style marking model, effectively reduces the manpower of music label in artificial mark training sample Cost, and training sample data amount is improved, improve model training efficiency and mark accuracy rate.

In embodiments of the present invention, it by being labeled according to preset musical genre labels to sample audio data, generates Annotated audio sample；The annotated audio sample is cut into multiple annotated audio data segments of preset length；By each mark Infuse audio data section processing be multiple default dimensions mark sample audio section feature vector, using as mark sample set；By institute The preset musical genre labels for stating each mark sample audio section feature vector in mark sample set are updated, and are obtained Mark sample audio training set；The mark sample audio training set is trained using deep learning method, obtains first Music style marking model can carry out music style label to the audio data for not having music style label with efficiently and accurately Label.

Referring to Fig. 2, be a kind of flow chart of audio-frequency processing method provided in an embodiment of the present invention, can specifically include as Lower step:

Step 201 receives the mark request that target audio data are carried out with music style；

In the embodiment of the present invention, back-end server receives the music style mark that user is sent by application interface and asks It asks, wherein music style mark requests one that the multitude of video data set of usual corresponding server storage or audio data are concentrated Or multiple data sets carry out, wherein audio, video data set is usually to be stored according to the date, is also possible to according to upload The data acquisition system of storage is marked in user identifier.Gather for example, the audio-video that user in February uploads is stored as one, in official The audio-video of biography is stored as a set, initiating the label request of video watching focus is initiated for selected one or more set.

In practical applications, the mark for initiating music style for audio data sets or video data set is requested, such as Fruit is directed to video data set, it is necessary to the audio data in video data set is extracted as target audio data, and It is then further processed directly as target audio data for audio data sets.

Wherein, the extracting mode of video data sound intermediate frequency data is described in detail in a step 101, no longer superfluous herein It states.

Certainly, for the concrete mode of audio-video storage set, it is not limited to foregoing description, this is not added in the embodiment of the present invention With limitation.

Wherein, music style, i.e. style of song, the representative style for referring to that musical works shows on the whole are special Point.

Therefore, it is concentrated for large data and carries out music style analysis, it can be automatically and efficiently to the short-sighted frequency of magnanimity or short Audio data carries out music style analysis, to realize the purpose to user-customized recommended.

Step 202 is requested according to the label, using music style marking model, marks the target audio data Music style；

Preferably, step 202, further comprise:

Sub-step 2021 is requested according to the mark, and the target audio data are divided into the audio number of preset length According to section；

In the embodiment of the present invention, the audio data of said extracted is split, obtains the N section audio data of default size Section, wherein above-mentioned audio data can be imported into preset audio cutter and cut, can manually selected and cut in cutting The duration at audio data end is cut, and audio cutter may be implemented batch and cut.

Certainly, the embodiment of the present invention is without restriction to the type of audio cutting method.

Sub-step 2022, the feature vector by each audio data section processing to preset dimension；

In this step, the audio frequency characteristics of default dimension are converted to after pre-processing to each audio data section obtained above Vector, description specific as follows:

Preferably, sub-step 2022 further comprise:

Sub-step 20221 carries out sub-frame processing to each audio data section, obtains framing audio data section；

In this step, in this step, framing windowing process will be carried out by second audio signal in above-mentioned each audio data section It is converted with Meier.

Wherein, sub-frame processing is as shown in Figure 1A, voice signal be macroscopically it is jiggly, on microcosmic be smoothly, With short-term stationarity (as shown in box, it is considered that voice signal approximation is constant in 10---30ms), this can Handled so that voice signal is divided into segment, each segment is known as a frame (CHUNK), each certain segment when The long 10---30ms for being not limited to foregoing description, the embodiment of the present invention are without restriction to the duration of frame.

The framing audio data section is multiplied by sub-step 20222 with windowed function, obtains adding window audio data section.

Wherein, it when framing, not intercept back-to-back, but overlapped a part, i.e., the tail of previous true frame After portion respectively takes a part Chong Die with the head of present frame, then windowing process is carried out, so overall situation voice signal will not be because of adding window It handles and the both ends part of a frame signal is weakened and obtains the audio data of excessive noise reduction, so realizing frame in framing It is overlapping between frame, so that the audio signal after windowing process is more continuous.

Wherein, audio section downlink data upon handover obtained above is subjected to windowing process, i.e. original audio signal is as in Figure 1B Shown in left-hand component, by being multiplied with the intermediate windowing process function as shown in the middle section Figure 1B, obtain on the right of Figure 1B Logarithmic spectrum of every frame audio data on frequency domain shown in part, so that showing the period without periodic voice signal originally The Partial Feature of function to get arrive audio section windowed data.

The adding window audio data section is carried out Meier transformation by sub-step 20223, obtains the Meier of the audio data section Frequency spectrum data.

Further, in order to obtain after framing and windowing process to audio section windowed data in sound characteristic it is intuitive It shows, needs to carry out Meier transformation to adding window audio data section, audio data is converted into Meier frequency spectrum data, have sound spy The linear beneficial effect intuitively shown of sign.

Sub-step 20224, the feature vector that the Meier frequency spectrum data is converted to default dimension.

Preferably, sub-step 20224 further comprise:

Sub-step 202241, by the corresponding Meier frequency spectrum data of each frame audio data in the Meier frequency spectrum data, It is determined as framing Meier frequency spectrum data；

In this step, intercept every frame audio in the audio data section of above-mentioned acquisition and correspond to Meier frequency spectrum data figure, as point Frame Meier frequency spectrum data, that is, the Meier spectrogram data being segmented are determined as framing Meier frequency spectrum data.

The framing Meier frequency spectrum data is converted to framing audio feature vector by sub-step 202242；

In this step, each framing Meier spectrogram data is subjected to feature vector conversion.

Specifically, above-mentioned framing Meier spectral image data are converted into framing sound by image feature vector transformation model Frequency feature vector, wherein known common image feature vector transformation model has BVLC GoogLeNet model, it is one 22 1000 kinds of different image format conversions can be machine-readable features vector by the deep convolutional network of layer.

Sub-step 202243 splices the framing audio feature vector of default frame number, obtains default dimension Feature vector.

In this step, after the framing audio feature vector of above-mentioned framing Meier frequency spectrum data, by multiple framing sounds Frequency feature vector merges into the audio feature vector of a default dimension, for example, framing audio feature vector is the feature of 128 dimensions Vector is directed to the audio data of one second frame, and the processing for audio data, the information that one second frame is included are not enough to The concrete type for characterizing audio data, so by the context-sensitive framing audio feature vector of the framing audio feature vector It merges, i.e. the corresponding feature vector of 3 seconds audio datas, i.e. 3 framing audio feature vector splicings generate a 128*3= The feature vector of 384 dimensions.

The feature vector is input to music style marking model by sub-step 2023, so that the music wind Lattice marking model exports the music style label of the feature vector；

In this step, the audio feature vector of the above-mentioned default dimension being spliced is inputted into trained first music wind Lattice marking model exports the music style label of each audio feature vector.

Wherein, music style includes preset pop music, hip-hop music, rock music, rhythm and blues, soul sound Pleasure, Reggae, country music, Funk, folk rhyme, Middle East music, Disco, classical music, electronic music, Latin music, Blues, Child music, new generation music, vocal music, African music, Christmas, Asian music, Ska, separate music, traditional music etc..

Certainly, music style be not limited to it is above-mentioned enumerate, the present invention is without restriction to this.

Sub-step 2024 obtains the music style label of each audio data section in the target audio data Number；

In this step, for multiple audio feature vectors that each audio data for being not fixed duration is handled, to wherein Each audio feature vector carry out the output of music style label after, then entire audio data has multiple music styles label, this When, it needs to take voting mechanism, the music style number of tags of each audio feature vector in entire audio data is counted.

Wherein, audio data is divided into the small fragment of 3s-5s, or also commonly uses the small fragment of 8s-10s, then will be above-mentioned small Segment carries out framing and windowing process and Meier converts to obtain image feature data, and each image feature data obtains one Music style label, then a segment of audio data may include multiple music style labels.

For example, each 3 seconds data segments correspond to different labels in one when a length of 5 minutes video datas, then whole A 5 minutes video datas are made of 100 type labels, obtain the corresponding number of each type.

Sub-step 2025, by the number maximum value, or, the number is greater than or equal to the music style mark of preset threshold Corresponding music style is signed, the music style of the target audio data is determined as.

In this step, as described above, 100 music style labels are right respectively in obtaining 5 minutes video data ends After the number answered, the most music style label of number is determined as to the music style label of 5 minutes video datas, or will Music style number of tags is ranked up, and takes out the label of sequence top N, the music style as the audio data section.

Certainly, in practical applications, a quantity threshold can also be preset, being more than in a certain music style number of tags should When quantity threshold, that is, it is arranged to the music style label of the video data, for example, in 100 music style labels, it is pre- to be marked with Signing quantity threshold is 30, wherein the music style label more than 30 has rock music and traditional music, then the audio data Music style label be rock music and traditional music, and above-mentioned label is determined as the corresponding video of the audio data The music style label of data carries out can be merged into traditional rock music when recommendation operates subsequent.

Music style labeling method of the present invention is illustrated below by way of specific example:

1) when carrying out music style label to video data, the audio data of video data is obtained first.

2) audio signal that will acquire carries out framing windowing process and Meier transformation, obtains the Meier frequency spectrum of audio data Figure；

3) Meier spectrogram input VGGish depth model is obtained to the feature vector of the default dimension of Meier spectrogram；

4) by above-mentioned default dimensional characteristics vector input first pass through in advance machine learning algorithm Softmax Classifier into The music style markup model of row training obtains the preset kind label of each default dimensional characteristics vector, such as hip-hop, rock and roll, stream Row, folk rhyme, allusion, electronics etc.；

5) type of most music style number of labels, or the mark more than preset threshold will be finally obtained in audio data Note is determined as the corresponding music style of the audio data.

The embodiment of the invention provides a kind of audio-frequency processing methods, carry out music style to target audio data by receiving Mark request after, obtain target audio data, and according to the mark request, the target audio data are divided into default Each audio data section processing is the feature vector of default dimension by the audio data section of length；By the audio Section feature vector is input to trained first music style marking model, marks music style label；Obtain the target sound The number of frequency music style label of each audio data section in；According to music style number of tags, determining pair The final music style for answering video data realizes the mesh that batch is efficiently labeled video data sound intermediate frequency music style , the cost of labor of music style label is saved, music style labeling effciency is improved.

It should be noted that for simple description, therefore, it is stated as a series of action groups for embodiment of the method It closes, but those skilled in the art should understand that, embodiment of that present invention are not limited by the describe sequence of actions, because according to According to the embodiment of the present invention, some steps may be performed in other sequences or simultaneously.Secondly, those skilled in the art also should Know, the embodiments described in the specification are all preferred embodiments, and the related movement not necessarily present invention is implemented Necessary to example.

Referring to Fig. 3, being a kind of structural schematic diagram of model generating means 300 provided in an embodiment of the present invention, specifically may be used To include following module:

Annotated audio sample generation module 301, for being marked according to preset musical genre labels to sample audio data Note generates annotated audio sample；

Annotated audio data segment obtains module 302, for the annotated audio sample to be cut into the multiple of preset length Annotated audio data segment；

Sample set determining module 303 is marked, for handling each annotated audio data segment for multiple default dimensions Mark sample audio section feature vector, using as mark sample set；

Preferably, the mark sample set determining module 303, comprising:

Annotated audio data segment generates submodule, for respectively carrying out each annotated audio data segment at framing Reason, obtains multiple framing annotated audio data segments of each annotated audio data segment；

Mark adding window audio data section generate submodule, for respectively will each framing annotated audio data segment with add Window function is multiplied, and obtains the mark adding window audio data section of each framing annotated audio data segment；

Mark Meier frequency spectrum data obtains submodule, for each mark adding window audio data section to be carried out plum respectively You convert, and obtain the mark Meier frequency spectrum data of each annotated audio data segment；

Mark sample audio section feature vector obtains submodule, for respectively turning each mark Meier frequency spectrum data It is changed to the feature vector of default dimension, obtains the mark sample audio section feature vector of each mark Meier frequency spectrum data；

Preferably, the mark sample audio section feature vector obtains submodule, comprising:

Sample framing Meier frequency spectrum data determination unit, for by it is described mark Meier frequency spectrum data in each frame audio The corresponding Meier frequency spectrum data of data is determined as sample framing Meier frequency spectrum data；

Sample framing audio feature vector acquiring unit, for the sample framing Meier frequency spectrum data to be converted to sample Framing audio feature vector；

Mark sample audio section feature vector obtain unit, for by the sample framing audio frequency characteristics of default frame number to Amount is spliced, and the mark sample audio section feature vector of default dimension is obtained.

Mark sample set determines submodule, for each mark sample audio section set of eigenvectors to be combined into mark sample Collection.

Sample audio training set generation module 304 is marked, is used for the mark sample sound each in the mark sample set The preset musical genre labels of frequency range feature vector are updated, and obtain mark sample audio training set；

Preferably, the mark sample audio training set generation module 304, comprising:

Training sample feature generates submodule, for extracting the mark from the mark sample set according to preset ratio Sample audio section feature vector, using as training sample feature set；

Second music style marking model training module, for learning the training sample feature set by predetermined depth Method is trained, and obtains the second music style marking model；

Mark sample set submodule is updated, is used for the mark sample audio Duan Te remaining in the mark sample set Vector is levied as test sample feature set, and the test sample feature set is inputted into the music style marking model, so that The second music style marking model exports each mark sample audio section feature vector in the test sample feature set Music style label, generate update mark sample set；

Sample audio training set acquisition submodule is marked, for update mark sample set and the training sample is special Collection merges, and obtains mark sample audio training set.

First music style marking model training module 305, for utilizing deep learning method to the mark sample sound Frequency training set is trained, and obtains first music style marking model.

In the embodiment of the present invention, by annotated audio sample generation module, it is used for according to preset musical genre labels to sample This audio data is labeled, and generates annotated audio sample；Annotated audio data segment obtains module, is used for the annotated audio Sample is cut into multiple annotated audio data segments of preset length；Sample set determining module is marked, is used for each mark sound Frequency data segment processing be multiple default dimensions mark sample audio section feature vector, using as mark sample set；Mark sample Audio training set generation module, for by it is described mark sample set in it is each it is described mark sample audio section feature vector it is described pre- If music style label is updated, mark sample audio training set is obtained；First music style marking model training module is used In being trained using deep learning method to the mark sample audio training set, first music style marking model is obtained, It can be with the label for carrying out music style label to the audio data for not having music style label of efficiently and accurately.

Optionally, in another embodiment, as shown in figure 4, including a kind of apparatus for processing audio 400, described device includes:

Music style marks request receiving module 401, for receiving the mark that target audio data are carried out with music style Request；

Music style labeling module 402, using music style marking model, marks institute for requesting according to the label State the music style of target audio data.

Preferably, the music style labeling module 402, comprising:

The target audio data are divided into pre- by audio data section acquisition submodule for being requested according to the mark If the audio data section of length；

Feature vector acquiring unit, for each audio data section processing is special for the audio section of default dimension Levy vector；

Preferably, the feature vector acquiring unit includes:

Framing audio data section obtains unit, for carrying out sub-frame processing to each audio data section, obtains framing sound Frequency data segment；

Adding window audio data section obtains unit, for the framing audio data section to be multiplied with windowed function, is added Window audio data section；

Meier frequency spectrum data obtains unit, for the adding window audio data section to be carried out Meier transformation, obtains the sound The Meier frequency spectrum data of frequency data segment；

Feature vector acquiring unit, the audio section for the Meier frequency spectrum data to be converted to default dimension are special Levy vector.

Preferably, the feature vector acquiring unit, comprising:

Framing Meier frequency spectrum data determines subelement, for by each frame audio data pair in the Meier frequency spectrum data The Meier frequency spectrum data answered is determined as framing Meier frequency spectrum data；

Framing audio feature vector obtains subelement, special for the framing Meier frequency spectrum data to be converted to framing audio Levy vector；

Feature vector obtains subelement, for spelling the framing audio feature vector of default frame number It connects, obtains the feature vector of default dimension.

Music style label acquisition submodule, for the feature vector to be input to music style mark mould Type, so that the music style marking model exports the music style label of the feature vector；

Music style number of tags acquisition submodule, for obtaining each audio data section in the target audio data The music style label number；

Music style label determines submodule, for being preset or, the number is greater than or equal to by the number maximum value The corresponding music style of music style label of threshold value, is determined as the music style of the target audio data.

In the embodiment of the present invention, music style marks request receiving module, carries out sound to target audio data for receiving The mark of happy style is requested；Music style labeling module, for according to label request, using music style marking model, Mark the music style of the target audio data.Batch is realized efficiently to be labeled video data sound intermediate frequency music style Purpose, save music style label cost of labor, improve music style labeling effciency.

For device embodiment, since it is basically similar to the method embodiment, related so being described relatively simple Place illustrates referring to the part of embodiment of the method.

In the embodiment of the present invention, in the video search request for receiving user's input, first marks the video search and ask The label and tag types are input in video semanteme label independence model by label and tag types in asking, screening Semantic independent label out, and video search is carried out to semantic independent label, it obtains and the independent label phase of the semanteme The video matched.The embodiment of the present invention is scanned for according to the independent label of semanteme filtered out, is reduced due to accidentally searching for label Incoherent video search result is recalled, to improve the accuracy rate of video search.

Optionally, the embodiment of the present invention also provides a kind of terminal, including processor, and memory stores on a memory simultaneously The computer program that can be run on the processor, the computer program realize above-mentioned model generation side when being executed by processor Each process of method or audio-frequency processing method embodiment, and identical technical effect can be reached, it is no longer superfluous here to avoid repeating It states.

Optionally, the embodiment of the present invention also provides a kind of computer readable storage medium, on computer readable storage medium It is stored with computer program, which realizes above-mentioned model generating method or audio-frequency processing method when being executed by processor Each process of embodiment, and identical technical effect can be reached, to avoid repeating, which is not described herein again.Wherein, the meter Calculation machine readable storage medium storing program for executing, such as read-only memory (Read-Only Memory, abbreviation ROM), random access memory (Random Access Memory, abbreviation RAM), magnetic or disk etc..

All the embodiments in this specification are described in a progressive manner, the highlights of each of the examples are with The difference of other embodiments, the same or similar parts between the embodiments can be referred to each other.

It should be understood by those skilled in the art that, the embodiment of the embodiment of the present invention can provide as method, apparatus or calculate Machine program product.Therefore, the embodiment of the present invention can be used complete hardware embodiment, complete software embodiment or combine software and The form of the embodiment of hardware aspect.Moreover, the embodiment of the present invention can be used one or more wherein include computer can With in the computer-usable storage medium (including but not limited to magnetic disk storage, CD-ROM, optical memory etc.) of program code The form of the computer program product of implementation.

The embodiment of the present invention be referring to according to the method for the embodiment of the present invention, terminal device (system) and computer program The flowchart and/or the block diagram of product describes.It should be understood that flowchart and/or the block diagram can be realized by computer program instructions In each flow and/or block and flowchart and/or the block diagram in process and/or box combination.It can provide these Computer program instructions are set to general purpose computer, special purpose computer, Embedded Processor or other programmable data processing terminals Standby processor is to generate a machine, so that being held by the processor of computer or other programmable data processing terminal devices Capable instruction generates for realizing in one or more flows of the flowchart and/or one or more blocks of the block diagram The device of specified function.

These computer program instructions, which may also be stored in, is able to guide computer or other programmable data processing terminal devices In computer-readable memory operate in a specific manner, so that instruction stored in the computer readable memory generates packet The manufacture of command device is included, which realizes in one side of one or more flows of the flowchart and/or block diagram The function of being specified in frame or multiple boxes.

These computer program instructions can also be loaded into computer or other programmable data processing terminal devices, so that Series of operation steps are executed on computer or other programmable terminal equipments to generate computer implemented processing, thus The instruction executed on computer or other programmable terminal equipments is provided for realizing in one or more flows of the flowchart And/or in one or more blocks of the block diagram specify function the step of.

Although the preferred embodiment of the embodiment of the present invention has been described, once a person skilled in the art knows bases This creative concept, then additional changes and modifications can be made to these embodiments.So the claim is intended to be construed to Including preferred embodiment and fall into all change and modification of range of embodiment of the invention.

Finally, it is to be noted that, herein, relational terms such as first and second and the like be used merely to by One entity or operation are distinguished with another entity or operation, without necessarily requiring or implying these entities or operation Between there are any actual relationship or orders.Moreover, the terms "include", "comprise" or its any other variant meaning Covering non-exclusive inclusion, so that process, method, article or terminal device including a series of elements not only wrap Those elements are included, but also including other elements that are not explicitly listed, or further includes for this process, method, article Or the element that terminal device is intrinsic.In the absence of more restrictions, being wanted by what sentence "including a ..." limited Element, it is not excluded that there is also other identical elements in process, method, article or the terminal device for including the element.

It above can to a kind of model generating method provided by the present invention, audio-frequency processing method, device, terminal and computer Storage medium is read, is described in detail, specific case used herein carries out the principle of the present invention and embodiment It illustrates, the above description of the embodiment is only used to help understand the method for the present invention and its core ideas；Meanwhile for this field Those skilled in the art, according to the thought of the present invention, there will be changes in the specific implementation manner and application range, to sum up Described, the contents of this specification are not to be construed as limiting the invention.

Claims

1. a kind of model generating method characterized by comprising

Each annotated audio data segment is handled into the mark sample audio section feature vector for multiple default dimensions, using as mark Infuse sample set；

The preset musical genre labels of the mark sample audio section feature vector each in the mark sample set are carried out It updates, obtains mark sample audio training set；

The mark sample audio training set is trained using deep learning method, obtains first music style mark mould Type.

2. the method according to claim 1, wherein described by the mark sample each in the mark sample set The preset musical genre labels of feature vector are updated, and obtain mark sample audio training set, comprising:

According to preset ratio, the mark sample audio section feature vector is extracted from the mark sample set, using as trained sample Eigen collection；

The training sample feature set is trained by predetermined deep learning method, obtains the second music style mark mould Type；

Using the mark sample audio section feature vector remaining in the mark sample set as test sample feature set, and will The test sample feature set inputs the second music style marking model, so that the second music style marking model is defeated Each music style label for marking sample audio section feature vector in the test sample feature set out, generates and updates mark Sample set；

Update mark sample set is merged with the training sample feature set, obtains mark sample audio training set.

3. the method according to claim 1, wherein described handle each annotated audio data segment is multiple The mark sample audio section feature vector of default dimension, using as mark sample set, comprising:

Each annotated audio data segment is subjected to sub-frame processing respectively, obtains multiple points of each annotated audio data segment Frame annotated audio data segment；

Each framing annotated audio data segment is multiplied with windowed function respectively, obtains each framing annotated audio data The mark adding window audio data section of section；

Each mark adding window audio data section is subjected to Meier transformation respectively, obtains the mark of each annotated audio data segment Infuse Meier frequency spectrum data；

The feature vector that each mark Meier frequency spectrum data is converted to default dimension respectively, obtains each mark Meier The mark sample audio section feature vector of frequency spectrum data；

Each mark sample audio section set of eigenvectors is combined into mark sample set.

4. according to the method described in claim 3, it is characterized in that, described respectively turn each mark Meier frequency spectrum data It is changed to the feature vector of default dimension, obtains the mark sample audio section feature vector of each mark Meier frequency spectrum data, packet It includes:

By the corresponding Meier frequency spectrum data of each frame audio data in the mark Meier frequency spectrum data, it is determined as sample framing Meier frequency spectrum data；

The sample framing Meier frequency spectrum data is converted into sample framing audio feature vector；

The sample framing audio feature vector of default frame number is spliced, the mark sample audio section of default dimension is obtained Feature vector.

5. a kind of audio-frequency processing method characterized by comprising

It is requested according to the label, using music style marking model, marks the music style of the target audio data；It is described Music style marking model is to be obtained using any one of any one of claims 1 to 44 the method.

6. according to the method described in claim 5, it is characterized in that, it is described according to the label request, utilize music style mark Injection molding type marks the music style of the target audio data, comprising:

It is requested according to the mark, the target audio data is divided into the audio data section of preset length；

It is the feature vector of default dimension by each audio data section processing；

The feature vector is input to music style marking model, so that the music style marking model exports institute State the music style label of feature vector；

Obtain the number of the music style label of each audio data section in the target audio data；

By the number maximum value, or, the number is greater than or equal to the corresponding music wind of music style label of preset threshold Lattice are determined as the music style of the target audio data.

7. according to the method described in claim 6, it is characterized in that, described handle each audio data section for default dimension Feature vector, comprising:

Sub-frame processing is carried out to each audio data section, obtains framing audio data section；

The framing audio data section is multiplied with windowed function, obtains adding window audio data section；

The adding window audio data section is subjected to Meier transformation, obtains the Meier frequency spectrum data of the audio data section；

The Meier frequency spectrum data is converted to the feature vector of default dimension.

8. the method according to the description of claim 7 is characterized in that described be converted to default dimension for the Meier frequency spectrum data Feature vector, comprising:

By the corresponding Meier frequency spectrum data of each frame audio data in the Meier frequency spectrum data, it is determined as framing Meier frequency spectrum Data；

The framing Meier frequency spectrum data is converted into framing audio feature vector；

The framing audio feature vector of default frame number is spliced, the feature vector of default dimension is obtained.

9. a kind of model generating means characterized by comprising

Annotated audio sample generation module is generated for being labeled according to preset musical genre labels to sample audio data Annotated audio sample；

Annotated audio data segment obtains module, for the annotated audio sample to be cut into multiple annotated audios of preset length Data segment；

Sample set determining module is marked, for handling each annotated audio data segment for the mark sample of multiple default dimensions Feature vector, using as mark sample set；

Sample audio training set generation module is marked, is used for the mark sample audio section feature each in the mark sample set The preset musical genre labels of vector are updated, and obtain mark sample audio training set；

First music style marking model training module, for utilizing deep learning method to the mark sample audio training set It is trained, obtains first music style marking model.

10. device according to claim 9, which is characterized in that the mark sample audio training set generation module, packet It includes:

Training sample feature generates submodule, for extracting the mark sample from the mark sample set according to preset ratio Feature vector, using as training sample feature set；

Second music style marking model training module, for the training sample feature set to be passed through predetermined deep learning method It is trained, obtains the second music style marking model；

Update mark sample set submodule, for by the mark sample audio section feature remaining in the mark sample set to Amount is used as test sample feature set, and the test sample feature set is inputted the second music style marking model, so that The second music style marking model exports each mark sample audio section feature vector in the test sample feature set Music style label, generate update mark sample set；

Sample audio training set acquisition submodule is marked, for the update to be marked sample set and the training sample feature set Merge, obtains mark sample audio training set.

11. device according to claim 9, which is characterized in that the mark sample set determining module, comprising:

Annotated audio data segment generates submodule, for each annotated audio data segment to be carried out sub-frame processing respectively, obtains To multiple framing annotated audio data segments of each annotated audio data segment；

It marks adding window audio data section and generates submodule, for respectively by each framing annotated audio data segment and adding window letter Number is multiplied, and obtains the mark adding window audio data section of each framing annotated audio data segment；

Mark Meier frequency spectrum data obtains submodule, for each mark adding window audio data section to be carried out Meier change respectively It changes, obtains the mark Meier frequency spectrum data of each annotated audio data segment；

Mark sample audio section feature vector obtains submodule, for being respectively converted to each mark Meier frequency spectrum data The feature vector of default dimension obtains the mark sample audio section feature vector of each mark Meier frequency spectrum data；

Mark sample set determines submodule, for each mark sample audio section set of eigenvectors to be combined into mark sample set.

12. device according to claim 11, which is characterized in that the mark sample audio section feature vector obtains submodule Block, comprising:

Sample framing Meier frequency spectrum data determination unit, for by it is described mark Meier frequency spectrum data in each frame audio data Corresponding Meier frequency spectrum data is determined as sample framing Meier frequency spectrum data；

Mark sample audio section feature vector obtain unit, for by the sample framing audio feature vector of default frame number into Row splicing obtains the mark sample audio section feature vector of default dimension.

13. a kind of apparatus for processing audio characterized by comprising

Music style marks request receiving module, for receiving the mark request that target audio data are carried out with music style；

Music style labeling module, using music style marking model, marks the target sound for requesting according to the label The music style of frequency evidence.

14. device according to claim 13, which is characterized in that the music style labeling module, comprising:

The target audio data are divided into default length for requesting according to the mark by audio data section acquisition submodule The audio data section of degree；

Feature vector acquiring unit, for by each audio data section processing for default dimension feature to Amount；

Music style label acquisition submodule, for the feature vector to be input to music style marking model, with The music style marking model is set to export the music style label of the feature vector；

Music style number of tags acquisition submodule, for obtaining the institute of each audio data section in the target audio data State the number of music style label；

Music style label determines submodule, is used for the number maximum value, or, the number is greater than or equal to preset threshold The corresponding music style of music style label, be determined as the music style of the target audio data.

15. device according to claim 14, which is characterized in that the feature vector acquiring unit includes:

Framing audio data section obtains unit, for carrying out sub-frame processing to each audio data section, obtains framing audio number According to section；

Adding window audio data section obtains unit, for the framing audio data section to be multiplied with windowed function, obtains adding window sound Frequency data segment；

Meier frequency spectrum data obtains unit, for the adding window audio data section to be carried out Meier transformation, obtains the audio number According to the Meier frequency spectrum data of section；

Feature vector acquiring unit, for the Meier frequency spectrum data is converted to the feature of default dimension to Amount.

16. device according to claim 15, which is characterized in that the feature vector acquiring unit, comprising:

Framing Meier frequency spectrum data determines subelement, for each frame audio data in the Meier frequency spectrum data is corresponding Meier frequency spectrum data is determined as framing Meier frequency spectrum data；

Framing audio feature vector obtain subelement, for by the framing Meier frequency spectrum data be converted to framing audio frequency characteristics to Amount；

Feature vector obtains subelement, for splicing the framing audio feature vector of default frame number, obtains To the feature vector of default dimension.

17. a kind of terminal characterized by comprising memory, processor and be stored on the memory and can be at the place The computer program run on reason device is realized when the computer program is executed by the processor as appointed in Claims 1-4 Step in one model generating method, or the step of the audio-frequency processing method as described in any one of claim 5 to 8 Suddenly.

18. a kind of computer readable storage medium, which is characterized in that be stored with computer on the computer readable storage medium Program is realized as described in any one of claims 1 to 4 when the computer program is executed by processor in model generating method The step of, or the step of audio-frequency processing method as described in any one of claim 5 to 8.