CN114420075A - Audio processing method and device, equipment and computer readable storage medium


Info

Publication number
CN114420075A
Authority
CN
China
Prior art keywords
interval
audio
probability
processed
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210082760.6A
Other languages
Chinese (zh)
Inventor
田思达
彭博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority to CN202210082760.6A
Publication of CN114420075A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H 1/00 Details of electrophonic musical instruments
    • G10H 1/0008 Associated control or indicating means
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G10H 2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H 2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present application disclose an audio processing method, an audio processing apparatus, an audio processing device, and a computer-readable storage medium. The method comprises the following steps: determining a candidate interval according to the time sequence characteristics of the audio to be processed; calculating the matching probability of the candidate interval based on the starting point probability, the end point probability and the interval probability corresponding to the candidate interval; and determining a complete and accurate refrain interval according to the comparison result between the matching probability and a first preset threshold. Because the interval probability of the candidate interval is used, the method and the device can accurately determine the middle position of the refrain interval, avoid overlap between multiple refrain intervals, and improve the accuracy of refrain interval positioning.

Description

Audio processing method and device, equipment and computer readable storage medium
Technical Field
The present application relates to the field of computers, and in particular, to an audio processing method, apparatus, device, and computer-readable storage medium.
Background
A music refrain (chorus) is a repeated passage in a song, usually occurring between verses. The refrain generally carries the strongest emotion of a piece of music and engages a singer most, so singers often focus on practicing from the starting moment of the refrain interval in order to release emotion quickly and to memorize and master the rhythm of the song.
A refrain interval is the complete interval bounded by the start time and the end time of the refrain. Singers generally reach the refrain interval by manually adjusting the playback progress of the song, which is very inconvenient. Existing ways of locating the music refrain generally rely on manual labeling; however, the total duration of a song, the duration of its refrain and the number of refrain repetitions differ from song to song, so manual labeling usually consumes a lot of time, is subject to human error, and makes it difficult to locate refrains automatically while processing a large amount of music in batches.
With the rise of online karaoke platforms and the explosive demand for public singing entertainment, especially the demand for singing the refrain, how to locate the music refrain intelligently, quickly and accurately is a technical problem to be solved urgently in the field of audio processing.
Disclosure of Invention
To solve the above technical problems, embodiments of the present application provide an audio processing method and apparatus, an audio processing device, and a computer-readable storage medium, in which a refrain starting point probability, an interval probability and an end point probability are obtained from the time sequence characteristics of the audio to be processed, so that the music refrain can be located intelligently, quickly and accurately.
Other features and advantages of the present application will be apparent from the following detailed description, or may be learned by practice of the application.
According to an aspect of an embodiment of the present application, there is provided an audio processing method, including:
determining a candidate interval contained in the audio to be processed according to the time sequence characteristics of the audio to be processed;
calculating the matching probability of the candidate interval based on the starting point probability, the end point probability and the interval probability corresponding to the candidate interval;
and if the matching probability is greater than a first preset threshold value, determining the candidate interval as a refrain interval contained in the audio to be processed.
According to an aspect of an embodiment of the present application, there is provided an audio processing apparatus including:
the determining module is configured to determine a candidate interval contained in the audio to be processed according to the time sequence characteristics of the audio to be processed;
the calculation positioning module is configured to calculate the matching probability of the candidate interval based on the starting point probability, the end point probability and the interval probability corresponding to the candidate interval; and if the matching probability is greater than a first preset threshold value, determining the candidate interval as a refrain interval contained in the audio to be processed.
In another exemplary embodiment, the audio processing apparatus further includes:
the subtask module is configured to construct a first deep network, and the first deep network is formed by sequentially stacking a convolutional neural network layer, a multi-layer long-short-term LSTM network and a full-connection network layer; inputting audio to be processed into the first deep network; and extracting deep features from the first deep network as time sequence features of the audio to be processed.
In another embodiment, the subtask module further includes:
and the extraction unit is configured to extract the characteristics output by the last LSTM network layer as the time sequence characteristics of the audio to be processed.
In another embodiment, the determining module comprises:
the time sequence modeling unit is configured to input the time sequence characteristics of the audio to be processed into a second deep network, and obtain the refraining position probabilities of all moments contained in the audio to be processed output by the second deep network, wherein the refraining position probabilities comprise a starting point probability and an end point probability;
and the candidate interval construction unit is configured to determine a candidate interval from the audio to be processed according to the obtained starting point probability and the end point probability.
In another embodiment, the candidate interval construction unit is configured to select the first time as the start time of the candidate interval if the starting point probability corresponding to the first time of the audio to be processed is greater than a second preset threshold or greater than the starting point probabilities at all other times in a preset interval; and to select the second time as the end time of the candidate interval if the end point probability corresponding to the second time of the audio to be processed is greater than a third preset threshold or greater than the end point probabilities at all other times in a preset interval.
In another embodiment, the computational location module comprises:
the interval feature extraction unit is configured to determine the interval features of the candidate intervals according to the time sequence features of the audio to be processed;
and the interval probability extraction unit is configured to input the interval characteristics into a third deep network to obtain the interval probability of the candidate interval output by the third deep network.
In another embodiment, the computational location module comprises:
the interval feature extraction unit is configured to respectively perform feature extraction on the candidate interval, the interval located near the starting point of the candidate interval and the interval located near the end point of the candidate interval to obtain corresponding candidate interval features, wherein the feature length of each candidate interval feature is the same; splicing the candidate interval features to obtain the interval features of the candidate intervals;
and the interval probability extraction unit is configured to input the interval characteristics into a third deep network to obtain the interval probability of the candidate interval output by the third deep network.
According to an aspect of an embodiment of the present application, there is provided an audio processing device including: a receiving means for receiving audio to be processed; and a controller electrically connected with the receiving means and configured to execute the above method.
According to an aspect of embodiments of the present application, there is also provided a computer-readable storage medium having stored thereon computer-readable instructions, which, when executed by a processor of a computer, cause the computer to perform the above-mentioned method.
According to an aspect of an embodiment of the present application, there is also provided a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the above-described method.
According to the technical solution described above, the candidate interval is determined according to the time sequence characteristics of the audio to be processed, and the matching probability of the candidate interval is calculated based on the starting point probability, end point probability and interval probability corresponding to the candidate interval. The interval probability allows the middle position of the refrain interval to be determined accurately and prevents multiple refrain intervals from overlapping, and a complete and accurate refrain interval can be determined according to the comparison result between the matching probability and the first preset threshold.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
FIG. 1 is a schematic illustration of an implementation environment to which the present application relates;
FIG. 2 is a flow chart illustrating a method of audio processing according to an exemplary embodiment of the present application;
FIG. 3 is a flow chart of another audio processing method proposed based on the embodiment shown in FIG. 2;
fig. 4 is a flowchart illustrating a method for determining candidate intervals included in audio to be processed according to another exemplary embodiment of the present application;
fig. 5 is a flowchart illustrating a method for determining candidate intervals included in audio to be processed according to another exemplary embodiment of the present application;
fig. 6 is a flowchart illustrating an example of obtaining an interval probability of a candidate interval included in audio to be processed according to another embodiment of the present application;
fig. 7 is a flowchart illustrating another method for obtaining interval probabilities of candidate intervals included in audio to be processed according to another exemplary embodiment of the present application;
fig. 8 is a schematic structural diagram of an audio processing apparatus according to an exemplary embodiment of the present application;
fig. 9 is a schematic flowchart illustrating audio processing performed by an audio processing apparatus according to an exemplary embodiment of the present application;
fig. 10 is a schematic diagram illustrating a computer system of an audio processing device according to an exemplary embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
Reference to "a plurality" in this application means two or more. "And/or" describes the association relationship of the associated objects and means that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
First, Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing and machine learning/deep learning.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence, the fundamental way to make computers intelligent, and is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from instruction.
Cloud technology refers to a hosting technology that unifies resources such as hardware, software and networks within a wide area network or a local area network to realize the computation, storage, processing and sharing of data. The method disclosed in the embodiments of the present application can perform audio processing in the cloud, store the located music refrains in a remote database, and send them directly to other endpoints.
Cloud technology is a general term for the network, information, integration, management-platform and application technologies applied in the cloud computing business model; it can form resource pools that are used on demand and are flexible and convenient. Cloud computing will become an important supporting technology. The background services of networked systems, such as video websites, picture websites and other web portals, require large amounts of computing and storage resources. With the rapid development and application of the internet industry, every item may have its own identification mark that needs to be transmitted to a background system for logical processing; data at different levels are processed separately, and all kinds of industrial data require strong back-end system support, which can only be realized through cloud computing.
The audio processing method and apparatus, the audio processing device, and the computer-readable storage medium according to the embodiments of the present application relate to the artificial intelligence technology, the machine learning technology, and the cloud technology, which are described above, and the embodiments will be described in detail below.
Referring first to fig. 1, fig. 1 is a schematic diagram of an implementation environment related to the present application. The implementation environment includes a terminal 100 and a server 200, and the terminal 100 and the server 200 communicate with each other through a wired or wireless network.
The terminal 100 is configured to receive a video or audio file and transmit the received video or audio file to the server 200; the server 200 obtains the audio to be processed according to the video or audio file, determines that the audio to be processed includes the refrain interval, and sends the determined refrain interval to the terminal 100.
For example, after obtaining the audio to be processed from the video or audio file, the server 200 determines a candidate interval included in the audio to be processed according to the time sequence characteristics of the audio to be processed, calculates the matching probability of the candidate interval based on the starting point probability, end point probability and interval probability corresponding to the candidate interval, and determines the candidate interval as a refrain interval included in the audio to be processed if the matching probability is greater than a first preset threshold.
The user terminal 100 includes, but is not limited to, mobile phones, computers, intelligent voice interaction devices, smart appliances, vehicle-mounted terminals and the like; any electronic device capable of presenting the relevant content, such as a smartphone, tablet, notebook or desktop computer, may be used, and this is not limited here. The server 200 may be an independent physical server, or a server cluster or distributed system composed of multiple physical servers, where the servers may form a blockchain with each server acting as a node on the blockchain. The server 200 may also be a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms, which is not limited here.
Referring to fig. 2, fig. 2 is a flowchart illustrating an audio processing method according to an exemplary embodiment of the present application, which may be specifically executed by the server 200 in the implementation environment shown in fig. 1. Of course, the method may also be applied to other implementation environments and executed by a server device in other implementation environments, and the embodiment is not limited thereto. As shown in fig. 2, the method at least includes steps S210 to S230, which are described in detail as follows:
S210: And determining the candidate interval contained in the audio to be processed according to the time sequence characteristics of the audio to be processed.
The audio to be processed in this embodiment may be obtained by preprocessing a video or music file.
For example, an audio track is extracted from a video or music file as the audio to be processed. The audio to be processed includes an audio time sequence, an audio duration and the like. The audio time sequence indicates that the audio parameters are arranged in chronological order; for example, a tone is a musical parameter of the audio to be processed, and a plurality of tones arranged in a certain chronological order combine into a piece of music, that is, a piece of audio to be processed. In addition, the audio duration includes a plurality of audio moments, and each audio moment has a corresponding tone.
The time sequence characteristic of this embodiment is a characteristic related to the audio time sequence, obtained by performing feature extraction on the audio to be processed; it is associated with the audio duration or the audio moments and can be used to determine the candidate interval in the audio to be processed.
The candidate interval of this embodiment is not constructed arbitrarily; it is an interval determined from the audio interval of the audio to be processed according to the time sequence characteristics. It can be understood as a section of audio extracted or divided from the audio interval of the audio to be processed by using the time sequence characteristics: for example, the starting point and the end point of the candidate interval within the audio interval of the audio to be processed are obtained by using the time sequence characteristics, and the candidate interval can then be delimited by that starting point and end point. It should be noted that step S210 may determine a plurality of candidate intervals, which is not limited here.
Illustratively, the time sequence characteristics of the audio to be processed are input into a deep network to obtain a probability matrix P (T x 3), which represents the probability that each moment belongs to the start, the interior and the end of the refrain. Accordingly, the moment with the largest refrain start probability is taken as the start moment of the candidate interval, and the moment with the largest refrain end probability is taken as the end moment of the candidate interval, so that a candidate interval is determined. Since multiple refrain intervals, such as refrains A, B and C, may exist in one song, multiple start moments and end moments can be determined correspondingly, multiple candidate intervals can be determined, and the multiple candidate intervals are then judged and screened to determine one or more accurate refrain intervals.
S220: and calculating the matching probability of the candidate interval based on the corresponding starting point probability, end point probability and interval probability of the candidate interval.
The starting point probability represents the probability that a moment in the candidate interval is the start position point, or start moment, of a refrain interval of the audio to be processed; its value range is [0, 1]. For example, if the starting point probability of a candidate interval is 0.9, the probability that this position or moment is the start position point or start moment of a refrain interval of the audio to be processed is 0.9; the larger the probability, the higher the degree of matching.
The end point probability represents the probability that a moment in the candidate interval is the end position point, or end moment, of a refrain interval of the audio to be processed; its value range is [0, 1]. For example, if the end point probability of a candidate interval is 0.8, the probability that this position or moment is the end position point or end moment of a refrain interval of the audio to be processed is 0.8. In the same way, the larger the probability, the higher the degree of matching; the smaller the probability, the lower the degree of matching.
In general, since the candidate interval has already been determined in step S210, the starting point probability and the end point probability corresponding to the candidate interval in this step are both high, with values of at least 0.5, because a candidate interval is not determined from moments with small starting point or end point probabilities in the preceding step; that is, a position point or moment with a small starting point or end point probability is not used as the start position point (start moment) or end position point (end moment) of a candidate interval. In special cases, however, the starting point probability or end point probability of a determined candidate interval may be smaller because of uncertain factors in the preceding step, which is not discussed here; the description of the general case therefore does not limit the value ranges of the starting point probability and the end point probability in this embodiment.
The interval probability represents the probability that the candidate interval belongs to the middle section of a refrain interval of the audio to be processed, and its value range is [0, 1]. The interval probability reflects the relevance between the start position point and the end position point of the candidate interval, so that all refrain intervals of the audio to be processed can be determined accurately and completely.
The term "calculating" in this embodiment means multiplying the starting point probability, the end point probability and the interval probability; the resulting product is the matching probability. For example, if the starting point probability of a candidate interval is 0.9, the end point probability is 0.8 and the interval probability is 0.7, the matching probability of the candidate interval is 0.9 × 0.8 × 0.7 = 0.504.
The matching probability represents the degree of matching between the candidate interval and the refrain interval. The matching probability of each candidate interval is not necessarily the same; the higher the matching probability, the more likely the candidate interval is a refrain interval in the audio to be processed.
It should be particularly noted that the interval probability introduced in this embodiment reflects the relevance between the start position point and the end position point of the candidate interval. For example, refrain intervals A, B and C may exist in one piece of audio to be processed. When determining a candidate interval, it may happen that the start position point of refrain interval A is taken as the start position point of the candidate interval while the end position point of refrain interval B is taken as its end position point. Because other audio that does not belong to any refrain interval may exist between refrain interval A and refrain interval B, the resulting candidate interval is not a refrain interval in the true sense. In this case, the starting point probability and the end point probability of the candidate interval are both relatively high, such as 0.97 and 0.95, and their product is as high as 0.9215; but because the start position point and the end position point of the candidate interval do not belong to the same refrain interval, and an audio section that does not belong to any refrain interval lies between them, their relevance is low and the interval probability is small, for example 0.12, so the matching probability of the candidate interval is only about 0.11. The matching probability is thus clearly reduced, which avoids confusing the start and end position points of different refrain intervals and prevents inaccurate candidate intervals.
In addition, because more than one refrain interval may exist in a piece of audio to be processed, introducing the interval probability makes it possible to determine, in one pass, all candidate intervals that may be refrain intervals of the audio to be processed. While the start and end position points of refrain intervals A, B and C are determined accurately, the middle sections of the corresponding refrain intervals can also be determined accurately, so that at least three candidate intervals that match refrain intervals A, B and C well can be obtained.
S230: and if the matching probability is greater than a first preset threshold value, determining the candidate interval as a refrain interval contained in the audio to be processed.
The first preset threshold is the threshold of the matching probability corresponding to a refrain interval and is usually an empirical value. In this embodiment, the first preset threshold is a number greater than 0 and less than 1, for example 0.5; if the matching probability of a candidate interval is greater than 0.5, the candidate interval is output and determined to be a refrain interval contained in the audio to be processed. The value range of the matching probability is [0, 1]. The special case in which the matching probability equals 1, an ideal critical value, would indicate that the candidate interval is a refrain interval contained in the audio to be processed with 100% certainty. In the deep networks or related algorithm models introduced in the subsequent embodiments of this application, the intermediate values produced in the process do not reach this critical value, and the starting point probability, end point probability and interval probability are not all equal to 1; however, the matching probability of a candidate interval is not limited by the particular deep network or algorithm model selected, that is, the description in this embodiment does not limit its value.
The matching probability of each candidate interval is compared with the first preset threshold; a direct numerical comparison is more intuitive and convenient. If the parameters in the preceding steps are adjusted, only the value of the first preset threshold needs to be fine-tuned, that is, changed, for the method to remain usable for locating the refrain interval.
The interval probability of this embodiment makes it possible to determine the middle position of the refrain interval accurately and to avoid overlap between multiple refrain intervals, and a complete and accurate refrain interval is determined according to the comparison result between the matching probability and the first preset threshold.
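The product-and-threshold logic of steps S220 and S230 can be illustrated with a short sketch. The Python snippet below is only an illustrative example; the candidate list layout, the field names and the default threshold of 0.5 are assumptions taken from the examples in the text, not a definitive implementation.

    def locate_refrain_intervals(candidates, first_threshold=0.5):
        """candidates: list of (start, end, p_start, p_end, p_interval) tuples.

        The matching probability is the product of the starting point probability,
        the end point probability and the interval probability (step S220); a
        candidate whose matching probability exceeds the first preset threshold
        is kept as a refrain interval (step S230)."""
        refrains = []
        for start, end, p_start, p_end, p_interval in candidates:
            match_prob = p_start * p_end * p_interval
            if match_prob > first_threshold:
                refrains.append((start, end, match_prob))
        return refrains

    # Example from the text: 0.9 x 0.8 x 0.7 = 0.504 > 0.5, so the candidate is kept.
    print(locate_refrain_intervals([(95.0, 130.0, 0.9, 0.8, 0.7)]))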
Referring to fig. 3, fig. 3 is a flowchart of another audio processing method according to the embodiment shown in fig. 2. The method further includes steps S310 to S330 before step S210 shown in fig. 2, which will be described in detail below:
S310: And constructing a first deep network, wherein the first deep network is formed by sequentially stacking a convolutional neural network layer, a multi-layer long short-term memory (LSTM) network and a fully-connected network layer.
The first deep network of this embodiment is formed by stacking a Conv2D network layer, an LSTM network layer and an FC (fully connected) layer in this order.
Conv2D denotes a two-dimensional convolutional layer of a Convolutional Neural Network (CNN), a feedforward neural network that contains convolution computation, has a deep structure, and is one of the representative algorithms of deep learning. The output of Conv2D has the shape (batch, new_rows, new_cols, filters), where new_rows and new_cols are the numbers of rows and columns of the output matrix obtained after one convolution kernel is slid over the input (since the kernel can move both rightward and downward, each kernel produces one matrix), and there are as many output matrices as there are filters.
LSTM (Long Short-Term Memory) is a recurrent neural network specially designed to solve the long-term dependency problem of general RNNs (recurrent neural networks); all RNNs have the form of a chain of repeated neural network modules.
FC (fully connected network) combines the extracted features nonlinearly to produce an output. The fully connected layer is located at the last part of the hidden layers of the convolutional neural network and only transmits signals to other fully connected layers. In the fully connected layer the feature map loses its spatial topology; it is expanded into a vector and passed through the activation function.
In an earlier stage, a large number of songs with lyric information are collected; each song is accompanied by the start and end time of each line of lyrics, and all refrain intervals of each audio are labeled for training the first deep network. To prevent errors in the positions of interval boundaries during labeling, it is agreed in advance that the boundaries of a refrain interval coincide with the start and end times of two lines of lyrics, so that only whether a line of lyrics belongs to the refrain interval needs to be judged, which improves the consistency and validity of the labeling results.
Illustratively, the first deep network may be constructed according to the following procedure. A batch of audio intervals is randomly extracted, where each audio interval has the same length, for example 30 s. For each audio interval, a Mel spectrogram is extracted (the audio spectrogram is mapped to the Mel scale to obtain the Mel spectrum), converting the one-dimensional audio signal into a two-dimensional spectral signal of shape (time_steps, mel_bins); the first deep network is then constructed, which takes the spectral signal as input and outputs the audio attribute.
In this embodiment, a first deep network is obtained by using a stacked Conv2D, LSTM and FC network structure, on one hand, by continuously modeling frequency spectrum information, not only can the depth of the network be increased, but also complete time information can be retained; on the other hand, the first deep network can output the probability that each moment belongs to the refrain interval so as to facilitate the subsequent determination of the candidate interval.
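As a concrete illustration of the stacked Conv2D, LSTM and FC structure described above, the following Keras sketch builds a network of this shape. It is only a minimal example under assumed hyperparameters (30 s clips at 100 frames per second, 64 Mel bins; the layer widths and pooling sizes are assumptions, not values disclosed in the patent), and not the network actually used by the applicant.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    def build_first_deep_network(time_steps=3000, mel_bins=64):
        # Input: Mel spectrogram of one audio interval, shape (time_steps, mel_bins, 1).
        inp = layers.Input(shape=(time_steps, mel_bins, 1))
        # Conv2D stage: pool only along the frequency axis so the full time resolution is kept.
        x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inp)
        x = layers.MaxPooling2D(pool_size=(1, 2))(x)
        x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=(1, 2))(x)
        # Collapse the remaining frequency axis into the channel axis: (time_steps, features).
        x = layers.Reshape((time_steps, (mel_bins // 4) * 64))(x)
        # Multi-layer LSTM stage; the last LSTM layer provides the T x D timing features (D = 1024).
        x = layers.LSTM(512, return_sequences=True)(x)
        x = layers.LSTM(1024, return_sequences=True, name="timing_features")(x)
        # FC stage: per-moment binary classification (belongs to a refrain interval or not),
        # trained with a cross-entropy loss as described in the text.
        out = layers.Dense(1, activation="sigmoid")(x)
        model = models.Model(inp, out)
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
        return model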
For training the first deep network, the labeling information can be combined with a cross-entropy loss function (commonly used for classification tasks) to perform binary classification at each moment, so that the accuracy of the network model can reach 89.4%.
It should be noted that the detailed description of the structure of the first deep network in this embodiment does not limit that structure; in a practical application scenario, several LSTM layers may exist in the first deep network.
S320: the audio to be processed is input into a first deep network.
The audio to be processed in this embodiment is a complete song or video that has undergone the relevant preprocessing; it can be used as the input signal of the first deep network and carries the audio interval information.
S330: and extracting deep features from the first deep network as time sequence features of the audio to be processed.
After new audio to be processed is input into the first deep network, this embodiment extracts deep features as the time sequence features of the audio to be processed. The dimensions of the time sequence features may be denoted as T × D, where T denotes the length of the music and D denotes the feature dimension at the current moment, for example D = 1024. Meanwhile, the time sequence features can be used to train the first deep network and continuously improve its performance. The dimensions illustrated here are not limited to T × D; they may be adjusted according to the first deep network so that the dimensions or dimension units vary, which is not limited here and does not limit the scope of protection of the present application.
In addition, the deep features extracted from the first deep network are features of the LSTM network layer, which may be features of the last LSTM network layer or features of the intermediate LSTM network layer, and the embodiment does not specifically limit the extracted positions.
This embodiment further clarifies how to obtain the time sequence features of the audio to be processed and introduces the structure of the first deep network. The prediction accuracy of the first deep network model adopted in this embodiment can reach 89.4%, so the deep features extracted as the time sequence features of the audio to be processed are highly accurate, and the accuracy of locating the refrain can be greatly improved when the candidate interval contained in the audio to be processed is subsequently determined based on these time sequence features.
In another exemplary embodiment, based on the step S330, the feature output by the last LSTM network layer is extracted as the time-series feature of the audio to be processed.
The time sequence feature of the embodiment is the feature of the last LSTM network layer extracted, and is not the feature output by the full connection layer, the deeper the extracted network layer feature is, the more accurate the time sequence feature of the embodiment is, and the accuracy of the time sequence feature used for positioning the subsequent refrain is higher.
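Continuing the hedged Keras sketch above, the output of the last LSTM layer (named "timing_features" in that sketch, an assumed name) can be read out as the T x D time sequence features instead of the output of the fully connected layer; the snippet below only illustrates this idea with a placeholder input batch.

    import numpy as np
    from tensorflow.keras import models

    model = build_first_deep_network()          # the hedged sketch defined above
    # Sub-model that stops at the last LSTM layer instead of the FC output.
    feature_extractor = models.Model(
        inputs=model.input,
        outputs=model.get_layer("timing_features").output,
    )
    # Dummy batch of Mel spectrograms standing in for real preprocessed audio.
    mel_batch = np.zeros((1, 3000, 64, 1), dtype="float32")
    timing_features = feature_extractor.predict(mel_batch)   # shape (1, T, 1024)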
Fig. 4 is a flowchart illustrating a process of determining candidate intervals included in audio to be processed according to another exemplary embodiment of the present application. As shown in fig. 4, the process of determining the candidate intervals included in the audio to be processed according to the time sequence characteristics of the audio to be processed includes steps S410 to S420, which are described in detail below:
S410: And inputting the time sequence characteristics of the audio to be processed into the second deep network to obtain the refrain position probabilities of all moments contained in the audio to be processed output by the second deep network, wherein the refrain position probabilities comprise a starting point probability and an end point probability.
The second deep network in this embodiment is a deep network formed by stacking Conv1D layers, where Conv1D denotes a one-dimensional convolutional layer. The output of Conv1D has the shape (batch, new_steps, filters): one convolution kernel produces a column vector after being slid over the input (since the kernel can only move downward, each kernel yields one column vector), and there are as many column vectors as there are filters.
The time sequence features (T x D) are input into the second deep network, which outputs a probability matrix P (T x 3) representing the probability that each moment is the start point, an interior point or the end point of the refrain. The probability matrix P (T x 3) illustrated here does not mean that it can only take this form in the present application; it can be adjusted according to the second deep network so that the dimensions change, which is not limited here and does not limit the scope of protection of the present application.
In the training stage, a corresponding label matrix G (T x 3) is formed from the labeled data, and the loss between it and the probability matrix P predicted by the network is calculated with a cross-entropy loss.
For example, the (t,0) element value of the probability matrix P may represent the probability that the moment t is the beginning of the refrain, and the (t,1) element value may represent the probability that the moment t is the end of the refrain.
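The second deep network can likewise be sketched in Keras. The snippet below is only an assumed stack of Conv1D layers ending in three per-moment outputs (start, interior, end); the number of layers, the channel widths and the sigmoid activation are illustrative assumptions, not the configuration disclosed in the patent.

    from tensorflow.keras import layers, models

    def build_second_deep_network(time_steps=3000, feat_dim=1024):
        # Input: T x D time sequence features produced by the first deep network.
        inp = layers.Input(shape=(time_steps, feat_dim))
        x = layers.Conv1D(256, 3, padding="same", activation="relu")(inp)
        x = layers.Conv1D(128, 3, padding="same", activation="relu")(x)
        # Three per-moment probabilities: refrain start point, interior point, end point.
        out = layers.Conv1D(3, 1, activation="sigmoid")(x)   # probability matrix P (T x 3)
        return models.Model(inp, out)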
S420: and determining a candidate interval from the audio to be processed according to the obtained starting point probability and the end point probability.
According to the probability matrix P (T x 3), the probability of three types at all times of the audio interval to be processed can be determined, namely the probability of the starting point of the chorus, the probability of the internal point and the probability of the end point, so that the candidate interval can be determined from the audio to be processed according to the probability of the starting point and the probability of the end point.
This embodiment illustrates how to determine the candidate interval contained in the audio to be processed: the time sequence features are input into the second deep network, which outputs the probability matrix giving the refrain position probabilities, and the candidate interval is then determined from the audio to be processed according to the starting point probability and the end point probability. Because the time sequence features are processed by the second deep network, the output probability matrix is more strongly related to the refrain position probabilities, so the obtained candidate interval is more accurate and matches the refrain interval better.
Fig. 5 is a flowchart illustrating a method for determining candidate intervals included in audio to be processed according to another exemplary embodiment of the present application. The process of determining the candidate interval from the audio to be processed according to the obtained starting point probability and the end point probability includes steps S510 to S520, which are described in detail below:
S510: And if the starting point probability corresponding to the first moment of the audio to be processed is greater than a second preset threshold, or greater than the starting point probabilities at all other moments in a preset interval, selecting the first moment as the starting moment of the candidate interval.
The preset interval is a fixed-length interval near the position point corresponding to the first moment of the audio to be processed. For example, if the first moment of the audio to be processed is 1 minute 30 seconds and the length of the preset interval is 20 seconds, the position point corresponding to 1 minute 20 seconds is the start of the preset interval and the position point corresponding to 1 minute 40 seconds is its end. In addition, only the first moment needs to lie within the preset interval; the specific start and end of the preset interval are not limited. For example, with the first moment at 1 minute 30 seconds, the position point corresponding to 1 minute 28 seconds may be the start of the preset interval and the position point corresponding to 1 minute 48 seconds its end. The length of the preset interval and the positions of its start and end points given here are only examples and do not limit the scope of protection of the present application.
According to the probability matrix P (T x 3) in step S410, if the starting point probability corresponding to a certain moment is greater than the second preset threshold, or greater than the starting point probabilities at all other moments in the preset interval, that moment is defined as the start moment of the candidate interval.
Illustratively, if the second preset threshold is 0.9 and the starting point probability at 2 minutes 10 seconds is 0.92, which is greater than 0.9, then 2 minutes 10 seconds is defined as the start moment of the candidate interval. Alternatively, suppose the preset interval is 10 seconds long, starting at the position point corresponding to 2 minutes 5 seconds and ending at the position point corresponding to 2 minutes 15 seconds, with one moment per second, giving 10 other moments between 2 minutes 5 seconds and 2 minutes 15 seconds whose starting point probabilities are 0.2, 0.25, 0.50, 0.6, 0.66, 0.32, 0.8, 0.5, 0.4 and 0.7 in sequence; since the starting point probability 0.92 at 2 minutes 10 seconds is greater than the starting point probabilities at all of these moments in the preset interval, 2 minutes 10 seconds is selected as the start moment of the candidate interval.
S520: And if the end point probability corresponding to the second moment of the audio to be processed is greater than a third preset threshold, or greater than the end point probabilities at all other moments in the preset interval, selecting the second moment as the end moment of the candidate interval.
Similarly, if the end point probability corresponding to a certain moment is greater than the third preset threshold, or greater than the end point probabilities at all other moments in the preset interval, that moment is defined as the end moment of the candidate interval.
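The selection rule of steps S510 and S520 can be written down as a short sketch. The snippet below is only an assumed implementation over a per-moment probability array; the window length, the thresholds and the assumption that column 0 of P holds starting point probabilities and column 1 end point probabilities (as in the example above) are illustrative.

    import numpy as np

    def pick_candidate_moments(prob, threshold, half_window):
        """prob: (T,) starting point or end point probabilities, one value per moment.
        A moment is selected if its probability exceeds the threshold, or if it is the
        largest probability within a preset interval of +/- half_window moments."""
        picked = []
        for t in range(len(prob)):
            lo, hi = max(0, t - half_window), min(len(prob), t + half_window + 1)
            if prob[t] > threshold or prob[t] >= prob[lo:hi].max():
                picked.append(t)
        return picked

    # P is the (T, 3) probability matrix output by the second deep network.
    # starts = pick_candidate_moments(P[:, 0], 0.9, 10)
    # ends = pick_candidate_moments(P[:, 1], 0.9, 10)
    # candidates = [(s, e) for s in starts for e in ends if s < e]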
The second preset threshold and the third preset threshold in this embodiment may be adjusted according to different depth networks, and are not necessarily fixed and unique values, and the second preset threshold and the third preset threshold may be the same or different, and the description of this example does not limit the range thereof. Similarly, the first preset threshold may also be adjusted according to the difference of the depth network, and is not necessarily a fixed and unique value, and the first preset threshold may be the same as or different from the second preset threshold and/or the third preset threshold.
In a preferred embodiment, after determining the candidate interval included in the audio to be processed, the candidate interval may be further evaluated, and the start point probability, the end point probability and the interval probability may be used for evaluation, where the start point probability and the end point probability are respectively taken from corresponding elements of the probability matrix P, and the interval probability needs to be further modeled.
This embodiment further illustrates how to determine the candidate interval contained in the audio to be processed by using the starting point probability and the end point probability: the probability matrix gives, for each moment of the audio interval, the probability of being the start or end of the refrain; a moment is judged against a preset threshold, or the moment with the largest corresponding probability among all moments in the preset interval is taken as the start moment or end moment of the candidate interval, thereby determining the candidate interval contained in the audio to be processed. This judging process is more intelligent, requires no manual control, and is more accurate.
Fig. 6 is a flowchart illustrating an example of obtaining an interval probability of a candidate interval included in audio to be processed according to another embodiment of the present application. After the process of determining the candidate interval included in the audio to be processed according to the time sequence characteristics of the audio to be processed, steps S610 to S620 are performed, which are described in detail below:
S610: And determining the interval characteristics of the candidate intervals according to the time sequence characteristics of the audio to be processed.
In order to further evaluate the candidate interval, an interval feature needs to be constructed so as to obtain an interval probability, and the interval probability is used for evaluating the candidate interval. The interval feature here is a feature that can represent the entire audio interval within the candidate interval, and is highly correlated with the candidate interval, and in addition, the interval feature is an intermediate process parameter for obtaining the interval probability of the candidate interval.
S620: and inputting the interval characteristics into the third deep network to obtain the interval probability of the candidate interval output by the third deep network.
This embodiment constructs a deep network stacking Conv1D and FC layers, namely the third deep network in this embodiment. The spliced interval features are input into the third deep network, which outputs the intersection-over-union between the candidate interval and the real refrain interval, that is, the interval probability of this embodiment; the higher the interval probability, the more likely the interval is a complete refrain interval.
This embodiment illustrates how to obtain the interval probability of the candidate interval contained in the audio to be processed: the interval features of the candidate interval are determined by using the time sequence features, and the interval features are input into the third deep network to obtain the interval probability of the candidate interval. Determining the interval features of the candidate interval from the time sequence features increases the relevance of the preceding data, and the time sequence features serve as basic parameters from which the network produces the other intermediate parameters.
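A hedged Keras sketch of such a Conv1D-plus-FC network is given below; the layer widths, the pooling step and the single sigmoid output predicting an IoU-like score are assumptions made for illustration only.

    from tensorflow.keras import layers, models

    def build_third_deep_network(feat_len=96, feat_dim=1024):
        # Input: interval feature of a candidate interval, a fixed-length slice of the
        # T x D time sequence features (96 steps here is an assumed length).
        inp = layers.Input(shape=(feat_len, feat_dim))
        x = layers.Conv1D(128, 3, padding="same", activation="relu")(inp)
        x = layers.GlobalAveragePooling1D()(x)
        x = layers.Dense(64, activation="relu")(x)
        # Output: predicted overlap (IoU) with the real refrain interval, used as the
        # interval probability of the candidate interval.
        out = layers.Dense(1, activation="sigmoid")(x)
        return models.Model(inp, out)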
Fig. 7 is a flowchart illustrating another method for obtaining the interval probability of a candidate interval included in audio to be processed according to another exemplary embodiment of the present application. After the process of determining the candidate interval included in the audio to be processed according to the time sequence characteristics of the audio to be processed, steps S710 to S730 are performed, which are described in detail below:
S710: And respectively performing feature extraction on the candidate interval, the interval located near the starting point of the candidate interval and the interval located near the end point of the candidate interval to obtain corresponding candidate interval features, wherein the feature length of each candidate interval feature is the same.
The interval located near the start point of the candidate interval and the interval located near the end point of the candidate interval may be preset, so that the lengths of the two intervals are fixed. For example, after the candidate interval is determined and its start point and end point are obtained, an interval of the preset length may be selected automatically with the start point as its midpoint and used as the interval near the start point of the candidate interval, or an interval of the preset length may be selected with the start point located at 1/3 of its length; this is not limited here. The interval located near the end point of the candidate interval is obtained in the same way.
The present embodiment extracts features of a candidate interval, an interval located near the start point of the candidate interval, and an interval located near the end point of the candidate interval, to obtain a feature of a fixed length.
S720: and splicing the candidate interval features to obtain the interval features of the candidate intervals.
In a preferred embodiment, the candidate interval, the interval near its start point, and the interval near its end point may each be extracted as a feature of fixed length, which facilitates connection to the fully connected layer of the subsequent third deep network; the three fixed-length features are then spliced to obtain the interval feature of the candidate interval.
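Illustratively, the following sketch shows one possible way of building such an interval feature; the resampling strategy, window sizes and variable names are assumptions made for illustration rather than details given by the present application.

```python
# Sketch (assumed pooling strategy) of building a candidate-interval feature from
# three fixed-length pieces: the interval itself, a window around its start point,
# and a window around its end point, spliced along the time axis.
import numpy as np

def fixed_length(features, start, end, out_len=16):
    """Resample the frame features in [start, end) to a fixed number of frames."""
    idx = np.linspace(start, max(start, end - 1), out_len).round().astype(int)
    idx = np.clip(idx, 0, len(features) - 1)
    return features[idx]                                    # (out_len, feat_dim)

def interval_feature(features, start, end, ctx=8, out_len=16):
    inside = fixed_length(features, start, end, out_len)
    near_start = fixed_length(features, start - ctx, start + ctx, out_len)
    near_end = fixed_length(features, end - ctx, end + ctx, out_len)
    return np.concatenate([near_start, inside, near_end], axis=0)   # (3 * out_len, feat_dim)

frame_features = np.random.randn(1000, 128)                 # per-frame time-sequence features
print(interval_feature(frame_features, start=320, end=560).shape)   # (48, 128)
```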
S730: and inputting the interval characteristics into the third deep network to obtain the interval probability of the candidate interval output by the third deep network.
This step is similar to step S620 described above: the third deep network is formed by stacking Conv1D and FC layers, and the interval probability obtained in this step is used to evaluate the candidate interval.
This embodiment provides another way of obtaining the interval probability: feature extraction is performed on the candidate interval and on the regions near its start point and end point, the obtained fixed-length features are spliced into the interval feature of the candidate interval, and the interval feature is input into the third deep network to obtain the interval probability of the candidate interval. Because the features of the related intervals are extracted separately, the constructed interval feature connects more naturally to the fully connected layer of the third deep network, and the output intersection-over-union with the real refrain interval, i.e., the interval probability, is more accurate.
It should be noted that refrains A, B and C may exist in one song. If the candidate interval were determined only from the start point probability and the end point probability, its start point or end point could be wrong; for example, the start point of refrain B might be taken as the start of the candidate interval and the end point of refrain C as its end, even though a musical passage that does not belong to any refrain may lie between refrains B and C, so that the candidate interval is not an accurate refrain interval. Introducing the interval probability avoids this situation. In addition, heavily overlapping candidate intervals may appear during the subsequent screening of candidate intervals, and NMS (Non-Maximum Suppression) may be used for the screening, as sketched below.
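The application names NMS but does not detail it; the following is a standard one-dimensional NMS over scored candidate intervals, with the threshold and example values chosen for illustration.

```python
# Standard 1-D non-maximum suppression over scored candidate intervals:
# keep the highest-scoring interval and drop any remaining interval that
# overlaps it too strongly, then repeat.
def interval_iou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def nms(intervals, scores, iou_threshold=0.5):
    order = sorted(range(len(intervals)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(interval_iou(intervals[i], intervals[j]) < iou_threshold for j in kept):
            kept.append(i)
    return [intervals[i] for i in kept]

candidates = [(60.0, 95.0), (62.0, 96.0), (180.0, 215.0)]   # seconds
scores = [0.91, 0.78, 0.88]
print(nms(candidates, scores))                               # [(60.0, 95.0), (180.0, 215.0)]
```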
Secondly, a song may contain several refrain intervals, namely refrain A, refrain B and refrain C. If only the refrain interval of refrain A were determined according to the time sequence during recognition, the refrain intervals of refrains B and C might not be recognized afterwards, so that complete refrain intervals would not be determined for music with multiple refrains. Introducing the interval probability solves the technical problem that not all refrain intervals of the audio to be processed can be recognized.
Therefore, introducing the interval probability into the audio processing method not only avoids positioning errors at the start point of a candidate interval, but also ensures that all refrain intervals of the audio to be processed are recognized.
The method can be applied to artificial-intelligence audio processing devices and supports batch processing. After audio is input, the device automatically and accurately locates all refrain intervals of the music; the mAP (mean Average Precision) of the method reaches 88.6% with a timing tolerance of only 2 s, and the accuracy of the determined refrain start and end times reaches 93%, which is more accurate than the prior art.
The mAP is an important metric in the field of target detection in machine learning and is used to measure the performance of a target detection algorithm. Generally, the mean average precision (mAP) over all classes is obtained as a weighted average of the average precision (AP) of each detected class.
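As a minimal illustration of the averaging described above (the class name and AP value below are hypothetical, chosen only to mirror the figure quoted in this application):

```python
# mAP as a (possibly weighted) average of per-class average precision.
per_class_ap = {"refrain_interval": 0.886}           # single detection class in this task
mAP = sum(per_class_ap.values()) / len(per_class_ap)
print(f"mAP = {mAP:.1%}")                            # mAP = 88.6%
```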
Another aspect of the present application further provides an audio processing apparatus, as shown in fig. 8, fig. 8 is a schematic structural diagram of the audio processing apparatus shown in an exemplary embodiment of the present application. Wherein, the audio processing apparatus includes:
the determining module 810 is configured to determine a candidate interval included in the audio to be processed according to the time sequence characteristics of the audio to be processed;
a calculation positioning module 820 configured to calculate a matching probability of the candidate interval based on the start point probability, the end point probability and the interval probability corresponding to the candidate interval; and if the matching probability is greater than a first preset threshold, determine the candidate interval as a refrain interval contained in the audio to be processed.
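The application does not fix the exact formula combining the three probabilities; the following sketch uses a simple product as one plausible combination, with the threshold and probability values assumed for illustration.

```python
# One plausible combination of start point, end point and interval probabilities
# into a matching probability (the product below is an assumption, not a formula
# given by the present application).
def matching_probability(start_prob, end_prob, interval_prob):
    return start_prob * end_prob * interval_prob

FIRST_PRESET_THRESHOLD = 0.5                          # assumed value

candidate = {"start": 60.0, "end": 95.0}              # seconds
p = matching_probability(0.92, 0.88, 0.90)
if p > FIRST_PRESET_THRESHOLD:
    print(f"refrain interval {candidate}, matching probability {p:.3f}")
```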
In a preferred embodiment, the interface returns the refrain interval information of the music to other ports in Json form, a lightweight data-exchange format that is convenient and fast. Json is mentioned here only as an example of a data-exchange format for returning the refrain interval information; it does not mean that the apparatus of this embodiment can only return the refrain interval information in this way, and the data-exchange format is not limited here.
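Illustratively, such a Json response could look as follows; the field names and values are assumptions for illustration, not a format defined by the present application.

```python
# Illustrative only: serializing detected refrain intervals as Json.
import json

result = {
    "audio_id": "example-song-001",
    "refrain_intervals": [
        {"start": 60.0, "end": 95.0, "matching_probability": 0.93},
        {"start": 180.0, "end": 215.0, "matching_probability": 0.91},
    ],
}
print(json.dumps(result, indent=2))
```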
In another embodiment, the audio processing apparatus further comprises:
the subtask module is configured to construct a first deep network, and the first deep network is formed by sequentially stacking a convolutional neural network layer, a multi-layer long-short-term LSTM network and a full-connection network layer; inputting audio to be processed into a first deep network; and extracting deep features from the first deep network as time sequence features of the audio to be processed.
In addition, the subtask module can classify the audio attributes and output audio attribute information.
In another embodiment, the subtask module further includes:
and the extraction unit is configured to extract the characteristics output by the last LSTM network layer as the time sequence characteristics of the audio to be processed.
In another embodiment, the determining module 810 includes:
the time sequence modeling unit is configured to input the time sequence characteristics of the audio to be processed into the second deep network, and obtain the refrain position probabilities of all moments contained in the audio to be processed output by the second deep network, wherein the refrain position probabilities comprise a starting point probability and an end point probability;
and the candidate interval construction unit is configured to determine a candidate interval from the audio to be processed according to the obtained starting point probability and the end point probability.
In another embodiment, the candidate interval construction unit is configured to select a first moment as the start moment of the candidate interval if the start point probability corresponding to the first moment of the audio to be processed is greater than a second preset threshold or greater than the start point probabilities of all moments within a preset interval around it; and to select a second moment as the end moment of the candidate interval if the end point probability corresponding to the second moment of the audio to be processed is greater than a third preset threshold or greater than the end point probabilities of all moments within a preset interval around it.
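Illustratively, the following sketch implements this selection rule and pairs the selected moments into candidate intervals; the threshold values, window size and pairing strategy are assumptions made for illustration.

```python
# Sketch of candidate-interval construction: a moment is kept as a start (or end)
# if its probability exceeds a preset threshold or is the maximum within a preset
# window, and every kept start is paired with every later kept end.
import numpy as np

def select_moments(probs, threshold=0.5, window=5):
    keep = []
    for t, p in enumerate(probs):
        lo, hi = max(0, t - window), min(len(probs), t + window + 1)
        if p > threshold or p >= probs[lo:hi].max():
            keep.append(t)
    return keep

def build_candidates(start_probs, end_probs):
    starts = select_moments(start_probs)
    ends = select_moments(end_probs)
    return [(s, e) for s in starts for e in ends if e > s]

start_probs = np.random.rand(200)                    # per-moment start point probabilities
end_probs = np.random.rand(200)                      # per-moment end point probabilities
print(len(build_candidates(start_probs, end_probs)))
```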
In another embodiment, the calculation positioning module 820 includes:
the interval feature extraction unit is configured to determine interval features of candidate intervals according to the time sequence features of the audio to be processed;
and the interval probability extraction unit is configured to input the interval characteristics into the third deep network to obtain the interval probability of the candidate interval output by the third deep network.
In another embodiment, the calculation positioning module 820 includes:
the interval feature extraction unit is configured to respectively perform feature extraction on the candidate interval, the interval located near the starting point of the candidate interval and the interval located near the end point of the candidate interval to obtain corresponding candidate interval features, wherein the feature length of each candidate interval feature is the same; splicing the candidate interval features to obtain interval features of the candidate intervals;
and the interval probability extraction unit is configured to input the interval characteristics into the third deep network to obtain the interval probability of the candidate interval output by the third deep network.
Fig. 9 is a flowchart illustrating audio processing performed by the audio processing apparatus according to an exemplary embodiment of the present application. The description here does not limit the module structure of the audio processing apparatus of the present application; the modules may be divided or combined according to the actual situation in order to implement the audio processing method of the present application.
The audio processing apparatus of the present embodiment includes:
a subtask module: a deep learning network is constructed, local audio attribute classification is carried out on input music, and a subtask module is mainly used for extracting time sequence characteristics in the embodiment of the application.
Illustratively, the URL (Uniform Resource Locator) of a video or audio file is input, and its audio track is extracted as the input music of the algorithm to obtain the audio to be processed; the audio to be processed is input into the first deep network, and the feature output by the last LSTM layer is extracted from the first deep network as the time-sequence feature of the audio to be processed.
A time sequence modeling module: processing the obtained time sequence characteristics in a second deep network to obtain the probability of the position of the refrain at each moment contained in the audio to be processed output by the second deep network, wherein the probability of the position of the refrain comprises a starting point probability and an end point probability, and is similar to the time sequence modeling unit in the determining module; and determining a candidate interval from the audio to be processed according to the obtained starting point probability and the end point probability, namely determining the candidate interval contained in the audio to be processed, which is similar to the candidate interval construction unit in the determination module.
The time sequence modeling module of the present embodiment is equivalent to the determining module described above, and is configured to determine the candidate intervals included in the audio to be processed according to the time sequence characteristics of the audio to be processed.
A candidate interval evaluation module: selecting the candidate intervals with higher evaluation scores as the refrain intervals.
The candidate interval evaluation module of the present embodiment is equivalent to the above-mentioned calculation positioning module, and is configured to calculate the matching probability of the candidate interval based on the start point probability, the end point probability and the interval probability corresponding to the candidate interval; if the matching probability is greater than a first preset threshold value, determining that the candidate interval is a refrain interval contained in the audio to be processed, and finally returning information of the audio refrain interval to the client side in a Json (JavaScript Object Notation) form.
In particular, the extraction of the time-sequence feature is an independent task: the audio to be processed is input into the first deep network in the subtask module, and deep features are extracted from the first deep network as the time-sequence feature of the audio to be processed; the time-sequence feature is then input into the second deep network in the determining module to obtain the refrain position probability of each moment contained in the audio to be processed; subsequently, the third deep network in the calculation positioning module takes interval features derived from the time-sequence feature as input and outputs dense interval probabilities, so that the complete refrain intervals can be accurately determined afterwards. Three deep networks are used for modeling and learning in the audio processing procedure, which makes the matching probability more accurate.
It should be noted that the audio processing apparatus provided in the foregoing embodiment and the audio processing method provided in the foregoing embodiment belong to the same concept, and specific ways for the modules and units to perform operations have been described in detail in the method embodiment, and are not described herein again.
Another aspect of the present application also provides an audio processing apparatus, including: receiving means for receiving audio to be processed;
and the controller is electrically connected with the receiving device to execute the audio processing method in the above embodiments.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a computer system of an audio processing device according to an exemplary embodiment of the present application, which shows a schematic structural diagram of a computer system suitable for implementing the audio processing device according to the embodiment of the present application.
It should be noted that the computer system 1000 of the audio processing device shown in fig. 10 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 10, the computer system 1000 includes a Central Processing Unit (CPU)1001 that can perform various appropriate actions and processes, such as performing the methods in the above-described embodiments, according to a program stored in a Read-Only Memory (ROM) 1002 or a program loaded from a storage portion 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data necessary for system operation are also stored. The CPU 1001, ROM 1002, and RAM 1003 are connected to each other via a bus 1004. An Input/Output (I/O) interface 1005 is also connected to the bus 1004.
The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage portion 1008 including a hard disk and the like; and a communication section 1009 including a Network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the internet. The driver 1010 is also connected to the I/O interface 1005 as necessary. A removable medium 1011 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1010 as necessary, so that a computer program read out therefrom is mounted into the storage section 1008 as necessary.
In particular, according to embodiments of the application, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising a computer program for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication part 1009 and/or installed from the removable medium 1011. When the computer program is executed by a Central Processing Unit (CPU)1001, various functions defined in the system of the present application are executed.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. The computer readable storage medium may be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with a computer program embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. The computer program embodied on the computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software, or may be implemented by hardware, and the described units may also be disposed in a processor. Wherein the names of the elements do not in some way constitute a limitation on the elements themselves.
Another aspect of the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the audio processing method as before. The computer-readable storage medium may be included in the audio processing apparatus described in the above embodiments, or may exist separately without being assembled into the audio processing apparatus.
Another aspect of the application also provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the audio processing method provided in the above-described embodiments.
According to an aspect of an embodiment of the present application, there is also provided a computer system including a Central Processing Unit (CPU) that can perform various appropriate actions and processes, such as performing the method in the above-described embodiment, according to a program stored in a Read-Only Memory (ROM) or a program loaded from a storage portion into a Random Access Memory (RAM). In the RAM, various programs and data necessary for system operation are also stored. The CPU, ROM, and RAM are connected to each other via a bus. An Input/Output (I/O) interface is also connected to the bus.
The following components are connected to the I/O interface: an input section including a keyboard, a mouse, and the like; an output section including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like; a storage section including a hard disk and the like; and a communication section including a Network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section performs communication processing via a network such as the internet. The drive is also connected to the I/O interface as needed. A removable medium such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive as necessary, so that a computer program read out therefrom is mounted into the storage section as necessary.
The above description is only a preferred exemplary embodiment of the present application, and is not intended to limit the embodiments of the present application, and those skilled in the art can easily make various changes and modifications according to the main concept and spirit of the present application, so that the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. An audio processing method, comprising:
determining a candidate interval contained in the audio to be processed according to the time sequence characteristics of the audio to be processed;
calculating the matching probability of the candidate interval based on the starting point probability, the end point probability and the interval probability corresponding to the candidate interval;
and if the matching probability is greater than a first preset threshold value, determining the candidate interval as a refrain interval contained in the audio to be processed.
2. The method according to claim 1, wherein before determining the candidate interval included in the audio to be processed according to the time-sequence characteristics of the audio to be processed, the method comprises:
constructing a first depth network, wherein the first depth network is formed by sequentially stacking a convolutional neural network layer, a multi-layer long and short term LSTM network and a fully-connected network layer;
inputting audio to be processed into the first deep network;
and extracting deep features from the first deep network as time sequence features of the audio to be processed.
3. The method of claim 2, wherein the extracting deep features from the first deep network as temporal features of the audio to be processed comprises:
and extracting the characteristics output by the last LSTM network layer as the time sequence characteristics of the audio to be processed.
4. The method according to claim 1, wherein the determining the candidate interval included in the audio to be processed according to the time-sequence characteristics of the audio to be processed comprises:
inputting the time sequence characteristics of the audio to be processed into a second deep network, and obtaining the refrain position probability of each moment contained in the audio to be processed output by the second deep network, wherein the refrain position probability comprises a starting point probability and an end point probability;
and determining a candidate interval from the audio to be processed according to the obtained starting point probability and the end point probability.
5. The method according to claim 4, wherein the determining the candidate interval from the audio to be processed according to the obtained starting point probability and end point probability comprises:
if the starting point probability corresponding to the first moment of the audio to be processed is greater than a second preset threshold value or the starting point probabilities of all moments in a preset interval, selecting the first moment as the starting moment of the candidate interval;
and if the probability of the end point corresponding to the second moment of the audio to be processed is greater than a third preset threshold value or the probability of the end point at all the moments in a preset interval, selecting the second moment as the end moment of the candidate interval.
6. The method according to any one of claims 1-5, further comprising:
determining interval characteristics of the candidate intervals according to the time sequence characteristics of the audio to be processed;
and inputting the interval characteristics into a third deep network to obtain the interval probability of the candidate interval output by the third deep network.
7. The method according to any one of claims 1-5, further comprising:
respectively extracting features of the candidate interval, the interval near the starting point of the candidate interval and the interval near the end point of the candidate interval to obtain corresponding candidate interval features, wherein the feature length of each candidate interval feature is the same;
splicing the candidate interval features to obtain the interval features of the candidate intervals;
and inputting the interval characteristics into a third deep network to obtain the interval probability of the candidate interval output by the third deep network.
8. An audio processing apparatus, comprising:
the determining module is configured to determine a candidate interval contained in the audio to be processed according to the time sequence characteristics of the audio to be processed;
the calculation positioning module is configured to calculate the matching probability of the candidate interval based on the starting point probability, the end point probability and the interval probability corresponding to the candidate interval; and if the matching probability is greater than a first preset threshold value, determining the candidate interval as a refrain interval contained in the audio to be processed.
9. An audio processing device, comprising:
receiving means for receiving audio to be processed;
a controller electrically connected to the receiving device to perform the method of any one of claims 1-7.
10. A computer-readable storage medium having stored thereon computer-readable instructions which, when executed by a processor of a computer, cause the computer to perform the method of any one of claims 1 to 7.
CN202210082760.6A 2022-01-24 2022-01-24 Audio processing method and device, equipment and computer readable storage medium Pending CN114420075A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210082760.6A CN114420075A (en) 2022-01-24 2022-01-24 Audio processing method and device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN114420075A true CN114420075A (en) 2022-04-29

Family

ID=81276857

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210082760.6A Pending CN114420075A (en) 2022-01-24 2022-01-24 Audio processing method and device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114420075A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100286293B1 (en) * 1994-03-24 2001-04-16 구자홍 Method for extracting important word of interactive voice
KR20060057919A (en) * 2004-11-24 2006-05-29 한국전자통신연구원 Voice 2 stage end-point detection apparatus for automatic voice recognition system and method therefor
US20100246842A1 (en) * 2008-12-05 2010-09-30 Yoshiyuki Kobayashi Information processing apparatus, melody line extraction method, bass line extraction method, and program
JP2011191752A (en) * 2010-02-16 2011-09-29 Gifu Service Kk Grammar generation support program for speech recognition
CN103824555A (en) * 2012-11-19 2014-05-28 腾讯科技(深圳)有限公司 Audio band extraction method and extraction device
CN106409311A (en) * 2015-07-31 2017-02-15 阿里巴巴集团控股有限公司 Refrain extracting apparatus and method
WO2021184852A1 (en) * 2020-03-16 2021-09-23 平安科技(深圳)有限公司 Action region extraction method, device and apparatus, and computer-readable storage medium
CN113573161A (en) * 2021-09-22 2021-10-29 腾讯科技(深圳)有限公司 Multimedia data processing method, device, equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294947A (en) * 2022-07-29 2022-11-04 腾讯科技(深圳)有限公司 Audio data processing method and device, electronic equipment and medium
CN115294947B (en) * 2022-07-29 2024-06-11 腾讯科技(深圳)有限公司 Audio data processing method, device, electronic equipment and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination