CN111681679A - Video object sound effect searching and matching method, system and device and readable storage medium - Google Patents

Info

Publication number
CN111681679A
CN111681679A (application CN202010518575.8A)
Authority
CN
China
Prior art keywords
audio
video
matching score
matching
specific sound
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010518575.8A
Other languages
Chinese (zh)
Other versions
CN111681679B (en)
Inventor
薛媛
金若熙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Xinghe Shangshi Film Media Co ltd
Original Assignee
Hangzhou Xinghe Shangshi Film Media Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Xinghe Shangshi Film Media Co ltd filed Critical Hangzhou Xinghe Shangshi Film Media Co ltd
Priority to CN202010518575.8A priority Critical patent/CN111681679B/en
Publication of CN111681679A publication Critical patent/CN111681679A/en
Application granted granted Critical
Publication of CN111681679B publication Critical patent/CN111681679B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/54 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for retrieval
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a video object sound effect searching and matching method, which comprises the following steps: based on a video to be processed, acquiring the category of a specific sound-producing object and constructing the audio of the specific sound-producing object; processing the category of the specific sound-producing object in the video to be processed and the audio introduction of the audio to obtain a first matching score; obtaining the BERT vector of the object category of the specific sound-producing object and the BERT vector of the audio introduction, computing the cosine similarity between the two BERT vectors, and taking the cosine similarity as a neural network matching score; obtaining a video-audio matching score based on the first matching score and the neural network matching score; and selecting, according to the video-audio matching scores, the audios corresponding to the top several matching scores as the matched audios of the specific sound-producing object. Because sound effects are searched and matched to video objects automatically, a dubbing engineer no longer needs to produce special-effect dubbing when the video is dubbed; the sound effects are generated and matched into the corresponding video directly and automatically, which is convenient, fast and highly accurate.

Description

Video object sound effect searching and matching method, system and device and readable storage medium
Technical Field
The invention relates to the technical field of video processing, in particular to a method, a system and a device for searching and matching sound effects of video objects and a readable storage medium.
Background
At present, with the development of science and technology, multimedia audio and video technology is widely applied in many fields. Matching sound effects to specific sound-producing objects in a video gives the audience a better experience and helps their understanding of the content, so how to make a good video matters more and more.
In existing video processing technology, clipping, special effects, subtitles and the addition of audio material are performed separately. For example, when sound is added to a video, the video is recorded first and then dubbed, or a person makes the sound on site so that it is recorded directly into the video. Sounds other than the characters' voices, however, are difficult to match; the sound parts that cannot be captured on the shooting site, such as footsteps, doors opening and closing, or water being poured, are currently produced in post-production by a sound-imitation (Foley) engineer and then matched into the video.
The traditional way of matching sound effects to specific objects in a video is slow and inaccurate, and synchronizing the video with the various sounds is a complex operation, so the workload is heavy, a large amount of time is needed, and the workflow is extremely inflexible.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a method, a system and a device for searching and matching sound effects of video objects and a readable storage medium.
In order to solve the technical problem, the invention is solved by the following technical scheme:
a method for searching and matching sound effects of video objects comprises the following steps:
based on a video to be processed, acquiring the category of a specific sound-producing object and constructing the audio of the specific sound-producing object;
processing the category of the specific sound-producing object in the video to be processed and the audio introduction of the audio to obtain a first matching score;
obtaining the BERT vector of the object category of the specific sound-producing object and the BERT vector of the audio introduction, computing the cosine similarity between the two BERT vectors, and taking the cosine similarity as a neural network matching score;
obtaining a video-audio matching score based on the first matching score and the neural network matching score;
and selecting, according to the video-audio matching scores, the audios corresponding to the top several matching scores as the matched audios of the specific sound-producing object.
As one possible implementation, the audio is parsed into an audio introduction and audio keywords, wherein the audio introduction is a text introducing the content of the audio, and the audio keywords comprise at least three words describing the audio, including the category name of the specific sound-producing object and the category name of the sound it produces.
As an implementation manner, processing the category of the specific sound-producing object in the video to be processed and the audio introduction to obtain the first matching score specifically comprises:
performing word segmentation on the object category of the specific sound-producing object and on the audio introduction to obtain words;
respectively obtaining the proportions of words of the object category of the specific sound-producing object that coincide with the audio introduction and with the audio keywords, to obtain a first proportion and a second proportion, and performing weighted averaging on the first proportion and the second proportion to obtain a word matching score, wherein word matching score = (word-coincidence proportion between the object category and the audio introduction) × audio introduction weight + (word-coincidence proportion between the object category and the audio keywords) × audio keyword weight, and audio introduction weight + audio keyword weight = 1;
obtaining an object-category TF-IDF vector based on the statistics of the audio introductions, and taking the first cosine similarity between the object-category TF-IDF vector and the audio-introduction TF-IDF vector as a TF-IDF matching score, wherein TF-IDF matching score = cosine_similarity(object-category TF-IDF vector, audio-introduction TF-IDF vector);
and performing weighted averaging on the word matching score and the TF-IDF matching score to obtain the first matching score, wherein first matching score = word matching score × word weight + TF-IDF matching score × TF-IDF weight, and word weight + TF-IDF weight = 1.
As an implementation manner, obtaining the video-audio matching score based on the first matching score and the neural network matching score specifically comprises:
performing weighted averaging on the first matching score and the neural network matching score to obtain the video-audio matching score, wherein video-audio matching score = first matching score × first weight + neural network matching score × neural network weight, and first weight + neural network weight = 1.
As an implementation manner, acquiring the category of the specific sound-producing object and constructing the audio of the specific sound-producing object based on the video to be processed specifically comprises:
based on the video to be processed, extracting video key frames at a reduced frequency and performing a preliminary recognition and analysis to obtain modularized specific sound-producing objects;
performing multi-stage recognition and analysis on the modularized specific sound-producing objects through a depth residual network model to obtain the category of each specific sound-producing object and extract its sound-producing characteristics;
and constructing the object category of the specific sound-producing object and the audio of the specific sound-producing object based on the sound-producing characteristics.
As an implementation manner, after the step of selecting, according to the video-audio matching scores, the audios corresponding to the top several matching scores as the matched audios of the specific sound-producing object, the method further comprises the following step:
mixing all the audios into a complete audio file, and adding the audio file to the audio track of the video so that the audio file and the video are synchronized.
A video object sound effect searching and matching system comprises an acquisition and construction module, a first processing module, a second processing module, a matching score acquisition module and a selection and matching module;
the acquisition and construction module is used for acquiring the category of a specific sound-producing object and constructing the audio of the specific sound-producing object based on the video to be processed;
the first processing module is used for processing the category of the specific sound-producing object in the video to be processed and the audio introduction of the audio to obtain a first matching score;
the second processing module is used for obtaining the BERT vector of the object category of the specific sound-producing object and the BERT vector of the audio introduction, computing the cosine similarity between the two BERT vectors, and taking the cosine similarity as a neural network matching score;
the matching score obtaining module is used for obtaining a video-audio matching score based on the first matching score and the neural network matching score;
and the selection and matching module is used for selecting, according to the video-audio matching scores, the audios corresponding to the top several matching scores as the matched audios of the specific sound-producing object.
As an implementation manner, the system further comprises a mixing processing module, configured to mix all the audios into a complete audio file and add the audio file to the audio track of the video so that the audio file and the video are synchronized.
A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the following method steps:
based on a video to be processed, acquiring the category of a specific sound-producing object and constructing the audio of the specific sound-producing object;
processing the category of the specific sound-producing object in the video to be processed and the audio introduction of the audio to obtain a first matching score;
obtaining the BERT vector of the object category of the specific sound-producing object and the BERT vector of the audio introduction, computing the cosine similarity between the two BERT vectors, and taking the cosine similarity as a neural network matching score;
obtaining a video-audio matching score based on the first matching score and the neural network matching score;
and selecting, according to the video-audio matching scores, the audios corresponding to the top several matching scores as the matched audios of the specific sound-producing object.
A searching and matching device for audio effects of video objects, comprising a memory, a processor and a computer program stored in the memory and operable on the processor, wherein the processor executes the computer program to perform the following method steps:
based on a video to be processed, acquiring the category of a specific sound-producing object and constructing the audio of the specific sound-producing object;
processing the category of the specific sound-producing object in the video to be processed and the audio introduction of the audio to obtain a first matching score;
obtaining the BERT vector of the object category of the specific sound-producing object and the BERT vector of the audio introduction, computing the cosine similarity between the two BERT vectors, and taking the cosine similarity as a neural network matching score;
obtaining a video-audio matching score based on the first matching score and the neural network matching score;
and selecting, according to the video-audio matching scores, the audios corresponding to the top several matching scores as the matched audios of the specific sound-producing object.
Due to the adoption of the technical scheme, the invention has the remarkable technical effects that:
the invention discloses a method, a system and a device for searching and matching sound effects of video objects and a readable storage medium, wherein the method, the system and the device are used for acquiring the category of a specific sound-producing object and constructing the audio frequency of the specific sound-producing object based on a video to be processed; processing the category of a specific sound-producing object in the video to be processed and the audio introduction of the audio to obtain a first matching score; obtaining a BERT vector of an object type of a specific sound production object and a BERT vector introduced by an audio frequency, further obtaining cosine similarity of the BERT vector, and taking the cosine similarity as a neural network matching score; obtaining a video and audio matching score based on the first matching score and the neural network matching score; and selecting the audio corresponding to a plurality of matching scores as the matched audio of the specific sound-producing object according to the audio matching scores. Through searching and matching of the sound effect of the video object, special-effect dubbing is not needed by a dubbing engineer when dubbing is given to the video, the sound effect can be directly and automatically generated and matched into the corresponding video, convenience and rapidness are realized, and the accuracy is high.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic flow diagram of the process of the present invention;
fig. 2 is a schematic diagram of the system of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples, which are illustrative of the present invention and are not to be construed as being limited thereto.
Example 1:
a method for searching and matching sound effect of video object, as shown in FIG. 1, includes the following steps:
S100, based on the video to be processed, acquiring the category of a specific sound-producing object and constructing the audio of the specific sound-producing object;
S200, processing the category of the specific sound-producing object in the video to be processed and the audio introduction of the audio to obtain a first matching score;
S300, obtaining the BERT vector of the object category of the specific sound-producing object and the BERT vector of the audio introduction, computing the cosine similarity between the two BERT vectors, and taking the cosine similarity as a neural network matching score;
S400, obtaining a video-audio matching score based on the first matching score and the neural network matching score;
S500, selecting, according to the video-audio matching scores, the audios corresponding to the top several matching scores as the matched audios of the specific sound-producing object.
In step S100, the audio is parsed into an audio introduction and audio keywords, wherein the audio introduction is a text introducing the content of the audio, and the audio keywords comprise at least three words describing the audio, including the category name of the specific sound-producing object and the category name of the sound it produces. Specifically, in order to associate the object category of the specific sound-producing object with the audio of the specific sound-producing object, this embodiment uses natural language as the intermediate representation for matching the object category of the specific sound-producing object of the video to be processed with the audio. Using natural language as the matching representation makes the representation easy for people to understand and annotate, and makes the audio library easy to organize and maintain.
For the video to be processed, the object category identified by the video understanding module is represented as natural language (e.g. "cat"). For the audio, two types of natural language annotation are used: an audio introduction and audio keywords, i.e. each audio carries an audio introduction and audio keywords. The audio introduction describes the content of the audio with a sentence or phrase (e.g. "a person's footsteps walking on snow"), while the audio keywords describe the content of the audio with three key words (e.g. "shoes / snow / footsteps"). Unlike the audio introduction, the audio keywords must include the sound-producing object and the category of the produced sound; in short, introducing audio keywords bridges the mismatch between the object recognition categories and the sound introductions. A possible record layout is sketched below.
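The natural-language annotation described above can be pictured as one small record per library entry. The following sketch is illustrative only; the field names and the file name are assumptions, not something specified by the patent.

```python
# Hypothetical layout of one sound-effect library entry and one recognized video object.
audio_entry = {
    "file": "footsteps_snow_01.wav",                         # assumed file name
    "introduction": "a person's footsteps walking on snow",  # sentence/phrase description
    "keywords": ["shoes", "snow", "footsteps"],              # must name object and sound category
}
video_object = {
    "category": "sports shoes",  # natural-language class from the video understanding module
    "start_s": 3.0,              # sounding interval in seconds
    "end_s": 8.0,
}
```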
For a specific sound-producing object, the class name output by object recognition is used directly as its natural language representation. Since a computer cannot work with natural language directly, the natural language representation is further mapped to a vector representation. Specifically, this embodiment introduces two vector representations of natural language: TF-IDF (Term Frequency - Inverse Document Frequency) and BERT (Bidirectional Encoder Representations from Transformers).
In a specific embodiment, the TF-IDF vector is computed from the audio introduction text; it indicates how much each word in a piece of text contributes to the semantics of the whole text. The computation is as follows: first, the audio introductions of all audios are segmented into Chinese words with the "jieba" tokenizer; then the term frequency TF of each word within each audio introduction and the document frequency DF of each word over the set of all audio introductions are calculated; for an audio introduction, the TF-IDF of any word in it can then be computed as TF-IDF = TF × log(1/DF + 1). Note that this formula is a normalized TF-IDF, used to keep the values stable. Finally, for any piece of text, its TF-IDF vector is computed: all words in the text corpus are ordered, the TF-IDF value of each word within the piece of text is computed in that order, and if the piece of text does not contain the word its TF-IDF value is taken as 0. The result is a vector whose length equals the vocabulary size of the text corpus, i.e. the TF-IDF vector representation of the piece of text.
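A minimal sketch of the normalized TF-IDF computation described above, assuming the audio introductions have already been segmented into word lists (for example with jieba); the function name and the per-document normalization of TF are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vector(tokens, corpus, vocabulary):
    """TF-IDF vector of one word list against a corpus of word lists.

    Uses TF-IDF = TF * log(1/DF + 1); words absent from `tokens` get 0.
    """
    tf = Counter(tokens)
    total = sum(tf.values()) or 1
    n_docs = len(corpus)
    vec = []
    for word in vocabulary:                  # fixed vocabulary order
        if tf[word] == 0:
            vec.append(0.0)
            continue
        df = sum(1 for doc in corpus if word in doc) / n_docs   # document frequency
        vec.append((tf[word] / total) * math.log(1.0 / df + 1.0))
    return vec
```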
Further, a BERT vector is computed. The BERT used in this embodiment is a Transformer neural network whose parameters are trained by large-scale unsupervised learning; the resulting model can be applied directly to downstream natural language understanding problems and maps natural-language sentences and phrases directly to vectors. This embodiment combines the two vector representations (together with simple word matching), which makes the result more accurate.
This embodiment computes the BERT vector representation of a sentence using the pre-trained Chinese BERT model in the pytorch_pretrained_bert package of the PyTorch ecosystem. To keep matching efficient, the smallest BERT model, "bert-base-chinese", is used. Specifically, a sentence is split into individual characters, the tokens "[CLS]" and "[SEP]" are added as the first and last token respectively to form the input indexed_tokens, an all-zero list of the same length as indexed_tokens is used as the input segment_ids, both inputs are fed into the pre-trained BERT model, and the output vector of the last neural network layer at the position of the first token ("[CLS]") is taken as the BERT vector of the sentence.
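A hedged sketch of the sentence-vector computation described above, using the pytorch_pretrained_bert package and the "bert-base-chinese" model; exact tokenization details may differ from the embodiment.

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def bert_sentence_vector(sentence: str) -> torch.Tensor:
    """[CLS] output of the last encoder layer, used as the sentence vector."""
    tokens = ["[CLS]"] + tokenizer.tokenize(sentence) + ["[SEP]"]
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokens)
    segment_ids = [0] * len(indexed_tokens)        # all-zero segment ids
    with torch.no_grad():
        layers, _ = model(torch.tensor([indexed_tokens]),
                          torch.tensor([segment_ids]))
    return layers[-1][0, 0]                        # last layer, first token ([CLS])

def cosine_sim(a: torch.Tensor, b: torch.Tensor) -> float:
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()
```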
The audio-video matching process matches the object categories identified in the video against the audio introductions and audio keywords. A suitable audio is selected according to the computed matching score. The matching score is computed in two ways, one traditional and one based on a neural network. The advantage of the traditional method is that when the natural language expressions of the audio and the video share the same words, the score can be computed precisely; the advantage of the neural network matching score is that even when the two natural language expressions have no word overlap, they can still be matched semantically. Using and combining the scores of both methods at the same time makes the two approaches complementary.
In one embodiment, in step S200, processing the category of the specific sound-producing object in the video to be processed and the audio introduction to obtain the first matching score comprises the following specific steps:
S210, performing word segmentation on the object category of the specific sound-producing object and on the audio introduction to obtain words;
S220, respectively obtaining the proportions of words of the object category of the specific sound-producing object that coincide with the audio introduction and with the audio keywords, to obtain a first proportion and a second proportion, and performing weighted averaging on the first proportion and the second proportion to obtain a word matching score, wherein word matching score = (word-coincidence proportion between the object category and the audio introduction) × audio introduction weight + (word-coincidence proportion between the object category and the audio keywords) × audio keyword weight, and audio introduction weight + audio keyword weight = 1;
S230, obtaining an object-category TF-IDF vector based on the statistics of the audio introductions, and taking the first cosine similarity between the object-category TF-IDF vector and the audio-introduction TF-IDF vector as a TF-IDF matching score, wherein TF-IDF matching score = cosine_similarity(object-category TF-IDF vector, audio-introduction TF-IDF vector);
S240, performing weighted averaging on the word matching score and the TF-IDF matching score to obtain the first matching score, wherein first matching score = word matching score × word weight + TF-IDF matching score × TF-IDF weight, and word weight + TF-IDF weight = 1.
In steps S210 to S240 the matching score is obtained by the traditional method: the object category of the specific sound-producing object and the sound introduction are segmented with the jieba tokenizer. The proportions of words of the object category that coincide with the sound introduction and with the sound keywords are then computed, and the two proportions are weighted and averaged as the word matching score. The TF-IDF vector representation of the object category of the specific sound-producing object is obtained from the statistics of the sound introduction texts. The cosine similarity between the object TF-IDF vector and the sound-introduction TF-IDF vector is then computed as the TF-IDF matching score, and the word matching score and the TF-IDF matching score are weighted and averaged to obtain the matching score of the traditional method, i.e. the first matching score of this step; a sketch is given below. Of course, in other embodiments other techniques may be used to obtain the first matching score, which are not described again here.
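A minimal sketch of the traditional (first) matching score, under the assumptions that the word-coincidence proportion is taken relative to the number of object-category words and that each weight pair defaults to 0.5; none of these constants are fixed by the patent.

```python
def word_match_score(obj_words, intro_words, keywords, w_intro=0.5, w_kw=0.5):
    """Weighted average of two word-coincidence proportions (w_intro + w_kw = 1)."""
    obj = set(obj_words)
    denom = max(len(obj), 1)                            # assumed denominator
    ratio_intro = len(obj & set(intro_words)) / denom   # first proportion
    ratio_kw = len(obj & set(keywords)) / denom         # second proportion
    return ratio_intro * w_intro + ratio_kw * w_kw

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def first_matching_score(word_score, tfidf_score, w_word=0.5, w_tfidf=0.5):
    """first matching score = word score * word weight + TF-IDF score * TF-IDF weight."""
    return word_score * w_word + tfidf_score * w_tfidf
```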
Step S300 obtains the BERT vector of the object category of the specific sound-producing object and the BERT vector of the audio introduction, and then computes the cosine similarity between the two BERT vectors as the neural network matching score; this is the neural network matching method.
In one embodiment, step S400, obtaining the video-audio matching score based on the first matching score and the neural network matching score, is implemented as follows: the first matching score and the neural network matching score are weighted and averaged to obtain the video-audio matching score, wherein video-audio matching score = first matching score × first weight + neural network matching score × neural network weight, and first weight + neural network weight = 1. Specifically, for each identified object, the 10 best-matching audios may be selected as dubbing recommendations according to the final matching score, although other numbers are also possible. In practice the weights of the weighted average can be adjusted as needed: if the name of the object category of the specific sound-producing object should appear literally in the audio introduction or keywords, the weight of the traditional matching score can be increased to improve precision; if the name of the object category need not appear in the audio introduction or keywords but should carry the same meaning, the weight of the neural network matching score can be increased to improve generalization. A sketch of the combination and selection follows.
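The weighted combination and top-10 selection can be sketched as follows; the precomputed per-audio scores and the default weights are illustrative assumptions.

```python
def video_audio_score(first_score, nn_score, w_first=0.5, w_nn=0.5):
    # w_first + w_nn = 1: raise w_first for literal word matches,
    # raise w_nn for purely semantic matches, as discussed above.
    return first_score * w_first + nn_score * w_nn

def recommend_audios(scored_library, top_k=10):
    """scored_library: iterable of (first_score, nn_score, audio_entry) triples."""
    ranked = sorted(scored_library,
                    key=lambda t: video_audio_score(t[0], t[1]),
                    reverse=True)
    return [entry for _, _, entry in ranked[:top_k]]     # 10 best-matching audios
```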
In an embodiment, step S100, acquiring the category of the specific sound-producing object and constructing the audio of the specific sound-producing object based on the video to be processed, may be implemented as follows:
S110, based on the video to be processed, extracting video key frames at a reduced frequency and performing a preliminary recognition and analysis to obtain modularized specific sound-producing objects;
S120, performing multi-stage recognition and analysis on the modularized specific sound-producing objects through a depth residual network model to obtain the category of each specific sound-producing object and extract its sound-producing characteristics;
S130, constructing the object category of the specific sound-producing object and the audio of the specific sound-producing object based on the sound-producing characteristics.
In this embodiment, the video to be processed is a video clip provided by the user that needs sound effects. Video key frames are extracted from the video to be processed by down-sampled frame extraction: the frame-extraction frequency is an adjustable parameter with no lower limit, and its upper limit is the frame rate of the video (usually 25 frames per second). Extracting frames from the video to be processed produces a time-ordered stream of static frame images, the frame image stream, which is used for the next step, recognition of the specific sound-producing objects.
There are various ways to extract video key frames at a reduced frequency and perform the preliminary recognition and analysis. In this embodiment, extracting video key frames from the video to be processed is implemented by down-sampled frame extraction, with the following specific steps:
S111: extracting video key frames from the video to be processed at a reduced frame-extraction frequency;
S112: generating a frame image stream from the extracted video key frames;
S113: performing modularized multi-object recognition on the frame image stream with a deep convolutional neural network model.
In this process the video key frames first need to be down-sampled. An object or person worth dubbing must remain in the video to be processed for a certain continuous time; objects that disappear within one or two frames are generally not dubbed, because from the point of view of dubbing technique there is little value in doing so. In practice, a video key frame in the frame image stream is handled as follows: if the frames of the preceding 2 seconds do not contain the recognized object category, the object is considered to start producing sound from this second; if the object already exists in the frames of the preceding 2 seconds, the object is considered to be producing sound continuously, and the minimum sounding duration is set to 5 seconds. In actual operation, different continuous sounding durations and minimum sounding durations can be set for different objects according to their sounding patterns. Video key frames for object recognition are extracted at a reduced frequency: for example, for a video at 25 frames per second, the key-frame sampling frequency after down-sampling is set to 1 frame per second, i.e. one frame out of every 25 is extracted as the recognition input sample for the objects appearing in that second of video, which effectively and simply reduces the number of reads and improves processing speed. The frame-extraction frequency is an adjustable parameter with no lower limit, and its upper limit is the frame rate of the video (usually 25 frames per second), so the user can choose a suitable frame-extraction frequency according to the characteristics of the video sample; a sketch of this down-sampling is given after this paragraph.
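A sketch of the down-sampled key-frame extraction, assuming OpenCV as the decoding library (the patent does not name one); the sampling frequency defaults to 1 frame per second and is capped by the video's own frame rate.

```python
import cv2  # decoding library is an assumption

def extract_keyframes(video_path, sample_fps=1.0):
    """Return (timestamp_seconds, frame) pairs at roughly `sample_fps` frames per second."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0       # upper bound for sampling
    step = max(int(round(native_fps / min(sample_fps, native_fps))), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:                              # e.g. every 25th frame at 25 fps
            frames.append((idx / native_fps, frame))
        idx += 1
    cap.release()
    return frames
```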
Next, modularized multi-object recognition is performed on the frame image stream generated from the video key frames, based on an embedded deep convolutional neural network (Deep CNN). For each static frame image in the frame image stream, the network applies highly non-linear operations to the pixel values of the RGB colour channels to generate probability vectors centred on each recognizable specific sound-producing object. The deep convolutional neural network determines the category of the specific sound-producing object from the maximum value in each probability vector, and determines the size of the current object selection box from the distribution of the probability values in the rectangular region around the centre of the specific sound-producing object. The generated selection box is used to crop a screenshot of the specific sound-producing object from each frame image, so that more detailed recognition can be performed in the second stage. It should be noted that all neural networks involved in this step come from the pre-trained Fast-RCNN networks of the object recognition library in the Python language and the TensorFlow deep learning framework.
This embodiment obtains modularized specific sound-producing objects, and accordingly each level of the deep convolutional neural networks embedded in object recognition follows a modular design. The deep neural network used at any level of object recognition can be swapped at will to suit a special use scene or a special object class; for example, the recognition network that performs the refined classification of shoes and floors is not based on any pre-trained CNN model. The modular design can also be extended to embed several deep convolutional neural networks in each recognition stage, and an ensemble learning algorithm is used to improve the accuracy of overall object recognition, the localization precision, and the accuracy of the refined classification.
For example, the ensemble learning algorithm can use the confidence value each deep neural network assigns to its recognized selection box (the closer to 1, the more certain the network is that the box is correct; the confidence value is the model's probability judgement on whether an object recognition is correct, and can be understood as the model's confidence in a single recognition, a higher confidence meaning a higher probability that the recognition is correct) to take a weighted average of several selection boxes, thereby fine-tuning a more reliable selection box for object localization and producing a higher-quality screenshot for recognition in the subsequent steps; a sketch of this box fusion follows.
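A confidence-weighted fusion of selection boxes from several networks might look like the sketch below; the (cx, cy, w, h) box layout is an assumption.

```python
def fuse_selection_boxes(detections):
    """detections: list of (confidence, (cx, cy, w, h)) pairs for the same object,
    one per ensemble member. Returns the confidence-weighted average box."""
    total = sum(conf for conf, _ in detections) or 1.0
    return tuple(
        sum(conf * box[i] for conf, box in detections) / total
        for i in range(4)
    )
```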
In step S120: carrying out multi-stage recognition analysis processing on the modularized specific sounding object through a depth residual error network model to obtain the type of the specific sounding object and extract the sounding characteristics of the specific sounding object;
in particular, the existing deep neural network cannot identify the details of all objects from a natural image, so that a technical solution framework of a multi-stage object identification network can be provided. In this embodiment, the multi-stage recognition analysis process follows the design concept of "coarse to fine": for each static frame image in a frame image flow, firstly, a primary deep neural network is utilized to perform preliminary analysis and identification processing to obtain general types of specific sound-producing objects (such as characters, shoes and doors and windows), and then, for detailed screenshots of the positions of each object, a new neural network is utilized to perform multistage identification and analysis processing of object subdivision types to obtain the types of the specific sound-producing objects (such as whether the shoes are sports shoes, board shoes or leather shoes). The multi-stage recognition analysis processing of the embodiment can be expanded to an image recognition framework with more stages (for example, three stages or more), and generally, because the definition of a frame-extracted image used in an experiment is limited, a two-stage deep neural network is adopted to perform two-stage recognition analysis processing, so that the currently required functions can be realized.
Here, the process of performing secondary recognition analysis processing by a secondary deep neural network is mainly described: the preliminary identification analysis processing adopts a first-level deep identification network which is derived from a pre-trained Fast-RCNN network; the multistage recognition analysis processing adopts a multistage depth recognition network, and a secondary depth recognition network of the secondary recognition analysis processing is adopted, and is used for carrying out further detailed recognition on individual key objects recognized by the first-stage depth recognition network, for example, for the 'shoes' recognized by the first-stage depth recognition network in a static frame image, the secondary depth recognition network carries out secondary recognition analysis processing on screenshots of the 'shoes' part so as to judge the 'shoe types' and the 'ground types'. More specifically, the present embodiment can recognize four different kinds of detailed footwear (sports shoes, leather shoes, high-heeled shoes, others), and five different kinds of detailed floors (tile floors, plank floors, cement floors, sand floors, others). The specific network architecture of the two-level depth recognition network is designed based on a depth residual error network (Resnet50) with 50 layers. See the following depth residual network model acquisition process:
S121, acquiring a number of images containing specific sound-producing objects, and eliminating the unqualified images to obtain qualified images of specific sound-producing objects;
S122, preprocessing the qualified images to obtain an image data set of qualified specific sound-producing objects, and dividing the data set into a training set and a validation set;
S123, inputting the training set into an initial depth residual network model for training, and verifying the training result on the validation set to obtain a depth residual network model capable of determining the category of a specific sound-producing object.
In the prior art there is no depth residual network pre-trained to recognize shoes, floors or other specific sound-producing objects, so the depth residual network used in this embodiment is not based on any pre-trained parameters: its network parameters are trained entirely from scratch, starting from random numbers. All image sets needed for training come from screenshots of real videos, with the shoe and floor types calibrated manually. The image training set contains at least 17000+ pictures of different sizes, with variable aspect ratios and a maximum resolution not exceeding 480p, whose subjects are mainly shoes, floors and other specific sound-producing objects. When training the depth residual network model, unqualified images, such as very blurry pictures or pictures with incomplete objects, need to be removed, and the remaining qualified images are divided into a training set and a validation set. These pictures differ from public image recognition data sets in that most of them are low-resolution and non-square, which reflects the fact that screenshots of video frames in real usage scenes are irregular in shape and their resolution may also be reduced by video compression; this irregularity and low resolution can be understood as noise contained in the image set, so a network trained on this data set has stronger noise resistance and is better targeted at footwear and floors. The recognition accuracy (computed on a test set) of the five refined floor classes reached by the depth residual network of this embodiment is 73.4%, far higher than random selection (20%) and than majority-class selection (35.2%); the recognition accuracy for the four shoe classes is of the same order; the actual recognition speed reaches 100 pictures per second on a single NVIDIA P100 graphics card.
In addition, the multi-layer perceptron at the end of the network (inherent to Resnet50) is deepened into two layers and combined with a random-deactivation design (Dropout = 0.5) to suit the kinds of recognition categories required for the various specific objects; this also avoids, to a certain extent, the overfitting caused by an excessive number of network parameters (where the recognition result on the training set is far better than on the test set).
The depth residual network (Resnet50) adopted in this embodiment is based on the existing depth residual network architecture and is trained accordingly, so that it can recognize the specific sound-producing object categories required by this embodiment; that is, the recognition computation for a single picture is adapted to the specific usage scene. The depth residual network (Resnet50) reads square RGB images of no fewer than 224 × 224 pixels; for a rectangular input image whose sides are not 224 pixels, this embodiment first deforms it into a regular 224 × 224 × 3 floating-point matrix (three RGB colour channels) with conventional linear interpolation. After the matrix is fed into the network, it is transformed by a series of convolution blocks into feature maps of higher abstraction and smaller size. The convolution block is the basic unit of a conventional convolutional neural network (CNN) design; the convolution blocks used in Resnet50 consist of three to four two-dimensional convolution layers (2D convolution) combined with a random-deactivation design (Dropout), a batch normalization layer and a linear rectification layer (ReLU), and each block is paralleled by a residual path (a residual layer containing only a simple single two-dimensional convolution layer, or a plain copy of the input matrix). The feature map output by the previous block is computed separately through the residual path and the convolution block path, producing two new matrices of identical dimensions that are simply added together to form the input matrix of the next block. The number in the name of the depth residual network (Resnet50) refers to the 50 two-dimensional convolution layers contained in all convolution blocks in total. After all convolution blocks, the depth residual network outputs a 2048-dimensional vector, and then outputs a 1000-dimensional vector through one perceptron layer. Each element of the final output vector represents the probability that the image belongs to a certain category, and the final category label is determined by the maximum probability value. Common depth residual networks similar to Resnet50 include Resnet34, Resnet101 and so on; other common image recognition networks such as Alexnet, VGGnet and Inception net are also applicable in this embodiment, but their results are not as good, so the depth residual network (Resnet50) is chosen. A sketch of the classifier head and the input preprocessing follows.
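A hedged sketch of the second-level classifier: a Resnet50 trained from random initialization with a two-layer perceptron head and Dropout(0.5), plus the 224 × 224 interpolation resize. torchvision is an assumption, and the hidden width of the head is not specified by the patent.

```python
import torch.nn as nn
from torchvision import models, transforms   # torchvision is an assumption

def build_fine_classifier(num_classes=5):    # 5 floor classes; 4 shoe classes analogous
    net = models.resnet50(weights=None)      # random initialization, no pre-trained weights
    net.fc = nn.Sequential(                  # two-layer perceptron head with Dropout(0.5)
        nn.Linear(2048, 512),                # hidden width assumed
        nn.ReLU(),
        nn.Dropout(p=0.5),
        nn.Linear(512, num_classes),
    )
    return net

# Rectangular, low-resolution screenshots are resized to 224 x 224 by bilinear
# interpolation before entering the network.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
```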
In addition, the secondary recognition network architecture, i.e., the deep residual error network (Resnet50), in the present embodiment simultaneously supports the feedback learning mode: when the recognition accuracy of the secondary depth recognition network does not meet the scene requirement, the frame image stream can be subjected to screenshot through an object selection box recognized by the primary depth recognition network, the screenshot is used as a new data set to be manually calibrated, and the secondary depth recognition network, namely a depth residual error network (Resnet50) is finely adjusted. Therefore, when the video content to be processed is changed greatly, the trained model and a small amount of new data can be used for rapidly obtaining higher recognition accuracy, and the preparation period for adapting to a new application scene is shortened. The first-level depth recognition network can also be retrained in stages according to the change of the video type or the change of the application scene so as to adapt to the characteristics of new video data.
Furthermore, the specific sound-producing object information recognized at each level of the two-level deep recognition network is merged and stored in the same format. For each object the following information is stored: the object's coarse class (identified by the first-level network), the confidence value of the coarse class, the object's refined class (identified by the second-level deep recognition network), the confidence value of the refined class, and the width, height and centre of the object's location selection box (measured in frame-image pixels); all of this is passed on for further processing in json file format. An illustrative record is shown below.
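An illustrative json record for one recognized object; the key names paraphrase the fields listed above and are not prescribed by the patent.

```python
import json

detection = {
    "coarse_class": "shoes",                  # first-level network output
    "coarse_confidence": 0.93,
    "fine_class": "sports shoes",             # second-level (Resnet50) output
    "fine_confidence": 0.81,
    "box": {"w": 120, "h": 90, "cx": 312, "cy": 240},   # frame-image pixels
}
print(json.dumps(detection, ensure_ascii=False))
```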
In one embodiment, after the step of selecting the audio corresponding to a plurality of matching scores as the matched audio of the specific sounding object according to the audio matching scores, the following steps may be further included: and mixing all the audios to form a complete audio file, and adding the audio file into the audio track of the video to enable the audio file and the video to be synchronous.
In this embodiment, the generated audio is mixed, and after the audio file required for dubbing and the start/stop time of playing each audio file are found, all the required audio files can be read, and each audio file is converted into a uniform frequency domain signal format, so as to facilitate subsequent editing.
In the present embodiment, audio files in any common format, including wav and mp3, can be read, which improves the capability of using scenes and generalizing to other specific audio libraries.
The specific process of mixing all the audios is as follows. Each audio segment is intelligently stretched or compressed to the length of time required for dubbing. The silent parts at the beginning and end of the audio are cut off first, so that the dubbing and the picture that triggers it happen at the same time and the dubbing effect is optimal. It is then checked whether the duration of the audio, after the leading and trailing silence has been removed, is longer than the time it needs to be played. If it is, the audio is cut to the playing duration required for dubbing, and a fade-out is applied at the end to remove the abrupt stop. If it is not, the audio is played in a loop until it reaches the playing duration required for dubbing; at the joins between consecutive loops, an overlapping cross-fade of a certain length is used, so that the looping points are seamless, the long audio segment sounds natural and complete, and the user gets the best listening experience. The fade-in/fade-out length equals the overlap length and is determined from the audio duration by a piecewise function: if the original audio is shorter than 20 seconds, the overlap and fade time is set to 10% of the audio duration, so that the overlap is moderate, the preceding and following segments transition smoothly, and more of the non-overlapping part of a short audio is kept to be played to the user; if the original audio is longer than 20 seconds, the overlap and fade time is set to 2 seconds, which prevents long audio from having an unnecessarily long transition and plays as much non-overlapping audio as possible. A sketch of this fitting procedure is given below.
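A sketch of the trim/cut/loop logic, under the assumption that pydub is used for audio editing (the patent only describes the steps, not the toolkit); the silence threshold and the 1-second fade-out are illustrative.

```python
from pydub import AudioSegment
from pydub.silence import detect_leading_silence

def trim_silence(seg: AudioSegment, thresh_db=-50.0) -> AudioSegment:
    """Cut the silent parts at the beginning and end of the clip."""
    start = detect_leading_silence(seg, silence_threshold=thresh_db)
    end = detect_leading_silence(seg.reverse(), silence_threshold=thresh_db)
    return seg[start:len(seg) - end]

def fit_to_duration(seg: AudioSegment, target_ms: int) -> AudioSegment:
    """Trim head/tail silence, then cut with a fade-out or loop with cross-fades."""
    seg = trim_silence(seg)
    fade_ms = min(1000, target_ms)
    if len(seg) >= target_ms:
        return seg[:target_ms].fade_out(fade_ms)         # cut, fade out the abrupt end
    # piecewise overlap: 10% of duration for clips < 20 s, otherwise 2 s
    overlap = int(len(seg) * 0.10) if len(seg) < 20_000 else 2_000
    out = seg
    while len(out) < target_ms:
        out = out.append(seg, crossfade=overlap)         # seamless loop joins
    return out[:target_ms].fade_out(fade_ms)
```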
Finally, the audio frequencies processed according to the steps are combined together, and added into the audio track of the video, and a new video file with dubbing is output, so that the whole dubbing process is completed.
Example 2: a video object sound effect searching and matching system is shown in FIG. 2 and comprises an acquisition construction module 100, a first processing module 200, a second processing module 300, a matching score acquisition module 400 and a selection matching module 500;
the acquisition and construction module 100 is configured to acquire a category of a specific sound-generating object and construct an audio frequency of the specific sound-generating object based on a video to be processed;
the first processing module 200 is configured to process the category of the specific sound generating object in the video to be processed and the audio introduction of the audio to obtain a first matching score;
the second processing module 300 is configured to obtain the BERT vector of the object category of the specific sound-producing object and the BERT vector of the audio introduction, compute the cosine similarity between the two BERT vectors, and take the cosine similarity as a neural network matching score;
the matching score obtaining module 400 is configured to obtain a video and audio matching score based on the first matching score and the neural network matching score;
the selection and matching module 500 is configured to select, according to the video-audio matching scores, the audios corresponding to the top several matching scores as the audios matched to the specific sound-producing object.
In one embodiment, a mixing processing module 600 may be further included to mix all audio to form a complete audio file, and add the audio file to the audio track of the video to synchronize the audio file and the video.
Simple and easy-to-use function interfaces are provided in the mixing processing module 600, and a dubbed video can be generated with one click, which greatly improves the user's working efficiency. Although the mixing processing module 600 uses common audio tools, the specific mixing steps and parameters of the method are designed specifically for films, dramas and short videos; the silence-removal and special-effect-audio compression or extension methods mentioned in the method embodiments solve precisely the dubbing problem of these categories of video, namely that the length of the audio in the special-effect audio library often does not match the dubbing requirement of the video. These specific audio processing parameters are also the most suitable for this embodiment and are not obtained with other techniques or audio processing parameters.
In one embodiment, the acquisition and construction module 100 is configured to parse the audio into an audio introduction and audio keywords, wherein the audio introduction is a text introducing the content of the audio, and the audio keywords comprise at least three words describing the audio, including the category name of the specific sound-producing object and the category name of the sound it produces.
In one embodiment, the first processing module 200 is configured to:
performing word segmentation processing on the object type and the audio introduction of the specific sound-producing object to obtain words;
respectively obtaining the proportions of words of the object category of the specific sound-producing object that coincide with the audio introduction and with the audio keywords, to obtain a first proportion and a second proportion, and performing weighted averaging on the first proportion and the second proportion to obtain a word matching score, wherein word matching score = (word-coincidence proportion between the object category and the audio introduction) × audio introduction weight + (word-coincidence proportion between the object category and the audio keywords) × audio keyword weight, and audio introduction weight + audio keyword weight = 1;
obtaining an object-category TF-IDF vector based on the statistics of the audio introductions, and taking the first cosine similarity between the object-category TF-IDF vector and the audio-introduction TF-IDF vector as a TF-IDF matching score, wherein TF-IDF matching score = cosine_similarity(object-category TF-IDF vector, audio-introduction TF-IDF vector);
and performing weighted averaging on the word matching score and the TF-IDF matching score to obtain the first matching score, wherein first matching score = word matching score × word weight + TF-IDF matching score × TF-IDF weight, and word weight + TF-IDF weight = 1.
In one embodiment, the matching score obtaining module 400 is configured to:
take the weighted average of the first matching score and the neural network matching score as the video and audio matching score, where video and audio matching score = first matching score × first weight + neural network matching score × neural network weight, and first weight + neural network weight = 1 (a one-line sketch follows).
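For illustration, the weighted fusion reduces to the following one-liner; the weight values are hypothetical and only constrained to sum to 1.

def video_audio_matching_score(first_score, nn_score, w_first=0.4, w_nn=0.6):
    # Weighted average of the text-statistics score and the BERT-based score.
    return first_score * w_first + nn_score * w_nn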
In one embodiment, the acquisition and construction module 100 is configured to:
based on the video to be processed, extract video key frames at a reduced sampling frequency and perform primary recognition and analysis to obtain modularized specific sound-producing objects;
perform multi-stage recognition and analysis on the modularized specific sound-producing objects through a deep residual network model to obtain the categories of the specific sound-producing objects and extract their sound-production characteristics;
and construct the object categories of the specific sound-producing objects and the specific-sound-producing-object audio based on the sound-production characteristics (a sketch of the extraction and recognition steps follows the list).
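As a non-authoritative sketch of down-sampled key-frame extraction followed by residual-network recognition, the code below uses OpenCV and a torchvision ResNet-50 as a stand-in for the deep residual network model; the sampling interval, the pretrained weights and the omission of the multi-stage refinement are all simplifying assumptions.

import cv2
import torch
from torchvision import models, transforms

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()

def detect_sounding_objects(video_path, every_n_frames=30):
    # Sample one frame out of every `every_n_frames` frames and classify it.
    cap = cv2.VideoCapture(video_path)
    predictions = []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n_frames == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            batch = preprocess(rgb).unsqueeze(0)
            with torch.no_grad():
                class_id = int(resnet(batch).argmax(dim=1))
            predictions.append((index, class_id))  # (frame index, predicted class)
        index += 1
    cap.release()
    return predictions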
All modules in the video object sound effect searching and matching system can be realized wholly or partially in software, in hardware, or in a combination of the two. The modules can be embedded in, or independent of, the processor of the computer device or mobile terminal in hardware form, or stored in the memory of the computer device or mobile terminal in software form, so that the processor can call them and execute the operations corresponding to each module.
For the system embodiment, since it is basically similar to the method embodiment, the description is relatively brief; for relevant details, refer to the description of the method embodiment.
Example 3:
a computer-readable storage medium, in which a computer program is stored which, when executed by a processor, carries out the following method steps:
acquiring the category of a specific sound-producing object and constructing the audio of the specific sound-producing object based on the video to be processed;
processing the category of the specific sound-producing object in the video to be processed and the audio introduction of the audio to obtain a first matching score;
obtaining the BERT vector of the object category of the specific sound-producing object and the BERT vector of the audio introduction, computing the cosine similarity of the two BERT vectors, and taking the cosine similarity as a neural network matching score;
obtaining a video and audio matching score based on the first matching score and the neural network matching score;
and selecting, according to the video and audio matching scores, the audio corresponding to several matching scores as the matched audio of the specific sound-producing object (a sketch of the BERT-based step follows the list).
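The following is a minimal sketch of the neural network matching score, assuming a Hugging Face bert-base-chinese model and mean pooling of the last hidden states; the disclosure only specifies that BERT vectors are compared by cosine similarity, so the model name and pooling strategy are assumptions.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
bert = AutoModel.from_pretrained("bert-base-chinese").eval()

def bert_vector(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state   # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)            # mean-pooled sentence vector

def neural_network_matching_score(object_category, audio_intro):
    v1, v2 = bert_vector(object_category), bert_vector(audio_intro)
    return float(torch.nn.functional.cosine_similarity(v1, v2, dim=0))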
In one embodiment, when the processor executes the computer program, the audio is parsed into an audio introduction and audio keywords, wherein the audio introduction is the introductory text of the audio, and the audio keywords include at least three words describing the audio, these words including the category name of the specific sound-producing object and the category name of the sound it produces.
In one embodiment, when the processor executes the computer program, the processing of the category of the specific sound-producing object in the video to be processed and the audio introduction to obtain the first matching score is implemented as follows:
performing word segmentation on the object category of the specific sound-producing object and on the audio introduction to obtain words;
obtaining the word overlap ratio of the object category with the audio introduction and with the audio keywords respectively, giving a first ratio and a second ratio, and taking their weighted average as the word matching score, where word matching score = (word overlap ratio of object category and audio introduction) × audio introduction weight + (word overlap ratio of object category and audio keywords) × audio keyword weight, and audio introduction weight + audio keyword weight = 1;
obtaining the object category TF-IDF vector based on the statistics of the audio introductions, computing the first cosine similarity between the object category TF-IDF vector and the audio introduction TF-IDF vector, and taking the first cosine similarity as the TF-IDF matching score, i.e. TF-IDF matching score = cosine_similarity(object category TF-IDF vector, audio introduction TF-IDF vector);
and taking the weighted average of the word matching score and the TF-IDF matching score as the first matching score, where first matching score = word matching score × word weight + TF-IDF matching score × TF-IDF weight, and word weight + TF-IDF weight = 1.
In one embodiment, when the processor executes the computer program, the obtaining of the video and audio matching score based on the first matching score and the neural network matching score is implemented by:
taking the weighted average of the first matching score and the neural network matching score as the video and audio matching score, where video and audio matching score = first matching score × first weight + neural network matching score × neural network weight, and first weight + neural network weight = 1.
In one embodiment, when the processor executes the computer program, the obtaining of the category of the specific sound generating object and the constructing of the audio of the specific sound generating object based on the video to be processed are specifically:
extracting, based on the video to be processed, video key frames at a reduced sampling frequency and performing primary recognition and analysis to obtain modularized specific sound-producing objects;
performing multi-stage recognition and analysis on the modularized specific sound-producing objects through a deep residual network model to obtain the categories of the specific sound-producing objects and extract their sound-production characteristics;
and constructing the object categories of the specific sound-producing objects and the specific-sound-producing-object audio based on the sound-production characteristics.
In one embodiment, when the processor executes the computer program, after the step of selecting, according to the video and audio matching scores, the audio corresponding to several matching scores as the matched audio of the specific sound-producing object, the following step is further implemented:
mixing all the audio into a complete audio file, and adding the audio file to the audio track of the video so that the audio file and the video are synchronized.
Example 4:
in one embodiment, a video object sound effect search matching device is provided; the device can be a server or a mobile terminal. The video object sound effect searching and matching device comprises a processor, a memory, a network interface and a database connected through a system bus. The processor of the device provides computing and control capability. The memory of the device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database is used for storing all data of the video object sound effect searching and matching device. The network interface of the device is used for communicating with external terminals through a network connection. When executed by the processor, the computer program implements the video object sound effect searching and matching method.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be noted that:
reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Thus, the appearances of the phrase "one embodiment" or "an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
In addition, it should be noted that the specific embodiments described in the present specification may differ in the shape of the components, the names of the components, and the like. All equivalent or simple changes of the structure, the characteristics and the principle of the invention which are described in the patent conception of the invention are included in the protection scope of the patent of the invention. Various modifications, additions and substitutions for the specific embodiments described may be made by those skilled in the art without departing from the scope of the invention as defined in the accompanying claims.

Claims (10)

1. A method for searching and matching sound effects of video objects is characterized by comprising the following steps:
acquiring the category of a specific sound-producing object and constructing the audio of the specific sound-producing object based on the video to be processed;
processing the category of a specific sound-producing object in the video to be processed and the audio introduction of the audio to obtain a first matching score;
obtaining the BERT vector of the object category of the specific sound-producing object and the BERT vector of the audio introduction, computing the cosine similarity of the two BERT vectors, and taking the cosine similarity as a neural network matching score;
obtaining a video and audio matching score based on the first matching score and the neural network matching score;
and selecting the audio corresponding to a plurality of matching scores as the matched audio of the specific sound-producing object according to the audio matching scores.
2. The video object sound effect search matching method according to claim 1, wherein the audio is parsed into an audio introduction and audio keywords, the audio introduction being the introductory text of the audio, and the audio keywords comprising at least three words describing the audio, these words including the category name of the specific sound-producing object and the category name of the sound it produces.
3. The video object sound effect search matching method according to claim 2, wherein the category and the audio introduction of a specific sound object in the video to be processed are processed to obtain a first matching score, specifically:
performing word segmentation on the object category of the specific sound-producing object and on the audio introduction to obtain words;
obtaining the word overlap ratio of the object category with the audio introduction and with the audio keywords respectively, giving a first ratio and a second ratio, and taking their weighted average as the word matching score, where word matching score = (word overlap ratio of object category and audio introduction) × audio introduction weight + (word overlap ratio of object category and audio keywords) × audio keyword weight, and audio introduction weight + audio keyword weight = 1;
obtaining the object category TF-IDF vector based on the statistics of the audio introductions, computing the first cosine similarity between the object category TF-IDF vector and the audio introduction TF-IDF vector, and taking the first cosine similarity as the TF-IDF matching score, i.e. TF-IDF matching score = cosine_similarity(object category TF-IDF vector, audio introduction TF-IDF vector);
and taking the weighted average of the word matching score and the TF-IDF matching score as the first matching score, where first matching score = word matching score × word weight + TF-IDF matching score × TF-IDF weight, and word weight + TF-IDF weight = 1.
4. The video object sound effect search matching method according to claim 1, wherein the video and audio matching score is obtained based on the first matching score and the neural network matching score, and specifically comprises:
taking the weighted average of the first matching score and the neural network matching score as the video and audio matching score, where video and audio matching score = first matching score × first weight + neural network matching score × neural network weight, and first weight + neural network weight = 1.
5. The video object sound effect search matching method according to claim 1, wherein the obtaining of the category of the specific sound generating object and the construction of the audio of the specific sound generating object based on the video to be processed are specifically:
extracting, based on the video to be processed, video key frames at a reduced sampling frequency and performing primary recognition and analysis to obtain modularized specific sound-producing objects;
performing multi-stage recognition and analysis on the modularized specific sound-producing objects through a deep residual network model to obtain the categories of the specific sound-producing objects and extract their sound-production characteristics;
and constructing the object categories of the specific sound-producing objects and the specific-sound-producing-object audio based on the sound-production characteristics.
6. The video object sound effect search matching method according to claim 1, wherein after the step of selecting, according to the video and audio matching scores, the audio corresponding to several matching scores as the matched audio of the specific sound-producing object, the method further comprises the following step:
mixing all the audio into a complete audio file, and adding the audio file to the audio track of the video so that the audio file and the video are synchronized.
7. A video object sound effect searching and matching system is characterized by comprising an acquisition construction module, a first processing module, a second processing module, a matching score acquisition module and a selection matching module;
the acquisition and construction module is used for acquiring the category of the specific sound-producing object and constructing the audio of the specific sound-producing object based on the video to be processed;
the first processing module is used for processing the category of a specific sound-producing object in the video to be processed and the audio introduction of the audio to obtain a first matching score;
the second processing module is used for obtaining the BERT vector of the object category of the specific sound-producing object and the BERT vector of the audio introduction, computing the cosine similarity of the two BERT vectors, and taking the cosine similarity as a neural network matching score;
the matching score obtaining module is used for obtaining a video and audio matching score based on the first matching score and the neural network matching score;
and the selection matching module is used for selecting the audio corresponding to the plurality of matching scores as the matched audio of the specific sound-producing object according to the audio matching scores.
8. The system for searching and matching audio effects of video objects according to claim 7, further comprising a mixing module for mixing all audio to form a complete audio file, and adding the audio file to the audio track of the video to synchronize the audio file and the video.
9. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method steps of one of claims 1 to 6.
10. A video object sound effect search matching apparatus comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the method steps of any one of claims 1 to 6 when executing the computer program.
CN202010518575.8A 2020-06-09 2020-06-09 Video object sound effect searching and matching method, system, device and readable storage medium Active CN111681679B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010518575.8A CN111681679B (en) 2020-06-09 2020-06-09 Video object sound effect searching and matching method, system, device and readable storage medium

Publications (2)

Publication Number Publication Date
CN111681679A true CN111681679A (en) 2020-09-18
CN111681679B CN111681679B (en) 2023-08-25

Family

ID=72454214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010518575.8A Active CN111681679B (en) 2020-06-09 2020-06-09 Video object sound effect searching and matching method, system, device and readable storage medium

Country Status (1)

Country Link
CN (1) CN111681679B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010049826A1 (en) * 2000-01-19 2001-12-06 Itzhak Wilf Method of searching video channels by content
US20150143416A1 (en) * 2013-11-21 2015-05-21 Thomson Licensing Method and apparatus for matching of corresponding frames in multimedia streams
CN104394331A (en) * 2014-12-05 2015-03-04 厦门美图之家科技有限公司 Video processing method for adding matching sound effect in video picture
US20170357720A1 (en) * 2016-06-10 2017-12-14 Disney Enterprises, Inc. Joint heterogeneous language-vision embeddings for video tagging and search
CN109089156A (en) * 2018-09-19 2018-12-25 腾讯科技(深圳)有限公司 A kind of effect adjusting method, device and terminal
CN109493888A (en) * 2018-10-26 2019-03-19 腾讯科技(武汉)有限公司 Caricature dubbing method and device, computer readable storage medium, electronic equipment
CN110162593A (en) * 2018-11-29 2019-08-23 腾讯科技(深圳)有限公司 A kind of processing of search result, similarity model training method and device
CN109920409A (en) * 2019-02-19 2019-06-21 标贝(深圳)科技有限公司 A kind of speech search method, device, system and storage medium
CN110839173A (en) * 2019-11-18 2020-02-25 上海极链网络科技有限公司 Music matching method, device, terminal and storage medium
CN111026840A (en) * 2019-11-26 2020-04-17 腾讯科技(深圳)有限公司 Text processing method, device, server and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SUNIL RAVULAPALLI et al.: "Association of Sound to Motion in Video using Perceptual Organization" *
李应 (Li Ying): "Audio data retrieval based on local search", vol. 3, no. 3, pp. 259-264 *

Also Published As

Publication number Publication date
CN111681679B (en) 2023-08-25

Similar Documents

Publication Publication Date Title
CN106534548B (en) Voice error correction method and device
CN110364146B (en) Speech recognition method, speech recognition device, speech recognition apparatus, and storage medium
CN114465737B (en) Data processing method and device, computer equipment and storage medium
CN111681678B (en) Method, system, device and storage medium for automatically generating sound effects and matching videos
Poignant et al. Unsupervised speaker identification in TV broadcast based on written names
CN111626049B (en) Title correction method and device for multimedia information, electronic equipment and storage medium
US11973993B2 (en) Machine learning based media content annotation
CN112487139A (en) Text-based automatic question setting method and device and computer equipment
CN110782918B (en) Speech prosody assessment method and device based on artificial intelligence
CN114143479B (en) Video abstract generation method, device, equipment and storage medium
CN111639529A (en) Speech technology detection method and device based on multi-level logic and computer equipment
CN110782875B (en) Voice rhythm processing method and device based on artificial intelligence
CN114363695B (en) Video processing method, device, computer equipment and storage medium
CN113642536B (en) Data processing method, computer device and readable storage medium
CN110597958A (en) Text classification model training and using method and device
CN111681680B (en) Method, system, device and readable storage medium for acquiring audio frequency by video recognition object
CN111681676B (en) Method, system, device and readable storage medium for constructing audio frequency by video object identification
CN111681679B (en) Video object sound effect searching and matching method, system, device and readable storage medium
CN115965810A (en) Short video rumor detection method based on multi-modal consistency
CN113539235B (en) Text analysis and speech synthesis method, device, system and storage medium
CN111681677B (en) Video object sound effect construction method, system, device and readable storage medium
CN115169472A (en) Music matching method and device for multimedia data and computer equipment
CN116561294A (en) Sign language video generation method and device, computer equipment and storage medium
CN112800177A (en) FAQ knowledge base automatic generation method and device based on complex data types
CN112530456B (en) Language category identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant