CN110929098A - Video data processing method and device, electronic equipment and storage medium

Info

Publication number: CN110929098A
Application number: CN201911111883.2A
Authority: CN (China)
Other languages: Chinese (zh)
Other versions: CN110929098B (granted publication)
Inventors: 李超, 马连洋, 衡阵
Current assignee: Tencent Technology Shenzhen Co Ltd
Original assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Prior art keywords: text, target video, sentence, segment, similarity
Legal status: Granted; Active

Classifications

    • G PHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval of video data
    • G06F16/7844 Retrieval characterised by metadata automatically derived from the content, using original textual content or text extracted from visual content or a transcript of audio data
    • G06F16/7867 Retrieval characterised by manually generated metadata, e.g. tags, keywords, comments, title and artist information
    • G06F18/00 Pattern recognition; G06F18/20 Analysing; G06F18/22 Matching criteria, e.g. proximity measures


Abstract

The invention provides a video data processing method and apparatus, an electronic device, and a storage medium. The method includes: acquiring the title text and the content text of a target video; performing sentence smoothness detection on the content text to obtain the sentence smoothness corresponding to the content text; when it is determined, based on the sentence smoothness, that a descriptive segment describing the video pictures exists in the target video, acquiring a plurality of clause texts corresponding to the content text, where the descriptive segment includes a sub-segment whose content subject is unrelated to the content subject of the target video; respectively performing similarity matching between each clause text and the title text to obtain a plurality of corresponding similarity values; and determining, based on the similarity values, the relative relationship between the duration of the sub-segment of the descriptive segment and the duration of the target video. With the method and the device, whether the duration of the sub-segment of the descriptive segment of the target video is too long can be effectively identified.

Description

Video data processing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a method and an apparatus for processing video data, an electronic device, and a storage medium.
Background
With the popularization of mobile terminals and the development of mobile social media, short videos, as a main product line of current information streams, have become one of the important ways for users to obtain information and entertainment. To help users better understand the content of a short video, the short video usually includes a descriptive segment (i.e., narration) introducing the video content; however, the narration may contain a sub-segment (i.e., padding) that is irrelevant to the video content. The related art cannot determine the relative relationship between the padding duration and the video duration, and thus cannot effectively identify whether the padding of a short video is too long, which results in a poor user experience.
Disclosure of Invention
The embodiments of the present invention provide a video data processing method and apparatus, an electronic device, and a storage medium, which can effectively identify whether the padding of a short video is too long.
The embodiment of the invention provides a video data processing method, which comprises the following steps:
acquiring a title text and a content text of a target video;
detecting the sentence smoothness of the content text to obtain the sentence smoothness corresponding to the content text;
when it is determined, based on the sentence smoothness, that a descriptive segment describing the video pictures exists in the target video, acquiring a plurality of clause texts corresponding to the content text; the descriptive segment comprises a sub-segment whose content subject is unrelated to the content subject of the target video;
respectively carrying out similarity matching on each clause text and the title text to obtain a plurality of corresponding similarity values;
and determining the relative relation between the time length of the sub-segment in the descriptive segment and the time length of the target video based on the similarity value.
An embodiment of the present invention provides a video data processing apparatus, including:
the first acquisition module is used for acquiring a title text and a content text of a target video;
the detection module is used for detecting the sentence smoothness of the content text to obtain the sentence smoothness corresponding to the content text;
a second obtaining module, configured to obtain, when it is determined based on the sentence smoothness that a descriptive segment describing the video pictures exists in the target video, a plurality of clause texts corresponding to the content text; the descriptive segment comprises a sub-segment whose content subject is unrelated to the content subject of the target video;
the matching module is used for respectively carrying out similarity matching on each clause text and the title text to obtain a plurality of corresponding similarity values;
and the determining module is used for determining the relative relation between the duration of the sub-segment in the descriptive segment and the duration of the target video based on the similarity value.
In the above scheme, the detection module is further configured to perform clause processing on the content text to obtain a plurality of corresponding clause texts;
input each clause text into a sentence smoothness detection model respectively to obtain a first sentence smoothness score corresponding to the clause text;
and weight the first sentence smoothness scores corresponding to the clause texts to obtain a second sentence smoothness score corresponding to the content text, where the second sentence smoothness score is used to represent the sentence smoothness of the content text.
In the above scheme, the second obtaining module is further configured to obtain a statement smoothness reference score;
acquiring the ratio of the second sentence smoothness score to the sentence smoothness reference score;
when the ratio is larger than a ratio threshold value, determining that a descriptive section for describing a video picture exists in the target video.
In the above scheme, the matching module is further configured to perform vector conversion on the title text to obtain a corresponding title vector;
respectively carrying out vector conversion on each clause text to obtain corresponding text vectors;
and respectively carrying out similarity matching on each text vector and the title vector to obtain corresponding similarity values.
In the foregoing solution, the determining module is further configured to rank the similarity values based on a sequence of each clause text in the content text to obtain a first sequence including a first number of similarity values and a second sequence including a second number of similarity values;
and determining the relative relation between the time length of the sub-segment in the descriptive segment and the time length of the target video based on the first sequence and the second sequence.
In the foregoing solution, the determining module is further configured to extract a maximum similarity value from the first sequence as a first similarity value, and extract a maximum similarity value from the second sequence as a second similarity value;
comparing the first similarity value with the second similarity value to obtain a comparison result;
and determining the relative relation between the time length of the sub-segments in the descriptive segments and the time length of the target video based on the comparison result.
In the foregoing scheme, the determining module is further configured to perform weighted averaging on the similarity values of the first number to obtain a corresponding third similarity value, and perform weighted averaging on the similarity values of the second number to obtain a corresponding fourth similarity value;
comparing the third similarity value with the fourth similarity value to obtain a comparison result;
and determining the relative relation between the time length of the sub-segments in the descriptive segments and the time length of the target video based on the comparison result.
In the above scheme, the determining module is further configured to sort the similarity values based on the sequence of each clause text in the content text to obtain a corresponding similarity value sequence;
sequentially comparing the similarity values in the similarity sequence with a similarity threshold value, and determining the sequence number of the first similarity value exceeding the similarity threshold value in the similarity value sequence;
and determining the relative relation between the time length of the sub-segment in the descriptive segment and the time length of the target video based on the sequence number and the similarity value sequence.
In the above scheme, the apparatus further includes a recommending module, where the recommending module is configured to obtain a ratio of a duration of the sub-segment in the descriptive segment to a duration of the target video;
and when the ratio does not exceed a proportion threshold, adding the target video into a video library to be recommended.
An embodiment of the present invention provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the video data processing method provided by the embodiment of the invention when executing the executable instructions stored in the memory.
The embodiment of the invention provides a storage medium, which stores executable instructions and is used for causing a processor to execute the executable instructions so as to realize the video data processing method provided by the embodiment of the invention.
The embodiment of the invention has the following beneficial effects:
the method comprises the steps of detecting sentence smoothness of a content text of a target video, determining whether a descriptive section for describing a video picture exists in the target video, when the descriptive section exists, performing sentence division processing on the content text to obtain a plurality of sentence division texts corresponding to the content text, respectively performing similarity matching on each sentence division text and a title text of the target video, determining the relative relation between the duration of the sub-section in the descriptive section and the duration of the target video, and further effectively identifying whether the target video is too long.
Drawings
Fig. 1 is a schematic diagram of an alternative architecture of a video data processing system according to an embodiment of the present invention;
fig. 2 is an alternative structural schematic diagram of an electronic device according to an embodiment of the present invention;
fig. 3 is a schematic flow chart of an alternative video data processing method according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart illustrating obtaining semantic representations of texts according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a target video recommendation system according to an embodiment of the present invention;
fig. 6 is a schematic flow chart of an alternative video data processing method according to an embodiment of the present invention;
FIG. 7 is a schematic flow chart of video watching according to an embodiment of the present invention;
fig. 8 is a schematic flow chart of an alternative video data processing method according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a video data processing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, the terms "first/second/third/fourth" are used only to distinguish similar objects and do not denote a particular order or importance. It should be understood that "first/second/third/fourth" may be interchanged in a particular order or sequence where permissible, so that the embodiments of the invention described herein can be practiced in an order other than that shown or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Computer Vision technology (CV) is a science that studies how to make machines "see"; it refers to using cameras and computers in place of human eyes to perform machine vision tasks such as recognition, tracking and measurement on targets, and to perform further image processing so that the processed image is more suitable for human eyes to observe or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric technologies such as face recognition and fingerprint recognition.
The key technologies of Speech Technology are Automatic Speech Recognition (ASR), speech synthesis (Text-To-Speech, TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, in which speech has become one of the most promising modes of human-computer interaction.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable efficient communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Therefore, the research in this field will involve natural language, i.e. the language that people use everyday, so it is closely related to the research of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic question and answer, knowledge mapping, and the like.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and teaching-based learning.
Autonomous driving technology generally includes technologies such as high-precision maps, environment perception, behavior decision-making, path planning and motion control, and has broad application prospects.
with the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
The scheme provided by the embodiment of the application relates to technologies such as artificial intelligence natural language processing and the like, and is specifically explained by the following embodiment:
the inventor of the invention finds that the technologies of realizing matching between texts in the related technology mainly include similarity calculation, cross matching, interactive matching and the like in the process of implementing the embodiment of the invention. The similarity calculation is mainly a method of vectorizing texts to be matched and then calculating the similarity between vectors corresponding to the texts, but the method is more suitable for the cases that all the texts are short sentences, because the vectors of the short sentences can sufficiently represent semantic information. The cross matching needs to realize local information matching between matched texts, and has a remarkable effect on a local information sensitive Natural Language Processing (NLP) task. Interactive matching generally uses a twin network to interpret information of texts needing matching, and information sharing is realized between structural layers, so that the interactive matching is suitable for matching between long texts.
Because the title of a short video is a short text (generally within 40 words) while the "audio-to-text" transcript (i.e., the content text) of a short video is a long text (generally over 300 words), the above text-matching methods cannot be applied to matching between a long text and a short text; such matching is the core difficulty of current matching algorithms. In the scenario of identifying short-video padding, how to construct an appropriate matching method based on the title text and the content text of the short video is the key to the whole problem, and there is currently no mature method in the industry for solving the problem of overly long short-video padding.
In view of this, an embodiment of the present invention provides a video data processing method in which sentence smoothness detection is performed on the content text of a target video to determine whether a descriptive segment describing the video pictures exists in the target video. When the descriptive segment exists, clause processing is performed on the content text to obtain a plurality of clause texts corresponding to the content text, and similarity matching is performed between each clause text and the title text of the target video to determine the relative relationship between the duration of the sub-segment of the descriptive segment and the duration of the target video. This achieves proper matching between a long text and a short text, and thus effectively identifies whether the padding of the target video is too long.
Referring to fig. 1, fig. 1 is an alternative architecture diagram of a video data processing system 100 according to an embodiment of the present invention, in order to support an exemplary application, a user terminal 400 (illustratively, a terminal 400-1, a terminal 400-2, and a terminal 400-N) is connected to an information streaming platform 200 through a network 300, where the terminal 400-1 is located at a short video distribution side, the terminal 400-2 and the terminal 400-N are located at a short video receiving side, and the network 300 may be a wide area network or a local area network, or a combination of both, and uses a wireless link to implement data transmission.
As shown in fig. 1, a user opens the application client on the user terminal 400-1, publishes a recorded target video, and sends the video data of the target video to the information flow platform 200, where the video data includes the title text and the content text. The information flow platform 200 is configured to: obtain the title text and the content text of the target video; perform sentence smoothness detection on the content text to obtain the sentence smoothness corresponding to the content text; obtain a plurality of clause texts corresponding to the content text when it is determined, based on the sentence smoothness, that a descriptive segment describing the video pictures exists in the target video, where the descriptive segment includes a sub-segment whose content subject is unrelated to the content subject of the target video; respectively perform similarity matching between each clause text and the title text to obtain a plurality of corresponding similarity values; and determine the relative relationship between the duration of the sub-segment of the descriptive segment and the duration of the target video based on the similarity values.
In practical applications, whether the padding of the target video is too long can be determined based on the relative relationship between the duration of the sub-segment of the descriptive segment and the duration of the target video. For example, the ratio of the duration of the sub-segment to the duration of the target video can be obtained; when the obtained ratio does not exceed a proportion threshold, the target video is determined to be a video whose padding is not too long, and the target video is added to a video library to be recommended to the terminals 400-2 to 400-N corresponding to other users.
Referring to fig. 2, fig. 2 is an optional schematic structural diagram of an electronic device 200 according to an embodiment of the present invention. Taking the electronic device implemented as the information flow platform 200 as an example, the electronic device 200 shown in fig. 2 includes: at least one processor 210, a memory 250, at least one network interface 220, and a user interface 230. The various components in the electronic device 200 are coupled together by a bus system 240. It will be appreciated that the bus system 240 is used to enable communications among these components. The bus system 240 includes a power bus, a control bus and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as the bus system 240 in fig. 2.
The Processor 210 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The user interface 230 includes one or more output devices 231, including one or more speakers and/or one or more visual display screens, that enable the presentation of media content. The user interface 230 also includes one or more input devices 232, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 250 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 250 optionally includes one or more storage devices physically located remotely from processor 210.
The memory 250 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 250 described in embodiments of the invention is intended to comprise any suitable type of memory.
In some embodiments, memory 250 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.
An operating system 251 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 252 for communicating with other computing devices via one or more (wired or wireless) network interfaces 220, the exemplary network interface 220 including: Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), and the like;
a presentation module 253 to enable presentation of information (e.g., a user interface for operating peripherals and displaying content and information) via one or more output devices 231 (e.g., a display screen, speakers, etc.) associated with the user interface 230;
an input processing module 254 for detecting one or more user inputs or interactions from one of the one or more input devices 232 and translating the detected inputs or interactions.
In some embodiments, the video data processing apparatus provided by the embodiments of the present invention may be implemented in software, and fig. 2 shows a video data processing apparatus 255 stored in a memory 250, which may be software in the form of programs and plug-ins, and includes the following software modules: the first obtaining module 2551, the detecting module 2552, the second obtaining module 2553, the matching module 2554 and the determining module 2555 are logical and thus can be arbitrarily combined or further split according to the implemented functions. The functions of the respective modules will be explained below.
In other embodiments, the video data processing apparatus provided in the embodiments of the present invention may be implemented in hardware, and for example, the video data processing apparatus provided in the embodiments of the present invention may be a processor in the form of a hardware decoding processor, which is programmed to execute the video data processing method provided in the embodiments of the present invention, for example, the processor in the form of the hardware decoding processor may employ one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
The following describes a method for processing video data according to an embodiment of the present invention, with reference to an exemplary application of the method for processing video data according to an embodiment of the present invention, when the method is implemented as an information stream platform.
Referring to fig. 3, fig. 3 is an alternative flowchart of a video data processing method according to an embodiment of the present invention, which will be described with reference to the steps shown in fig. 3.
Step 301: and the information flow platform acquires the title text and the content text of the target video.
An information stream is a scrollable stream of content presented in tiles that are similar in appearance and displayed next to each other, for example, a compilation of selected information (e.g., articles or news listings) or product details (e.g., product listings, service listings, etc.). In practical applications, every user of a news client encounters the information-stream product form; an information-stream product carries massive information, can continuously refresh new, real-time content, and can provide appropriate content to users in appropriate scenarios.
In actual implementation, the information stream products received by users, such as viewpoint videos, are recommended through manual operation or recommendation algorithms. In the big data era, because the content updated by the media is massive, manual operation is often limited to hot content; therefore, the information flow platform must rely on the target video data in the information stream to construct an algorithm model for recommending the content of the information stream. The information flow platform uses specific field information in the target video data, such as the title text and the content text, to construct the algorithm model to judge whether the target video is suitable for being recommended to users, where the content text corresponds to the "audio-to-text" transcript of the target video and is obtained by performing text conversion on the audio data of the target video.
Step 302: and carrying out sentence smoothness detection on the content text to obtain the sentence smoothness of the corresponding content text.
In practical applications, a target video includes video pictures and a descriptive segment describing the video pictures. The content text of a target video that includes a descriptive segment reads smoothly and forms complete sentences, whereas the content text of a target video that does not include a descriptive segment corresponds to the recognition result of the background sound, so its sentences are not smooth and do not form complete sentences.
In actual implementation, before sentence smoothness detection is performed on the content text, a sentence smoothness detection model needs to be trained. When training the model, a large number of sample texts are used as the training set, where the sample texts are all standard texts containing descriptive segments, and the sentence smoothness detection model is obtained by training on the sample texts using language-model training tools such as the SRI Language Modeling Toolkit (SRILM) and KenLM.
In some embodiments, the information flow platform may perform sentence smoothness detection on the content text in the following manner to obtain a sentence smoothness corresponding to the content text:
performing clause processing on the content text to obtain a plurality of corresponding clause texts; respectively inputting each clause text into the sentence smoothness detection model to obtain a first sentence smoothness score corresponding to the clause text; and weighting the first sentence smoothness scores corresponding to the clause texts to obtain a second sentence smoothness score corresponding to the content text, where the second sentence smoothness score is used to represent the sentence smoothness of the content text.
Here, the relatively long content text is divided into a plurality of short clause texts based on the punctuation marks in the content text, the obtained clause texts are input into the trained sentence smoothness detection model to obtain a plurality of corresponding sentence smoothness scores, and the average of the sentence smoothness scores corresponding to the clause texts is taken as the smoothness score of the content text.
For example, suppose the content text of a target video is subjected to clause processing to obtain 8 clause texts, and the 8 clause texts are respectively input into the trained sentence smoothness detection model to obtain the sentence smoothness scores of the corresponding clause texts: [S1, S2, S3, S4, S5, S6, S7, S8]. The sentence smoothness score of the content text of the target video is then S = (S1 + ... + S8) / 8, where the sentence smoothness score S is used to represent the sentence smoothness of the content text of the target video.
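For illustration only (this sketch is not part of the patent disclosure): assuming Python, a hypothetical `smoothness_model` object exposing a `score()` method, and an assumed punctuation-based delimiter set, the clause splitting and score averaging of this step could look like:

```python
import re

def content_smoothness(content_text, smoothness_model):
    """Average the per-clause smoothness scores into a content-level score."""
    # Split on sentence-ending punctuation (the exact delimiter set is an assumption).
    clauses = [c.strip() for c in re.split(r"[。！？；!?.;]", content_text) if c.strip()]
    if not clauses:
        return 0.0
    # First sentence smoothness scores: one per clause text.
    first_scores = [smoothness_model.score(c) for c in clauses]
    # Second sentence smoothness score: the equally weighted average S = (S1+...+Sn)/n.
    return sum(first_scores) / len(first_scores)
```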
Step 303: when it is determined, based on the sentence smoothness, that a descriptive segment describing the video pictures exists in the target video, obtain a plurality of clause texts corresponding to the content text; the descriptive segment includes a sub-segment whose content subject is unrelated to the content subject of the target video.
In some embodiments, the information flow platform may determine that a descriptive segment describing a video picture is present in the target video by:
obtaining a sentence smoothness reference score; acquiring the ratio of the second sentence smoothness score to the sentence smoothness reference score; and when the ratio is larger than the ratio threshold value, determining that the descriptive section for describing the video picture exists in the target video.
Here, the sentence smoothness reference score is the average of the sentence smoothness scores obtained by inputting a preset number of sample texts into the trained smoothness detection model. Assuming the sentence smoothness reference score is S0, the sentence smoothness score S of the corresponding content text is compared against the sentence smoothness reference score S0 to determine whether the target video contains a descriptive segment.
For example, based on empirical knowledge, when the difference between S and S0 is greater than 20%, i.e., S/S0 < 0.8, the target video is considered to contain no descriptive segment describing the video pictures; when the difference between S and S0 is less than or equal to 20%, i.e., S/S0 >= 0.8, a descriptive segment describing the video pictures is considered to exist in the target video.
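A minimal sketch of this decision rule, under the same assumptions as the previous sketch (the 0.8 ratio threshold is the empirical value quoted above):

```python
def has_descriptive_segment(s, s0, ratio_threshold=0.8):
    """True if the content-text smoothness score s is close enough to the
    reference score s0 (difference <= 20%, i.e. s/s0 >= 0.8) for the target
    video to be considered to contain a descriptive (narration) segment."""
    return s / s0 >= ratio_threshold
```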
In practical applications, when a descriptive segment describing the video pictures exists in the target video, the descriptive segment may further include a sub-segment whose content subject is unrelated to the content subject of the target video; the sub-segment refers to a segment with relatively little correlation to the content subject of the target video. For example, if the target video contains a descriptive segment introducing Beijing culture, and before introducing Beijing culture there is a passage introducing matters unrelated to Beijing culture, such as Beijing traffic or the environment, then that passage can be regarded as the sub-segment.
Step 304: and respectively carrying out similarity matching on each sentence text and the title text to obtain a plurality of corresponding similarity values.
In some embodiments, the information flow platform may obtain the corresponding plurality of similarity values by:
carrying out vector conversion on the title text to obtain a corresponding title vector; respectively carrying out vector conversion on each sentence text to obtain corresponding text vectors; and respectively carrying out similarity matching on each text vector and the title vector to obtain corresponding similarity values.
In practical applications, to understand complex text, the text needs to be encoded into a form that a computer can read and understand. During encoding, it is desirable that similarity between words is preserved across sentences, and the vector representation of words is the basis of machine learning and deep learning. Therefore, to obtain semantic representations containing rich semantic information, the title text and each clause text are respectively input into a general-purpose semantic representation model such as BERT (Bidirectional Encoder Representations from Transformers).
Referring to fig. 4, fig. 4 is a schematic flow chart of obtaining the semantic representation of a text according to an embodiment of the present invention. As shown in fig. 4, a one-dimensional vector for each word in the text is used as the input of the BERT model, and after processing by the BERT model, a vector representation fused with full-text semantic information is obtained for each input word. Therefore, the title text is input into the BERT model to obtain the corresponding title vector; each clause text is respectively input into the BERT model to obtain the corresponding text vectors; and then similarity matching is performed between the title vector and the text vector corresponding to each clause text to obtain the corresponding similarity values.
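A hedged sketch of this step: here `encode` stands for any sentence encoder mapping text to a 1-D vector (for example, a pooled BERT output); the function name and interface are assumptions for illustration, not a fixed API.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def clause_title_similarities(title_text, clauses, encode):
    """Match each clause vector against the title vector, in clause order."""
    title_vec = encode(title_text)
    return [cosine(encode(c), title_vec) for c in clauses]
```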
Step 305: and determining the relative relation between the time length of the sub-segment in the descriptive segment and the time length of the target video based on the similarity value.
In some embodiments, the information flow platform may determine the relative relationship between the duration of the sub-segment in the descriptive segment and the duration of the target video by:
sequencing the similarity values based on the sequence of each clause text in the content text to obtain a first sequence containing a first number of similarity values and a second sequence containing a second number of similarity values; and determining the relative relation between the time length of the sub-segment in the descriptive segment and the time length of the target video based on the first sequence and the second sequence.
In practical implementation, first, the similarity values between the title vector of the title text and the text vectors of the clause texts are arranged according to the order of the clause texts in the content text to obtain a corresponding similarity value sequence. Next, a first sequence containing a first number of similarity values and a second sequence containing a second number of similarity values may be taken in order from the sequence; the similarity value sequence may also be segmented according to an empirical value to obtain the first sequence and the second sequence. Finally, the relative relationship between the duration of the sub-segment of the descriptive segment and the duration of the target video is determined based on the first sequence and the second sequence.
For example, suppose the content text of the target video has 10 clause texts in total. The similarity value sequence consisting of the similarity values between each clause text and the title text, obtained by computing the cosine similarity between the text vector of each clause text and the title vector of the title text, is [score1, score2, ..., score10], where score1 is the cosine similarity between the first clause text and the title text, score2 is the cosine similarity between the second clause text and the title text, and so on. Empirically, the first three tenths of the similarity value sequence can be taken as the first sequence [score1, score2, score3] and the remaining seven tenths as the second sequence [score4, score5, ..., score10]. It should be noted that, besides the above, the similarity value sequence may be divided into two sequences in other feasible ways.
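A minimal sketch of the split, assuming the empirical three-tenths split point mentioned above (any other split fraction could be substituted):

```python
def split_similarities(scores, head_fraction=0.3):
    """Split the ordered similarity sequence into a first (head) sequence and a
    second (tail) sequence; head_fraction = 0.3 reproduces the 3/10 example."""
    k = max(1, round(len(scores) * head_fraction))
    return scores[:k], scores[k:]
```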
In some embodiments, the information flow platform may determine the relative relationship between the duration of the sub-segment in the descriptive segment and the duration of the target video based on the first sequence and the second sequence by:
extracting a maximum similarity value from the first sequence as a first similarity value, and extracting a maximum similarity value from the second sequence as a second similarity value; comparing the first similarity value with the second similarity value to obtain a comparison result; and determining the relative relation between the time length of the sub-segment in the descriptive segment and the time length of the target video based on the comparison result.
Here, again taking the above first sequence [score1, score2, score3] and second sequence [score4, score5, ..., score10] as an example: the maximum similarity value is extracted from the first sequence as the first similarity value top = max([score1, score2, score3]), and the maximum similarity value is extracted from the second sequence as the second similarity value end = max([score4, score5, ..., score10]). The ratio a = top/end of the first similarity value top to the second similarity value end is then obtained, where a characterizes the relative relationship between the duration of the sub-segment of the descriptive segment and the duration of the target video. The larger a is, the shorter the duration of the sub-segment is relative to the duration of the target video: since the descriptive segment describes the video pictures of the target video while the sub-segment has no correlation with the content subject of the target video, a large a means that the clauses at the beginning of the target video are already strongly related to the title, and hence the padding sub-segment is short.
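Expressed as code under the same assumptions as the earlier sketches:

```python
def max_ratio(first_seq, second_seq):
    """a = top / end. A larger a means the opening clauses already match the
    title well, i.e. the padding sub-segment is short relative to the video."""
    top = max(first_seq)   # first similarity value
    end = max(second_seq)  # second similarity value
    return top / end
```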
In some embodiments, the information flow platform may further determine a relative relationship between the duration of the sub-segment in the descriptive segment and the duration of the target video based on the first sequence and the second sequence by:
carrying out weighted averaging on the similarity values of the first quantity to obtain a corresponding third similarity value, and carrying out weighted averaging on the similarity values of the second quantity to obtain a corresponding fourth similarity value; comparing the third similarity value with the fourth similarity value to obtain a comparison result; and determining the relative relation between the time length of the sub-segment in the descriptive segment and the time length of the target video based on the comparison result.
Here, again taking the above first sequence [score1, score2, score3] and second sequence [score4, score5, ..., score10] as an example: the first number of similarity values in the first sequence are weighted and averaged to obtain the corresponding third similarity value sim1 = (score1 + score2 + score3) / 3, and the second number of similarity values in the second sequence are weighted and averaged to obtain the corresponding fourth similarity value sim2 = (score4 + ... + score10) / 7. The ratio b = sim1/sim2 of the third similarity value sim1 to the fourth similarity value sim2 is then obtained, where b characterizes the relative relationship between the duration of the sub-segment of the descriptive segment and the duration of the target video; the larger b is, the shorter the duration of the sub-segment is relative to the duration of the target video.
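A sketch with equal weights (a weighted variant would replace the plain means with weighted averages):

```python
def mean_ratio(first_seq, second_seq):
    """b = sim1 / sim2, comparing the average title-relatedness of the head
    of the video against that of the tail."""
    sim1 = sum(first_seq) / len(first_seq)    # third similarity value
    sim2 = sum(second_seq) / len(second_seq)  # fourth similarity value
    return sim1 / sim2
```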
In some embodiments, the information flow platform may further determine a relative relationship between the duration of the sub-segment in the descriptive segment and the duration of the target video based on the first sequence and the second sequence by:
sequencing the similarity values based on the sequence of each clause text in the content text to obtain a corresponding similarity value sequence; sequentially comparing the similarity values in the similarity sequence with a similarity threshold value, and determining the sequence number of the first similarity value exceeding the similarity threshold value in the similarity sequence; and determining the relative relation between the time length of the sub-segment in the descriptive segment and the time length of the target video based on the sequence number and the similarity value sequence.
Here, the similarity value sequence [score1, score2, ..., score10], composed of the similarity values between each of the 10 clause texts of the content text of the above target video and the title text, is again taken as an example. Assuming the similarity threshold is score, the similarity values scorei in the sequence are compared in turn with the similarity threshold score. If a similarity value is greater than the similarity threshold, the corresponding clause text is considered to be text describing the video pictures of the target video, i.e., text related to the subject content; if a similarity value is smaller than the similarity threshold, the corresponding clause text is considered not to be text describing the video pictures of the target video, i.e., text unrelated to the subject content.
More specifically, assume the similarity value sequence is [0.12, 0.2, 0.3, 0.4, ..., 0.8, 0.9, 0.8, ..., 0.4] and the similarity threshold is 0.7. Comparing the similarity values in the sequence with the similarity threshold in turn, the sequence number of the first similarity value exceeding the similarity threshold is 6, i.e., the similarity value between the 6th clause text and the title text is the first to exceed the threshold. The 6th clause text can therefore be regarded as text related to the subject content of the target video, while the first five clause texts are texts unrelated to the subject content, i.e., the first five clause texts constitute the padding sub-segment. The relative relationship between the duration of the sub-segment of the descriptive segment and the duration of the target video can then be determined to be 5/10: the duration of the sub-segment accounts for half of the duration of the target video. It can be seen that the larger the ratio of the sequence number of the first similarity value exceeding the similarity threshold to the total number of values in the similarity value sequence, the longer the duration of the sub-segment of the descriptive segment relative to the duration of the target video.
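The threshold-crossing rule of this paragraph, as a sketch (0.7 is the example threshold; sequence numbers are 1-based as in the text above):

```python
def padding_fraction(scores, threshold=0.7):
    """Return (sequence number of the first score above the threshold,
    fraction of the video taken up by the padding sub-segment)."""
    for i, s in enumerate(scores, start=1):
        if s > threshold:
            return i, (i - 1) / len(scores)
    return None, 1.0  # no clause matches the title: all padding

# Example (values illustrative): a 10-value sequence whose first value above
# 0.7 sits at position 6 yields (6, 0.5), i.e. the padding is half the video.
```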
In some embodiments, the information flow platform may further obtain a ratio of a duration of the sub-segment in the descriptive segment to a duration of the target video; and when the ratio does not exceed the ratio threshold, adding the target video into a video library to be recommended.
In practical applications, whether the duration of the sub-segment of the descriptive segment in the target video is too long, i.e., whether the target video contains an overly long description unrelated to its subject content, can be determined according to the relative relationship between the duration of the sub-segment and the duration of the target video. When it is determined that no overly long unrelated description exists in the target video, the target video is stored in the video library to be recommended so as to be recommended to users; when it is determined that an overly long unrelated description exists, the target video is set as not recommended.
For example, if a = top/end is less than or equal to 0.8, it is determined that the target video contains an overly long description unrelated to its subject content, and the target video is set as not recommended; if a = top/end is greater than 0.8, it is determined that no overly long unrelated description exists, and the target video is stored in the video library to be recommended so as to be recommended to users.
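The recommendation gate of this example as a sketch (0.8 is the empirical threshold quoted above; the ratio a comes from max_ratio in an earlier sketch):

```python
def is_recommendable(a, threshold=0.8):
    """True if the padding is short enough for the target video to be added
    to the to-be-recommended video library; False means 'not recommended'."""
    return a > threshold
```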
Referring to fig. 5, fig. 5 is a schematic diagram of a recommendation system for target videos according to an embodiment of the present invention. As shown in fig. 5, after the video data of the target video is processed by the information flow platform according to the video data processing method provided by the embodiment of the present invention, the relative relationship between the duration of the sub-segment of the descriptive segment and the duration of the target video is obtained, and whether the target video contains an overly long description unrelated to its subject content is determined based on that relative relationship. When it is determined that no such overly long description exists, the target video is considered to meet the recommendation condition and is stored in the video library to be recommended, for pushing to information stream products such as browsers or news flash applications.
In the above manner, sentence smoothness detection is performed on the content text of the target video to determine whether a descriptive segment describing the video pictures exists in the target video. When the descriptive segment exists, clause processing is performed on the content text to obtain a plurality of clause texts corresponding to the content text, similarity matching is performed between each clause text and the title text of the target video, and the relative relationship between the duration of the sub-segment of the descriptive segment and the duration of the target video is determined. This achieves proper matching between a long text and a short text and effectively identifies whether the padding of the target video is too long. When it is identified that the padding of the target video is too long, i.e., the descriptive segment of the target video contains a long description unrelated to the subject content of the target video, the target video is set as not recommended; when it is identified that the padding of the target video is not too long, the target video is recommended to users, so that users can quickly find the points of interest when watching the received target video, improving the user experience.
Next, the description of the video data processing method provided by the embodiment of the present invention continues. The method may be implemented by a terminal or an information flow platform alone, or by an information flow platform and a terminal cooperatively. Taking cooperative implementation by a terminal provided with an application client and an information flow platform as an example, fig. 6 is an optional flowchart of the video data processing method provided by the embodiment of the present invention. Referring to fig. 6, the method includes:
step 601: the first client side responds to uploading operation of a user for the target video and acquires the target video.
Here, the first client is located on the target video distribution side, and in practical application, the user opens the first client on the user terminal, records and distributes the target video, or distributes the recorded target video.
Step 602: and the application client sends the target video data to the information flow platform.
Step 603: and the information flow platform acquires the title text and the content text of the target video.
Here, the information flow platform constructs an algorithm model by relying on the target video data in the information flow to recommend the content of the information flow. The information flow platform utilizes specific field information points in the target video data, such as title texts and content texts, to construct an algorithm model so as to judge whether the target video is suitable for being recommended to a user, wherein the content texts correspond to 'audio-text' of the target video and are obtained by performing text conversion on audio data of the target video.
Step 604: the information flow platform performs sentence smoothness detection on the content text to obtain the sentence smoothness score of the content text.
Here, the relatively long content text is divided into a plurality of short clause texts based on the punctuation marks in the content text; the obtained clause texts are respectively input into a trained sentence smoothness detection model to obtain a plurality of corresponding sentence smoothness scores, and the average value of the sentence smoothness scores corresponding to the clause texts is taken as the smoothness score of the content text.
Step 605: and the information flow platform acquires the statement smoothness reference score.
Here, the sentence smoothness reference score is an average value of sentence smoothness scores obtained by inputting a preset number of sample texts into a trained smoothness detection model, where the sample texts are all standard texts containing descriptive segments.
Step 606: the information flow platform acquires the ratio of the sentence smoothness score of the content text (i.e., the second sentence smoothness score) to the sentence smoothness reference score.
Step 607: when the ratio is larger than the ratio threshold value, the information flow platform determines that the descriptive segment for describing the video picture exists in the target video.
For example, assume that the sentence smoothness reference score is S0 and the sentence smoothness score of the content text is S; the ratio of S to S0 is then used to determine whether the target video contains a descriptive segment. Based on empirical knowledge, when the difference between S and S0 is greater than 20%, i.e., S/S0 < 0.8, the target video is considered to have no descriptive segment for describing the video picture; when the difference between S and S0 is less than or equal to 20%, i.e., S/S0 ≥ 0.8, a descriptive segment for describing the video picture is considered to exist in the target video.
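For illustration, this decision rule can be sketched in Python (a minimal sketch; the function and variable names are hypothetical, only the 0.8 ratio threshold comes from the description above, and both scores are assumed to be positive):

def has_descriptive_segment(s, s0, ratio_threshold=0.8):
    # s is the sentence smoothness score of the content text,
    # s0 the sentence smoothness reference score (both assumed positive);
    # a ratio of at least 0.8, i.e. a difference of at most 20%,
    # indicates that a descriptive segment (commentary) exists
    return s / s0 >= ratio_threshold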
Step 608: and the information flow platform acquires a plurality of clause texts corresponding to the content texts.
The descriptive segment comprises a sub-segment whose content subject is independent of the content subject of the target video; the sub-segment refers to a segment in the descriptive segment that is weakly related or unrelated to the content subject of the target video.
Step 609: and the information flow platform respectively matches the similarity of each clause text with the title text to obtain corresponding similarity values.
Here, the title text and each clause text are respectively input into a general semantic representation model, such as a BERT model, to obtain a corresponding title vector and a text vector corresponding to each clause text; similarity matching is then performed between the title vector and the text vector of each clause text to obtain the corresponding similarity values.
Step 610: and the information flow platform sorts the similarity values based on the sequence of each clause text in the content text to obtain a first sequence containing the similarity values of a first quantity and a second sequence containing the similarity values of a second quantity.
Here, the sum of the first number and the second number is the total number of similarity values, and the proportional relationship between the first number and the second number may be set according to an empirical value.
Step 611: the information flow platform extracts a maximum similarity value from the first sequence as a first similarity value and extracts the maximum similarity value from the second sequence as a second similarity value.
Step 612: and the information flow platform compares the first similarity value with the second similarity value to obtain a comparison result.
Here, the first similarity value may be divided by the second similarity value to obtain a ratio of the first similarity value to the second similarity value.
Step 613: and the information flow platform determines the relative relation between the duration of the sub-segments in the descriptive segments and the duration of the target video based on the comparison result.
Here, the larger the ratio, the shorter the duration of the sub-segment in the descriptive segment relative to the duration of the target video, that is, the shorter the descriptive sub-segment in the target video that has no correlation with the content subject of the target video; conversely, the smaller the ratio, the longer the duration of the sub-segment relative to the duration of the target video, that is, the longer that descriptive sub-segment.
Step 614: when it is determined that the ratio of the duration of the sub-segment in the descriptive segment to the duration of the target video does not exceed the proportion threshold, the target video is stored in the video library to be recommended, so as to recommend the target video to a second client.
Here, the second client is located on the receiving side of the target video. The larger the ratio of the duration of the sub-segment in the descriptive segment to the duration of the target video, the longer the descriptive sub-segment in the target video that has no correlation with the content subject of the target video, that is, the longer the padding; the smaller the ratio, the shorter that descriptive sub-segment, that is, the shorter the padding. When the ratio does not exceed the proportion threshold, the recommendation condition is considered to be met, and the target video meeting the push condition is stored in the video library to be recommended, to be pushed to an information flow product such as a browser or a flash newspaper for the user to watch.
Step 615: the second client plays the target video.
In the following, an exemplary application of the embodiments of the present invention in a practical application scenario will be described.
Short video is one of the main product lines of the current information flow and has become one of the important ways for users to acquire information, entertainment and the like. However, whether high-quality short videos can be provided to users has become one of the core pain points of information flow products such as 'QQ view point', 'view point video' and 'QQ browser'.
Referring to fig. 7, fig. 7 is a schematic view of a video watching process provided by an embodiment of the present invention, where the video watching process of an information flow user includes:
step 701: the information flow user obtains the title of the target video.
Step 702: and determining whether the target video has the interest point according to the title of the target video.
Here, when it is determined that there is a point of interest of itself according to the title of the target video, step 703 is performed; when it is determined that there is no own point of interest according to the title of the target video, step 705 is performed.
Step 703: and entering a target video, and searching interest points from the target video.
Step 704: judging whether the padding of the target video is too long.
Here, when the padding in the target video is too long, step 705 is performed; when the padding is not too long, step 706 is performed.
Step 705: the user is bored and does not watch the target video.
Step 706: the user accepts the target video and continues to watch it.
As can be seen from fig. 7, if the padding in the target video is too long, it is difficult for the user to quickly acquire points of interest from the short video, which annoys the user. Because of the weak correlation of the information flow product, its dependence on the algorithm is greater than that of other products, and when users use a short video product, their tolerance for the padding length of short videos differs with personality, environment and the like. Therefore, whether the recommendation side can formulate a better recommendation strategy and provide high-quality short videos to users has become one of the core pain points of current information flow products; the related art cannot determine the relative relationship between the padding duration and the video duration, and thus cannot effectively identify whether the padding of a short video is too long, bringing a poor experience to users.
Based on this, an embodiment of the present invention provides a video data processing method, which performs sentence smoothness detection on the content text (i.e., the audio-to-text) of the target video to determine whether a descriptive segment (i.e., commentary) for describing the video picture exists in the target video, where the descriptive segment includes a sub-segment (i.e., the padding) whose content subject is independent of the content subject of the target video. When the descriptive segment exists, clause processing is performed on the content text to obtain a plurality of clause texts corresponding to the content text, and similarity matching is performed between each clause text and the title text of the target video to determine the relative relationship between the duration of the sub-segment in the descriptive segment and the duration of the target video, thereby realizing proper matching between the long text and the short text and effectively identifying whether the padding of the target video is too long.
Referring to fig. 8, fig. 8 is an optional flowchart illustrating a method for processing video data according to an embodiment of the present invention, and as shown in fig. 8, the method for processing video data according to the embodiment of the present invention includes:
step 801: and the information flow platform acquires target video data in the information flow product.
Step 802: the information flow platform acquires the title and content text of the target video which can be utilized.
Step 803: and constructing a content matching model based on the acquired title and content text of the target video.
Step 804: based on the constructed content matching model, obtaining a recognition result of whether the padding of the target video is too long.
As shown in fig. 8, the overall flow of the video data processing method provided by the embodiment of the present invention includes: identifying whether the target video contains a descriptive segment (i.e., commentary) for describing the video picture; splitting the content text (i.e., the audio-to-text) of the target video into clause texts; matching the clause texts against the title text of the target video; and a decision mechanism for overlong target video padding. These are introduced one by one below:
1. Identifying whether the target video contains commentary using a language model
In practical applications, the target video includes a video picture and a descriptive segment (i.e., commentary) describing the video picture. The content text of a target video containing the descriptive segment (i.e., commentary) has smooth sentences that form complete sentences, while the content text of a target video not containing the descriptive segment (i.e., commentary) corresponds to the recognition result of background sound, so its sentences are not smooth and cannot form complete sentences. In actual implementation, the process of identifying the content text of the target video is as follows:
1) constructing a training set for the sentence smoothness detection model (i.e., the language model) using article data historically exported from the content center;
2) training a language model using kenlm;
3) randomly selecting 2000 pieces of short video data, and from these selecting 500 pieces of short video audio-to-text as basic data, where the basic data are required to be standard data containing commentary;
4) calculating the language model average score S0 of the basic data using the trained kenlm language model;
5) calculating the language model score Si of the audio-to-text of the target video; if the difference between Si and S0 is greater than 20%, i.e., Si < 0.8 × S0, it is determined that the target video does not contain commentary.
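As an illustration, the scoring in steps 4) and 5) might be sketched as follows in Python, assuming the kenlm model has been trained and saved in ARPA format (the file name, the word segmentation of the clause texts and the length normalization are assumptions, not part of the embodiment; kenlm's score interface returns a log10 probability):

import kenlm

# load the trained sentence smoothness detection model (the path is an assumption)
model = kenlm.Model("smoothness.arpa")

def smoothness_score(clauses):
    # score each clause text (assumed to be segmented into space-separated
    # words), normalize by length so long and short clauses are comparable,
    # and average over all clauses
    scores = [model.score(c, bos=True, eos=True) / max(len(c.split()), 1)
              for c in clauses]
    return sum(scores) / len(scores)

S0 would then be the average of smoothness_score over the 500 pieces of basic data, and Si the score of the audio-to-text of the target video, compared according to the rule in step 5).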
In practical applications, when the target video contains commentary, the commentary may contain a descriptive sub-segment (i.e., padding) whose content subject is independent of the content subject of the short video; therefore, whether the target video contains padding is detected next.
2. Splitting the content text of the target video into clause texts
Here, the longer content text may be divided into a plurality of short clause texts based on the punctuation marks in the content text (i.e., the audio-to-text), which may be implemented, for example, by the re module in python:

import re

# split the content text into clause texts at sentence-ending punctuation
sent_segs = re.findall(".*?[。!?]", content)

where content is the content text and sent_segs is the resulting list of clause texts.
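For example, with an illustrative input (not taken from the embodiment):

content = "大家好。今天介绍一个小技巧!准备好了吗?"
sent_segs = re.findall(".*?[。!?]", content)
# sent_segs == ["大家好。", "今天介绍一个小技巧!", "准备好了吗?"]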
3. Vector model for the clause texts of the content text and the title text of the target video
1) taking each clause text of the content text, as well as the title text, as the input of a BERT model, in the form:
[CLS] + audio-to-text clause + [SEP]
2) calculating sentence vectors using the BERT model, where the inputs are the clause texts of the content text and the title text, and the outputs are the corresponding vectors;
3) taking the vector corresponding to [CLS] as the final text vector to be output.
Here, since the BERT model has the task of predicting words, other words need to be considered during prediction; and since [CLS] itself carries no obvious semantic information, it fuses the semantic information of each character/word in the text more fairly. In practical application, the one-dimensional vector of each character/word in the text is used as the input of the BERT model, and after processing by the BERT model, a vector representation fused with full-text semantic information is obtained for each input character. Therefore, the title text is input into the BERT model to obtain the corresponding title vector, and each clause text is respectively input into the BERT model to obtain the corresponding text vectors.
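As an illustration, extracting the [CLS] vectors and computing cosine similarity could be sketched as follows in Python, using the Hugging Face transformers library and the bert-base-chinese checkpoint as an assumed stand-in for the general semantic representation model (the model name, truncation and pooling details are assumptions, not part of the embodiment):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def cls_vector(text):
    # the tokenizer wraps the text as [CLS] ... [SEP] automatically
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = bert(**inputs)
    # take the hidden state at position 0, i.e. the [CLS] token,
    # as the vector representation of the whole text
    return outputs.last_hidden_state[0, 0]

def cosine_similarity(u, v):
    # cosine similarity between two 1-D tensors
    return float(torch.dot(u, v) / (u.norm() * v.norm()))

The title vector is obtained by applying cls_vector to the title text, and the text vectors by applying it to each clause text; cosine_similarity then yields the similarity values used below.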
4. Decision mechanism for overlong target video padding
1) Respectively calculating cosine similarity between the title vector and the text vector corresponding to each clause text to obtain corresponding cosine similarity values;
2) sequencing the similarity values based on the sequence of each clause text in the content text to obtain a similarity value sequence;
3) comparing the highest similarity value between the first three-tenths of the clause texts and the title text in the similarity sequence with the highest similarity value between the last seven-tenths of the clause texts and the title text to obtain a ratio, and identifying a target video whose ratio is less than or equal to 0.8 as having overlong padding.
Here, in practical applications, the similarity sequence is empirically divided into a first sequence containing a first number of similarity values and a second sequence containing a second number of similarity values; extracting a maximum similarity value from the first sequence as a first similarity value, extracting a maximum similarity value from the second sequence as a second similarity value, and comparing the first similarity value with the second similarity value to obtain a ratio; based on the ratio, the relative relationship between the duration of the sub-segment (i.e., the pad) in the descriptive segment and the duration of the target video is determined.
For example, suppose the content text of the target video has 10 clause texts in total. By calculating the cosine similarity between the text vector of each clause text and the title vector of the title text, the sequence of similarity values between the clause texts and the title text is obtained as [score1, score2, ..., score10], where score1 is the cosine similarity between the first clause text and the title text, score2 is the cosine similarity between the second clause text and the title text, and so on. Empirically, the first three-tenths of the similarity value sequence can be taken as the first sequence, [score1, score2, score3], and the last seven-tenths as the second sequence, [score4, score5, ..., score10].
The maximum similarity value is extracted from the first sequence as the first similarity value, top = max([score1, score2, score3]), and the maximum similarity value is extracted from the second sequence as the second similarity value, end = max([score4, score5, ..., score10]); the ratio a = top / end is then obtained, where a represents the relative relationship between the duration of the sub-segment in the descriptive segment and the duration of the target video. The larger a is, the shorter the duration of the sub-segment relative to the duration of the target video; since the descriptive segment is a segment for describing the video picture of the target video and the sub-segment is a segment having no correlation with the content subject of the target video, a larger a also means that the descriptive sub-segment (i.e., the padding) having no correlation with the content subject of the target video is shorter. Accordingly, the smaller a is, the longer the padding in the target video.
If a is less than or equal to 0.8, it is determined that the target video contains an overlong description irrelevant to its subject content, that is, the padding of the target video is identified as too long, and the target video is set as not recommended; if a is greater than 0.8, it is determined that no overlong description irrelevant to the subject content exists, that is, the padding of the target video is identified as not too long, and the target video is stored in the video library to be recommended, to be pushed to an information flow product such as a browser or a flash newspaper.
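A minimal sketch of this decision mechanism, assuming the cosine similarity values have already been computed in clause order (the 3/10 split point and the 0.8 threshold follow the example above; at least two clause texts are assumed):

def padding_too_long(scores, head_ratio=0.3, threshold=0.8):
    # split the ordered similarity values into a head sequence
    # (first three-tenths) and a tail sequence (last seven-tenths)
    split = max(int(len(scores) * head_ratio), 1)
    top = max(scores[:split])  # first similarity value
    end = max(scores[split:])  # second similarity value
    a = top / end
    # a <= 0.8 identifies the padding of the target video as too long
    return a <= threshold

For the ten-clause example above, padding_too_long returns True exactly when a = top / end is at most 0.8, in which case the target video is set as not recommended.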
With the video data processing method provided by the embodiment of the present invention, videos with overlong padding are identified among short videos and set as not recommended in information flow products (viewpoint video, browser, flash newspaper), which can effectively improve the user experience.
Continuing with the exemplary structure in which the video data processing device 255 provided by the embodiment of the present invention is implemented as software modules, in some embodiments, as shown in fig. 2 and 9, the software modules of the video data processing device 255 stored in the memory 250 may include: a first obtaining module 2551, a detecting module 2552, a second obtaining module 2553, a matching module 2554 and a determining module 2555.
A first obtaining module 2551, configured to obtain a title text and a content text of a target video;
a detecting module 2552, configured to perform sentence smoothness detection on the content text, to obtain a sentence smoothness corresponding to the content text;
a second obtaining module 2553, configured to, when it is determined based on the sentence smoothness that a descriptive segment for describing a video picture exists in the target video, obtain a plurality of clause texts corresponding to the content text; the descriptive segment comprises a sub-segment whose content subject is independent of the content subject of the target video;
a matching module 2554, configured to perform similarity matching between each clause text and the title text, respectively, to obtain a plurality of corresponding similarity values;
a determining module 2555, configured to determine, based on the similarity value, a relative relationship between a duration of a sub-segment in the descriptive segment and a duration of the target video.
In some embodiments, the detection module is further configured to perform clause processing on the content text to obtain a plurality of corresponding clause texts;
inputting each sentence text into a sentence smoothness detection model respectively to obtain a first sentence smoothness score corresponding to the sentence text;
and weighting the first sentence smoothness scores corresponding to the sentence dividing texts to obtain second sentence smoothness scores corresponding to the content texts, wherein the second sentence smoothness scores are used for representing the sentence smoothness of the content texts.
In some embodiments, the second obtaining module is further configured to obtain a sentence smoothness reference score;
acquiring the ratio of the second sentence smoothness score to the sentence smoothness reference score;
when the ratio is larger than a ratio threshold value, determining that a descriptive section for describing a video picture exists in the target video.
In some embodiments, the matching module is further configured to perform vector conversion on the caption text to obtain a corresponding caption vector;
respectively carrying out vector conversion on each sentence text to obtain corresponding text vectors;
and respectively carrying out similarity matching on each text vector and the title vector to obtain corresponding similarity values.
In some embodiments, the determining module is further configured to rank the similarity values based on an order of each of the clause texts in the content text, so as to obtain a first sequence including a first number of similarity values and a second sequence including a second number of similarity values;
and determining the relative relation between the time length of the sub-segment in the descriptive segment and the time length of the target video based on the first sequence and the second sequence.
In some embodiments, the determining module is further configured to extract a maximum similarity value from the first sequence as a first similarity value and extract a maximum similarity value from the second sequence as a second similarity value;
comparing the first similarity value with the second similarity value to obtain a comparison result;
and determining the relative relation between the time length of the sub-segments in the descriptive segments and the time length of the target video based on the comparison result.
In some embodiments, the determining module is further configured to perform weighted averaging on the first number of similarity values to obtain a corresponding third similarity value, and perform weighted averaging on the second number of similarity values to obtain a corresponding fourth similarity value;
comparing the third similarity value with the fourth similarity value to obtain a comparison result;
and determining the relative relation between the time length of the sub-segments in the descriptive segments and the time length of the target video based on the comparison result.
In some embodiments, the determining module is further configured to sort the similarity values based on an order of each of the clause texts in the content text to obtain a corresponding similarity value sequence;
sequentially comparing the similarity values in the similarity sequence with a similarity threshold value, and determining the sequence number of the first similarity value exceeding the similarity threshold value in the similarity value sequence;
and determining the relative relation between the time length of the sub-segment in the descriptive segment and the time length of the target video based on the sequence number and the similarity value sequence.
In some embodiments, the apparatus further includes a recommendation module, configured to obtain a ratio of a duration of a sub-segment in the descriptive segment to a duration of the target video;
and when the ratio does not exceed a proportion threshold, adding the target video into a video library to be recommended.
An embodiment of the present invention provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the video data processing method provided by the embodiment of the invention when executing the executable instructions stored in the memory.
The embodiment of the invention provides a storage medium, which stores executable instructions and is used for causing a processor to execute the executable instructions so as to realize the video data processing method provided by the embodiment of the invention.
In some embodiments, the storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
The above description is only an example of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present invention are included in the protection scope of the present invention.

Claims (12)

1. A method for processing video data, the method comprising:
acquiring a title text and a content text of a target video;
detecting the sentence smoothness of the content text to obtain the sentence smoothness corresponding to the content text;
when it is determined, based on the sentence smoothness, that a descriptive segment for describing a video picture exists in the target video, acquiring a plurality of clause texts corresponding to the content text; the descriptive segment comprises a sub-segment of which the content subject is independent of the content subject of the target video;
respectively carrying out similarity matching on each clause text and the title text to obtain a plurality of corresponding similarity values;
and determining the relative relation between the time length of the sub-segment in the descriptive segment and the time length of the target video based on the similarity value.
2. The method of claim 1, wherein the detecting the sentence smoothness of the content text to obtain the sentence smoothness corresponding to the content text comprises:
sentence dividing processing is carried out on the content texts to obtain a plurality of corresponding sentence dividing texts;
inputting each sentence text into a sentence smoothness detection model respectively to obtain a first sentence smoothness score corresponding to the sentence text;
and weighting the first sentence smoothness scores corresponding to the sentence dividing texts to obtain second sentence smoothness scores corresponding to the content texts, wherein the second sentence smoothness scores are used for representing the sentence smoothness of the content texts.
3. The method of claim 2, wherein the determining that a descriptive segment for describing a video picture exists in the target video based on the sentence smoothness comprises:
obtaining a sentence smoothness reference score;
acquiring the ratio of the second sentence smoothness score to the sentence smoothness reference score;
when the ratio is larger than a ratio threshold value, determining that a descriptive section for describing a video picture exists in the target video.
4. The method of claim 1, wherein said similarity matching each of said clause texts with said title text, respectively, to obtain a corresponding plurality of similarity values, comprises:
performing vector conversion on the title text to obtain a corresponding title vector;
respectively carrying out vector conversion on each sentence text to obtain corresponding text vectors;
and respectively carrying out similarity matching on each text vector and the title vector to obtain corresponding similarity values.
5. The method of claim 1, wherein determining the relative relationship of the duration of the sub-segment in the descriptive segment to the duration of the target video based on the similarity value comprises:
sequencing the similarity values based on the sequence of each sentence text in the content text to obtain a first sequence containing a first number of similarity values and a second sequence containing a second number of similarity values;
and determining the relative relation between the time length of the sub-segment in the descriptive segment and the time length of the target video based on the first sequence and the second sequence.
6. The method of claim 5, wherein determining the relative relationship of the duration of the sub-segment in the descriptive segment to the duration of the target video based on the first sequence and the second sequence comprises:
extracting a maximum similarity value from the first sequence as a first similarity value and extracting a maximum similarity value from the second sequence as a second similarity value;
comparing the first similarity value with the second similarity value to obtain a comparison result;
and determining the relative relation between the time length of the sub-segments in the descriptive segments and the time length of the target video based on the comparison result.
7. The method of claim 5, wherein determining the relative relationship of the duration of the sub-segment in the descriptive segment to the duration of the target video based on the first sequence and the second sequence comprises:
carrying out weighted averaging on the similarity values of the first quantity to obtain a corresponding third similarity value, and carrying out weighted averaging on the similarity values of the second quantity to obtain a corresponding fourth similarity value;
comparing the third similarity value with the fourth similarity value to obtain a comparison result;
and determining the relative relation between the time length of the sub-segments in the descriptive segments and the time length of the target video based on the comparison result.
8. The method of claim 1, wherein determining the relative relationship of the duration of the sub-segment in the descriptive segment to the duration of the target video based on the similarity value comprises:
sequencing the similarity values based on the sequence of each sentence text in the content text to obtain a corresponding similarity value sequence;
sequentially comparing the similarity values in the similarity sequence with a similarity threshold value, and determining the sequence number of the first similarity value exceeding the similarity threshold value in the similarity value sequence;
and determining the relative relation between the time length of the sub-segment in the descriptive segment and the time length of the target video based on the sequence number and the similarity value sequence.
9. The method of claim 1, wherein the method further comprises:
acquiring the ratio of the time length of the sub-segment in the descriptive segment to the time length of the target video;
and when the ratio does not exceed a proportion threshold, adding the target video into a video library to be recommended.
10. An apparatus for processing video data, the apparatus comprising:
the first acquisition module is used for acquiring a title text and a content text of a target video;
the detection module is used for detecting the sentence smoothness of the content text to obtain the sentence smoothness corresponding to the content text;
a second obtaining module, configured to obtain, when it is determined based on the sentence smoothness that a descriptive segment for describing a video picture exists in the target video, a plurality of clause texts corresponding to the content text; the descriptive segment comprises a sub-segment of which the content subject is independent of the content subject of the target video;
the matching module is used for respectively carrying out similarity matching on each clause text and the title text to obtain a plurality of corresponding similarity values;
and the determining module is used for determining the relative relation between the duration of the sub-segment in the descriptive segment and the duration of the target video based on the similarity value.
11. An electronic device for video processing, comprising a processor and a memory, the memory for storing executable instructions, the processor for retrieving the executable instructions in the memory and performing the method as claimed in any one of claims 1-9.
12. A storage medium comprising stored executable instructions, wherein the executable instructions when executed perform the method of any one of claims 1 to 9.
CN201911111883.2A 2019-11-14 2019-11-14 Video data processing method and device, electronic equipment and storage medium Active CN110929098B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911111883.2A CN110929098B (en) 2019-11-14 2019-11-14 Video data processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911111883.2A CN110929098B (en) 2019-11-14 2019-11-14 Video data processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110929098A true CN110929098A (en) 2020-03-27
CN110929098B (en) 2023-04-07

Family

ID=69853908

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911111883.2A Active CN110929098B (en) 2019-11-14 2019-11-14 Video data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110929098B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170150235A1 (en) * 2015-11-20 2017-05-25 Microsoft Technology Licensing, Llc Jointly Modeling Embedding and Translation to Bridge Video and Language
CN108509465A (en) * 2017-02-28 2018-09-07 阿里巴巴集团控股有限公司 A kind of the recommendation method, apparatus and server of video data
CN108874777A (en) * 2018-06-11 2018-11-23 北京奇艺世纪科技有限公司 A kind of method and device of text anti-spam
CN109508406A (en) * 2018-12-12 2019-03-22 北京奇艺世纪科技有限公司 A kind of information processing method, device and computer readable storage medium
CN109614604A (en) * 2018-12-17 2019-04-12 北京百度网讯科技有限公司 Subtitle processing method, device and storage medium
CN109978021A (en) * 2019-03-07 2019-07-05 北京大学深圳研究生院 A kind of double-current method video generation method based on text different characteristic space

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIANG, LULU: "Research on Evolutionary Multi-Document Summarization Generation Based on Sub-topic Enhancement", China Master's Theses Full-text Database, Information Science and Technology *
TANG, PENGJIE ET AL.: "Image Captioning with LSTM Layer-wise Multi-objective Optimization and Multi-layer Probability Fusion", Acta Automatica Sinica *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111767796B (en) * 2020-05-29 2023-12-15 北京奇艺世纪科技有限公司 Video association method, device, server and readable storage medium
CN111767796A (en) * 2020-05-29 2020-10-13 北京奇艺世纪科技有限公司 Video association method, device, server and readable storage medium
WO2021114836A1 (en) * 2020-06-28 2021-06-17 平安科技(深圳)有限公司 Text coherence determining method, apparatus, and device, and medium
CN111984845A (en) * 2020-08-17 2020-11-24 江苏百达智慧网络科技有限公司 Website wrongly-written character recognition method and system
CN111984845B (en) * 2020-08-17 2023-10-31 江苏百达智慧网络科技有限公司 Website wrongly written word recognition method and system
CN112203140B (en) * 2020-09-10 2022-04-01 北京达佳互联信息技术有限公司 Video editing method and device, electronic equipment and storage medium
CN112203140A (en) * 2020-09-10 2021-01-08 北京达佳互联信息技术有限公司 Video editing method and device, electronic equipment and storage medium
CN112307770A (en) * 2020-10-13 2021-02-02 深圳前海微众银行股份有限公司 Sensitive information detection method and device, electronic equipment and storage medium
CN112182166A (en) * 2020-10-29 2021-01-05 腾讯科技(深圳)有限公司 Text matching method and device, electronic equipment and storage medium
CN113762040A (en) * 2021-04-29 2021-12-07 腾讯科技(深圳)有限公司 Video identification method and device, storage medium and computer equipment
CN113762040B (en) * 2021-04-29 2024-05-10 腾讯科技(深圳)有限公司 Video identification method, device, storage medium and computer equipment
CN113326400A (en) * 2021-06-29 2021-08-31 合肥高维数据技术有限公司 Model evaluation method and system based on depth counterfeit video detection
CN113326400B (en) * 2021-06-29 2024-01-12 合肥高维数据技术有限公司 Evaluation method and system of model based on depth fake video detection
CN114302227A (en) * 2021-12-28 2022-04-08 北京智美互联科技有限公司 Method and system for collecting and analyzing network video based on container collection
CN114302227B (en) * 2021-12-28 2024-04-26 北京国瑞数智技术有限公司 Method and system for collecting and analyzing network video based on container collection
CN114222196A (en) * 2022-01-04 2022-03-22 阿里巴巴新加坡控股有限公司 Method and device for generating short video of plot commentary and electronic equipment
CN114390217A (en) * 2022-01-17 2022-04-22 腾讯科技(深圳)有限公司 Video synthesis method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN110929098B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN110929098B (en) Video data processing method and device, electronic equipment and storage medium
CN111341341B (en) Training method of audio separation network, audio separation method, device and medium
WO2021174890A1 (en) Data recommendation method and apparatus, and computer device and storage medium
CN111026861B (en) Text abstract generation method, training device, training equipment and medium
CN112182166A (en) Text matching method and device, electronic equipment and storage medium
CN113297370B (en) End-to-end multi-modal question-answering method and system based on multi-interaction attention
CN112100438A (en) Label extraction method and device and computer readable storage medium
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
CN113395578A (en) Method, device and equipment for extracting video theme text and storage medium
CN114611498A (en) Title generation method, model training method and device
CN113761153A (en) Question and answer processing method and device based on picture, readable medium and electronic equipment
CN113392640A (en) Title determining method, device, equipment and storage medium
CN113392265A (en) Multimedia processing method, device and equipment
CN112818251A (en) Video recommendation method and device, electronic equipment and storage medium
CN112749556B (en) Multi-language model training method and device, storage medium and electronic equipment
CN113656561A (en) Entity word recognition method, apparatus, device, storage medium and program product
CN113392179A (en) Text labeling method and device, electronic equipment and storage medium
CN118035945B (en) Label recognition model processing method and related device
CN114282055A (en) Video feature extraction method, device and equipment and computer storage medium
CN110516153B (en) Intelligent video pushing method and device, storage medium and electronic device
CN113542797A (en) Interaction method and device in video playing and computer readable storage medium
CN116977992A (en) Text information identification method, apparatus, computer device and storage medium
CN115269961A (en) Content search method and related device
CN115130461A (en) Text matching method and device, electronic equipment and storage medium
CN113821610A (en) Information matching method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40022510; Country of ref document: HK)
GR01 Patent grant