CN115035453A - Video title and tail identification method, device and equipment and readable storage medium - Google Patents

Video title and tail identification method, device and equipment and readable storage medium

Info

Publication number
CN115035453A
Authority
CN
China
Prior art keywords
time
matching
sound data
target
matching result
Prior art date
Legal status
Pending
Application number
CN202210692408.4A
Other languages
Chinese (zh)
Inventor
张楠
冯海洋
李征
张晓迪
孙方明
***
Current Assignee
Beijing Shida Technology Co ltd
Original Assignee
Beijing Shida Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Shida Technology Co ltd filed Critical Beijing Shida Technology Co ltd
Priority to CN202210692408.4A
Publication of CN115035453A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V10/422 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation for representing the structure of the pattern or shape of an object therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/56 Extraction of image or video features relating to colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/48 Matching video sequences
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The invention provides a video title and tail identification method, apparatus, device and readable storage medium, relating to the technical field of video identification. The method comprises the following steps: acquiring a first time range of scene switching of a target video according to identification information of the target video, wherein the identification information comprises at least one of sound, image and character identification information; matching multiple frames of first images within a first duration before a first time in the first time range with multiple frames of second images within the first duration after the first time to obtain a first matching result; matching first sound data within the first duration before the first time with second sound data within the first duration after the first time to obtain a second matching result; and acquiring the title end time or the tail start time of the target video according to the first matching result and the second matching result. The scheme of the invention can identify the title and tail of a video efficiently and accurately.

Description

Video title and tail identification method, device and equipment and readable storage medium
Technical Field
The invention belongs to the technical field of video identification, and in particular relates to a video title and tail identification method, device, equipment and readable storage medium.
Background
Existing video playing software can offer users the option to skip the title and tail of a video, which improves the user experience and saves watching time. However, current video title and tail identification methods generally rely on manual labeling or video fingerprint technology.
Manual labeling consumes a large amount of manpower: annotators must watch each video and manually mark the time points of the title and the tail, so identification is inefficient and costly. Video fingerprint technology detects video fingerprints only within a preset detection time window and then determines the title and tail time points from the detection result; because it depends entirely on the video fingerprints, the fingerprint detection result strongly affects identification accuracy.
Disclosure of Invention
The embodiment of the invention aims to provide a video title and tail identification method, apparatus, device and readable storage medium, so as to solve the problems of low efficiency and low accuracy in identifying video titles and tails in the prior art.
In order to achieve the above object, an embodiment of the present invention provides a video title and tail identification method, including:
acquiring a first time range of scene switching of a target video according to identification information of the target video; wherein the identification information comprises at least one of sound, image and character identification information;
matching a plurality of frames of first images in a first time length before the first time in the first time range with a plurality of frames of second images in the first time length after the first time to obtain a first matching result;
matching first sound data in a first time length before a first time in the first time range with second sound data in the first time length after the first time to obtain a second matching result;
and acquiring the title end time or the tail start time of the target video according to the first matching result and the second matching result.
Optionally, in the video title and tail identification method, the acquiring of the first time range of scene switching of the target video according to the identification information of the target video includes:
dividing third sound data within a second duration at the beginning of the target video into a plurality of sub sound data, or dividing third sound data within a second duration at the end of the target video into a plurality of sub sound data;
acquiring, at a preset interval frame number, multiple frames of images from the video segment within the second duration at the beginning of the target video, or from the video segment within the second duration at the end of the target video;
acquiring a first time range of scene switching of the target video based on at least one of the following items:
a degree of scene matching between the sound identification information of the plurality of sub-sound data;
scene matching degree among the image identification information of the multi-frame images;
and the scene matching degree among the character recognition information of the multi-frame images.
Optionally, in the video title and tail identification method, the matching of multiple frames of first images within a first duration before the first time in the first time range with multiple frames of second images within the first duration after the first time to obtain a first matching result includes:
carrying out target segmentation on the multi-frame first image, and acquiring positioning information and image characteristic information of the segmented first target;
establishing a motion trail model of the first target according to the positioning information and the image characteristic information of the first target in the multiple frames of first images;
acquiring the predicted position of the first target in the plurality of frames of second images according to the motion trail model;
and matching the position of the first target in the multiple frames of second images with the predicted position to obtain the first matching result.
Optionally, after the target segmentation is performed on the multiple frames of first images, the video title and tail identification method further includes:
acquiring, by using preset shape feature information, a target in the multiple frames of first images whose color difference from the background is smaller than a preset threshold.
Optionally, in the video title and tail identification method, the matching of first sound data within a first duration before a first time in the first time range with second sound data within the first duration after the first time to obtain a second matching result includes:
acquiring a first scene corresponding to the first sound data and a second scene corresponding to the second sound data by identifying acoustic information and semantic information in the first sound data and the second sound data;
and matching the first scene with the second scene to obtain a second matching result.
Optionally, in the video title and tail identification method, the acquiring of the title end time or the tail start time of the target video according to the first matching result and the second matching result includes:
acquiring a second time range corresponding to at least one frame of second image of which the first matching result is smaller than a first matching threshold;
obtaining a third time range corresponding to at least one second sound data of which the second matching result is smaller than a second matching threshold;
and acquiring the title end time or the tail start time based on the second time range and the third time range.
Optionally, in the video title and tail identification method, the acquiring of the title end time or the tail start time based on the second time range and the third time range includes:
acquiring the title end time or the tail start time according to the overlap time of the second time range and the third time range.
In order to achieve the above object, an embodiment of the present invention further provides a video title and tail identification apparatus, including:
the first acquisition module is used for acquiring a first time range of scene switching of a target video according to the identification information of the target video; wherein the identification information comprises at least one of sound, image and character identification information;
the second obtaining module is used for matching a plurality of frames of first images in a first time length before the first time in the first time range with a plurality of frames of second images in the first time length after the first time to obtain a first matching result;
a third obtaining module, configured to match first sound data in a first duration before a first time in the first time range with second sound data in the first duration after the first time, and obtain a second matching result;
and the fourth acquisition module is used for acquiring the title end time or the tail start time of the target video according to the first matching result and the second matching result.
In order to achieve the above object, an embodiment of the present invention further provides a video title and tail identification device, including a transceiver, a processor, a memory, and a program or instructions stored in the memory and executable on the processor; when executing the program or the instructions, the processor implements the steps in the video title and tail identification method described in any one of the above.
In order to achieve the above object, an embodiment of the present invention further provides a readable storage medium for storing a computer program which, when executed by a processor, implements the steps in the video title and tail identification method described above.
The technical scheme of the invention at least has the following beneficial effects:
In the above scheme of the embodiment of the present invention, the first time range in which a scene switch occurs in the target video is acquired according to the identification information of the target video; multiple frames of first images within a first duration before a first time in the first time range are matched with multiple frames of second images within the first duration after the first time to obtain a first matching result; first sound data within the first duration before the first time are matched with second sound data within the first duration after the first time to obtain a second matching result; and the title end time or the tail start time of the target video is acquired according to the first matching result and the second matching result. That is, the title end time or the tail start time is analyzed and identified through both image identification and sound identification technologies, which gives the method high identification efficiency and high identification accuracy.
Drawings
Fig. 1 is a flowchart of a video title and tail identification method according to an embodiment of the present invention;
fig. 2 is a schematic application diagram of a video title and tail identification method according to an embodiment of the present invention;
fig. 3 is a structural diagram of a video title and tail identification apparatus according to an embodiment of the present invention;
fig. 4 is a structural diagram of a video title and tail identification device according to an embodiment of the present invention.
Detailed Description
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
In various embodiments of the present invention, it should be understood that the sequence numbers of the following processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In addition, the terms "system" and "network" are often used interchangeably herein.
In the embodiments provided herein, it should be understood that "B corresponding to A" means that B is associated with A, and that B can be determined from A. It should also be understood that determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information.
Referring to fig. 1, fig. 1 is a flowchart of a video title and tail identification method according to an embodiment of the present invention. As shown in fig. 1, the method includes the following steps:
step 101, acquiring a first time range of scene switching of a target video according to identification information of the target video; wherein the identification information comprises at least one of sound, image and character identification information;
In this step, a feature vector can be formed from the input sound identification information, image identification information and character identification information; the feature vector is then input into a supervised deep learning model to determine the scene of the target video and to output the first time range in which a scene switch occurs. The sound identification information may include music, human dialogue, voice-over and the like; the image identification information may include people, places, environments and the like. The first time range of scene switching of the target video can thus be identified automatically by the deep learning model, which improves identification efficiency. The deep learning model is trained in advance.
It should be noted that this step may yield two first time ranges of scene switching of the target video, corresponding to the title and the tail, respectively.
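As an illustration of this step, the following is a minimal sketch, assuming a pre-trained classifier (here a hypothetical `scene_model` with an sklearn-style `predict_proba` interface) that scores whether two adjacent analysis windows belong to the same scene; the feature vectors stand in for the sound, image and character identification information described above, and all names are illustrative rather than taken from the patent.

```python
import numpy as np

def find_scene_switch_range(windows, scene_model, threshold=0.5):
    """Return the candidate time range whose adjacent windows fail to match.

    `windows` is a list of (start_time, end_time, feature_vector) tuples built
    from the sound, image and character identification information of the video;
    `scene_model` is assumed to expose an sklearn-style predict_proba().
    """
    boundaries = []
    for (s0, e0, f0), (s1, e1, f1) in zip(windows, windows[1:]):
        pair = np.concatenate([f0, f1]).reshape(1, -1)
        same_scene = scene_model.predict_proba(pair)[0, 1]  # P(same scene)
        if same_scene < threshold:          # a scene switch lies between the windows
            boundaries.append((e0, s1))
    if not boundaries:
        return None
    # merge all low-match boundaries into one candidate "first time range"
    return (min(t for t, _ in boundaries), max(t for _, t in boundaries))
```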
102, matching multiple frames of first images in a first time length before a first time in the first time range with multiple frames of second images in the first time length after the first time to obtain a first matching result;
In this step, image processing is performed within the first time range obtained in step 101 to obtain a matching result between the frame images before and after the first time, so that the time corresponding to a frame image whose preceding and following images do not match can be located.
103, matching first sound data in a first time length before a first time in the first time range with second sound data in the first time length after the first time to obtain a second matching result;
In this step, sound processing is performed within the first time range obtained in step 101 to obtain a matching result between the sound data before and after the first time, so that the time corresponding to sound data whose preceding and following sounds do not match can be located.
104, acquiring the title end time or the tail start time of the target video according to the first matching result and the second matching result.
It should be noted that, in the embodiment of the present invention, based on the first matching result and the second matching result, the scene recognition result obtained by image processing of the multiple frames of first images and second images and the scene recognition result obtained by sound processing of the first sound data and the second sound data are acquired; these two scene recognition results are then input into a comparison model, which outputs the title end time or the tail start time.
In the embodiment of the invention, the first time range of scene switching of the target video is acquired according to the identification information of the target video; multiple frames of first images within a first duration before the first time in the first time range are matched with multiple frames of second images within the first duration after the first time to obtain a first matching result; first sound data within the first duration before the first time are matched with second sound data within the first duration after the first time to obtain a second matching result; and the title end time or the tail start time of the target video is acquired according to the two matching results. The method analyzes and identifies the title end time or the tail start time through both image identification and sound identification, and therefore achieves high identification efficiency and high identification accuracy.
In an optional embodiment of the present invention, step 101 includes:
dividing third sound data within a second duration at the beginning of the target video into a plurality of sub sound data, or dividing third sound data within a second duration at the end of the target video into a plurality of sub sound data;
acquiring, at a preset interval frame number, multiple frames of images from the video clip within the second duration at the beginning of the target video, or from the video clip within the second duration at the end of the target video;
acquiring a first time range of scene switching of the target video based on at least one of the following items:
a degree of scene matching between the sound identification information of the plurality of sub-sound data;
scene matching degree among the image identification information of the multi-frame images;
and the scene matching degree among the character recognition information of the multi-frame images.
The second duration is generally determined according to the typical duration of a video title or tail. The third sound data and the video clip within the second duration at the beginning of the target video are both used for identifying the first time range of the title of the target video; the third sound data and the video clip within the second duration at the end of the target video are both used for identifying the first time range of the tail of the target video.
The plurality of sub sound data or the multiple frames of images are divided into comparison groups, each group containing two sub sound data or two frames of images; each comparison group is then input into the deep learning model, which outputs a scene matching degree.
Based on the scene matching degree of at least one of the sound identification information, the image identification information and the character identification information, the time range corresponding to at least two sub sound data or at least two frames of images whose scene matching degree is smaller than a preset matching threshold is determined as the first time range.
The character recognition information is obtained based on character recognition of a frame image.
That is, this step 101 may be understood as acquiring the first time range by a video fingerprinting technique.
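For concreteness, below is a minimal sketch of the sound splitting and frame sampling described above, using OpenCV and NumPy; the second duration and interval frame number are illustrative values, not ones specified by the patent.

```python
import cv2
import numpy as np

def sample_frames(video_path, second_duration_s=120.0, interval_frames=25, from_start=True):
    """Grab every `interval_frames`-th frame inside the leading (or trailing)
    `second_duration_s` seconds of the video (the preset interval frame number)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    span = int(second_duration_s * fps)
    start = 0 if from_start else max(0, total - span)
    frames = []
    for idx in range(start, min(start + span, total), interval_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append((idx / fps, frame))  # (timestamp in seconds, image)
    cap.release()
    return frames

def split_sound(samples, n_segments):
    """Divide the third sound data into `n_segments` equal sub sound data."""
    return np.array_split(np.asarray(samples), n_segments)
```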
In an alternative embodiment of the present invention, step 102 includes:
carrying out target segmentation on the multi-frame first image, and acquiring positioning information and image characteristic information of the segmented first target;
establishing a motion trail model of the first target according to the positioning information and the image characteristic information of the first target in the multiple frames of first images;
according to the motion trail model, the predicted position of the first target in the multiple frames of second images is obtained;
and matching the position of the first target in the plurality of frames of second images with the predicted position to obtain the first matching result.
It should be noted that target segmentation extracts the subset of pixels in an image that represents a known target. Because the title or tail of the video is being identified here, the target that may appear in the title or tail, that is, the part of the pixels corresponding to the first target, is segmented from each frame of the multiple frames of first images.
Then, position detection is performed on the first target in the segmented first images, and the position and size of the first target in the first images are determined; that is, the positioning information and the image characteristic information of the first target are determined, so that the motion trail model of the first target can be established.
Further, the position of the first target in the multiple frames of second images is predicted according to the motion trail model, the actual position of the first target in the second images is identified, and the identified position is matched with the predicted position to obtain a matching value, namely the first matching result.
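A minimal sketch of this trajectory-based matching follows, under the assumption that the first target has already been segmented and its center located in every frame; a simple constant-velocity fit stands in for the motion trail model, which the patent does not specify further.

```python
import numpy as np

def fit_trajectory(times, centers):
    """Fit x(t) and y(t) as straight lines: a constant-velocity motion trail model."""
    t = np.asarray(times, dtype=float)
    c = np.asarray(centers, dtype=float)       # shape (n, 2): (x, y) target centers
    kx, bx = np.polyfit(t, c[:, 0], 1)
    ky, by = np.polyfit(t, c[:, 1], 1)
    return lambda tt: np.array([kx * tt + bx, ky * tt + by])

def first_matching_result(model, second_times, second_centers, tol_px=30.0):
    """Fraction of second images whose detected target lies within `tol_px`
    pixels of the predicted position: one way to express the first matching result."""
    hits = [np.linalg.norm(model(t) - np.asarray(c)) <= tol_px
            for t, c in zip(second_times, second_centers)]
    return sum(hits) / len(hits)
```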
In an optional embodiment of the present invention, after performing the target segmentation on the multiple frames of the first image, the method further includes:
and acquiring, by using preset shape feature information, a target in the multiple frames of first images whose color difference from the background is smaller than a preset threshold.
It should be noted that the segmented target may not be the target of interest in the title or tail (that is, not the first target), or the segmentation may be inaccurate because the color difference between a target and the background is smaller than the preset threshold. It is therefore necessary to use preset shape feature information to additionally locate, in the multiple frames of first images after target segmentation, targets whose color difference from the background is smaller than the preset threshold, thereby refining the segmentation result and selecting the first target from the multiple segmented targets.
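One possible realisation of this refinement is contour-based shape matching, sketched below with OpenCV; the Canny thresholds and the match threshold are assumptions for illustration. The idea is that edges can survive even where the color contrast with the background is small.

```python
import cv2

def find_low_contrast_target(frame_gray, shape_template, match_thresh=0.2):
    """Locate a target whose color barely differs from the background by
    matching contour shape (the preset shape feature information) instead of color."""
    edges = cv2.Canny(frame_gray, 30, 90)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    best = None
    for contour in contours:
        score = cv2.matchShapes(shape_template, contour, cv2.CONTOURS_MATCH_I1, 0.0)
        if score < match_thresh and (best is None or score < best[0]):
            best = (score, cv2.boundingRect(contour))
    return best  # (shape match score, bounding box) or None if nothing matches
```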
In an alternative embodiment of the present invention, step 103 includes:
acquiring a first scene corresponding to the first sound data and a second scene corresponding to the second sound data by identifying acoustic information and semantic information in the first sound data and the second sound data;
and matching the first scene with the second scene to obtain a second matching result.
It should be noted that, by recognizing the acoustic information and semantic information in the sound data, the human voice and the accompaniment are converted into text whose context is analyzed, so as to acquire the scene corresponding to the sound data.
The first scene corresponding to the first sound data is then matched with the second scene corresponding to the second sound data to obtain a matching value, namely the second matching result.
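A sketch of this sound-side comparison, assuming two off-the-shelf helpers are available (`transcribe(audio) -> str` wrapping any ASR engine and `embed(text) -> np.ndarray` wrapping any text-embedding model; both names are placeholders): the cosine similarity of the two scene descriptions serves as the second matching result.

```python
import numpy as np

def second_matching_result(first_audio, second_audio, transcribe, embed):
    """Compare the scenes of two sound clips via their transcripts."""
    # human voice and accompaniment -> text, per the acoustic/semantic recognition step
    text_a, text_b = transcribe(first_audio), transcribe(second_audio)
    va, vb = embed(text_a), embed(text_b)
    cos = np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-9)
    return float(cos)  # higher means the two scenes match better
```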
In an alternative embodiment of the present invention, step 104 includes:
acquiring a second time range corresponding to at least one frame of second image of which the first matching result is smaller than a first matching threshold;
obtaining a third time range corresponding to at least one second sound data of which the second matching result is smaller than a second matching threshold;
and acquiring the title end time or the tail start time based on the second time range and the third time range.
Wherein the first matching threshold and the second matching threshold are determined based on empirical or measured values.
The overlap time of the second time range and the third time range is analyzed and compared to obtain the title end time or the tail start time. Since the title start time is the start time of the target video and the tail end time is the end time of the target video, the time range of the title and the time range of the tail can then be determined.
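The fusion of the two ranges can be sketched as an interval intersection; reading the title end time off the end of the overlap (and the tail start time off its start) is one plausible interpretation of the overlap rule above.

```python
def fuse_time_ranges(second_range, third_range, is_title=True):
    """Intersect the low-match image time range and the low-match sound time range."""
    start = max(second_range[0], third_range[0])
    end = min(second_range[1], third_range[1])
    if start > end:
        return None                      # no overlap time: detection is inconclusive
    # the title ends where the overlap ends; the tail starts where the overlap starts
    return end if is_title else start
```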
It should be noted that, in the embodiment of the present invention, based on the first matching result and the second matching result, the scene recognition result obtained by image processing of the multiple frames of first images and second images and the scene recognition result obtained by sound processing of the first sound data and the second sound data may also be acquired, and these two scene recognition results may then be input into the comparison model to output the title end time or the tail start time.
Fig. 2 is a schematic application diagram of the video title and tail identification method according to an embodiment of the present invention. The application flow of the method is described below with reference to fig. 2.
Step 201, inputting a target video.
Step 202, performing video retrieval by using a video fingerprint technology, and determining the time range of the title or the tail, namely the first time range.
Step 203, extracting a plurality of frames of first images and a plurality of frames of second images in a first time range of the target video.
Step 204, extracting first sound data and second sound data in a first time range of the target video.
Step 205, target segmentation: segmenting the target in each frame of the first images and each frame of the second images.
Step 206, target detection: detecting the position and size of the target in the first images and the second images respectively, and determining the positioning information and image characteristic information of the target in the first images and the second images.
Step 207, target identification: qualitatively identifying the target in the second images based on the image characteristic information, and associating the target in the first images with the target in the second images.
Step 208, target tracking: constructing the motion trail model based on the positioning information of the first target in the first images, predicting the position of the first target in the second images according to the model, and matching the predicted position against the identified position of the first target in the second images.
Step 209, obtain the first matching result.
Step 210, respectively performing acoustic information identification on the first sound data and the second sound data.
In step 211, semantic information recognition is performed on the first sound data and the second sound data, respectively.
Step 212, based on the identified scene, a second matching result is obtained.
Step 213, inputting the first matching result and the second matching result into the comparison model.
Step 214, outputting the title end time or the tail start time.
In summary, the video title and tail identification method provided by the embodiment of the invention combines image identification and sound identification technologies, improves the efficiency and accuracy of identifying the title and tail of a video, saves a large amount of manual labeling work, and reduces cost.
As shown in fig. 3, an embodiment of the present invention further provides a video title and tail identification apparatus, including:
a first obtaining module 301, configured to obtain a first time range of scene switching of a target video according to identification information of the target video; wherein the identification information comprises at least one of sound, image and character identification information;
a second obtaining module 302, configured to match multiple frames of first images in a first duration before a first time in the first time range with multiple frames of second images in the first duration after the first time, and obtain a first matching result;
a third obtaining module 303, configured to match the first sound data in the first duration before the first time in the first time range with the second sound data in the first duration after the first time, and obtain a second matching result;
a fourth obtaining module 304, configured to obtain the title end time or the tail start time of the target video according to the first matching result and the second matching result.
In the embodiment of the invention, the first time range of scene switching of the target video is acquired according to the identification information of the target video; multiple frames of first images within a first duration before the first time in the first time range are matched with multiple frames of second images within the first duration after the first time to obtain a first matching result; first sound data within the first duration before the first time are matched with second sound data within the first duration after the first time to obtain a second matching result; and the title end time or the tail start time of the target video is acquired according to the two matching results. The apparatus analyzes and identifies the title end time or the tail start time through both image identification and sound identification, and therefore achieves high identification efficiency and high identification accuracy.
Optionally, in the video title and tail identification apparatus, the first obtaining module 301 is specifically configured to:
dividing third sound data within a second duration at the beginning of the target video into a plurality of sub sound data, or dividing third sound data within a second duration at the end of the target video into a plurality of sub sound data;
acquiring, at a preset interval frame number, multiple frames of images from the video segment within the second duration at the beginning of the target video, or from the video segment within the second duration at the end of the target video;
acquiring a first time range of scene switching of the target video based on at least one of the following items:
a degree of scene matching between the sound identification information of the plurality of sub-sound data;
scene matching degree among the image identification information of the multi-frame images;
and the scene matching degree among the character recognition information of the multi-frame images.
Optionally, in the video title and tail identification apparatus, the second obtaining module 302 is specifically configured to:
carrying out target segmentation on the multi-frame first image, and acquiring positioning information and image characteristic information of the segmented first target;
establishing a motion trail model of the first target according to the positioning information and the image characteristic information of the first target in the multiple frames of first images;
acquiring the predicted position of the first target in the plurality of frames of second images according to the motion trail model;
and matching the position of the first target in the multiple frames of second images with the predicted position to obtain the first matching result.
Optionally, the video title and tail identification apparatus further includes:
a fifth obtaining module, configured to acquire, by using preset shape feature information, a target in the multiple frames of first images whose color difference from the background is smaller than a preset threshold.
Optionally, in the video title and tail identification apparatus, the third obtaining module 303 is specifically configured to:
acquiring a first scene corresponding to the first sound data and a second scene corresponding to the second sound data by identifying acoustic information and semantic information in the first sound data and the second sound data;
and matching the first scene with the second scene to obtain a second matching result.
Optionally, in the video title and tail identification apparatus, the fourth obtaining module 304 includes:
the first obtaining unit is used for obtaining a second time range corresponding to at least one frame of second image of which the first matching result is smaller than a first matching threshold;
the second obtaining unit is used for obtaining a third time range corresponding to at least one second sound data of which the second matching result is smaller than a second matching threshold;
a third obtaining unit, configured to obtain the title end time or the tail start time based on the second time range and the third time range.
Optionally, in the video title and tail identification apparatus, the third obtaining unit is specifically configured to:
acquire the title end time or the tail start time according to the overlap time of the second time range and the third time range.
It should be noted that the apparatus provided in the embodiment of the present invention can implement all the method steps of the video title and tail identification method embodiment and achieve the same technical effects; the parts and beneficial effects identical to those of the method embodiment are not described herein again.
An embodiment of the present invention further provides a video title and tail identification device, as shown in fig. 4, including: a processor 401; and a memory 402 connected to the processor 401 through a bus interface, where the memory 402 is used for storing programs and data used by the processor 401 when executing operations, and the processor 401 calls and executes the programs and data stored in the memory 402.
The processor 401 is used for reading the program in the memory 402 and executing the following processes:
acquiring a first time range of scene switching of a target video according to identification information of the target video; wherein the identification information comprises at least one of sound, image and character identification information;
matching a plurality of frames of first images in a first time length before the first time in the first time range with a plurality of frames of second images in the first time length after the first time to obtain a first matching result;
matching first sound data in a first time length before a first time in the first time range with second sound data in the first time length after the first time to obtain a second matching result;
and acquiring the title end time or the tail start time of the target video according to the first matching result and the second matching result.
A transceiver 403 is coupled to the bus interface for receiving and transmitting data under the control of the processor 401.
In fig. 4, the bus architecture may include any number of interconnected buses and bridges, linking together one or more processors, represented by processor 401, and various memory circuits, represented by memory 402. The bus architecture may also link together various other circuits, such as peripherals, voltage regulators and power management circuits, which are well known in the art and therefore are not described further herein. The bus interface provides an interface. The transceiver 403 may comprise a number of elements, including a transmitter and a receiver, and provides a means for communicating with various other apparatus over a transmission medium. Depending on the user device, the user interface 404 may also be an interface capable of connecting to desired external devices, including but not limited to a keypad, a display, a speaker, a microphone, a joystick and the like.
The processor 401 is responsible for managing the bus architecture and general processing, and the memory 402 may store data used by the processor 401 in performing operations.
Optionally, the processor 401 is further configured to read the computer program and execute the following steps:
dividing third sound data within a second duration at the beginning of the target video into a plurality of sub sound data, or dividing third sound data within a second duration at the end of the target video into a plurality of sub sound data;
acquiring, at a preset interval frame number, multiple frames of images from the video clip within the second duration at the beginning of the target video, or from the video clip within the second duration at the end of the target video;
acquiring a first time range of scene switching of the target video based on at least one of the following items:
a degree of scene matching between the sound identification information of the plurality of sub-sound data;
scene matching degree among the image identification information of the multi-frame images;
and the scene matching degree among the character recognition information of the multi-frame images.
Optionally, the processor 401 is further configured to read the computer program and execute the following steps:
performing target segmentation on the multi-frame first image, and acquiring positioning information and image characteristic information of the segmented first target;
establishing a motion trail model of the first target according to the positioning information and the image characteristic information of the first target in the multiple frames of first images;
acquiring the predicted position of the first target in the plurality of frames of second images according to the motion trail model;
and matching the position of the first target in the plurality of frames of second images with the predicted position to obtain the first matching result.
Optionally, the processor 401 is further configured to read the computer program and execute the following steps:
and acquiring, by using preset shape feature information, a target in the multiple frames of first images whose color difference from the background is smaller than a preset threshold.
Optionally, the processor 401 is further configured to read the computer program and execute the following steps:
acquiring a first scene corresponding to the first sound data and a second scene corresponding to the second sound data by identifying acoustic information and semantic information in the first sound data and the second sound data;
and matching the first scene with the second scene to obtain a second matching result.
Optionally, the processor 401 is further configured to read the computer program and execute the following steps:
acquiring a second time range corresponding to at least one frame of second image of which the first matching result is smaller than a first matching threshold;
obtaining a third time range corresponding to at least one second sound data of which the second matching result is smaller than a second matching threshold;
and acquiring the title end time or the tail start time based on the second time range and the third time range.
Optionally, the processor 401 is further configured to read the computer program and execute the following steps:
and acquiring the title end time or the tail start time according to the overlap time of the second time range and the third time range.
It should be noted that the device provided in the embodiment of the present invention can implement all the method steps of the video title and tail identification method embodiment and achieve the same technical effects; the parts and beneficial effects identical to those of the method embodiment are not described herein again.
Those skilled in the art will understand that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program includes instructions for executing part or all of the steps of the above methods; and the program may be stored in a readable storage medium, which may be any form of storage medium.
An embodiment of the present invention further provides a readable storage medium storing a program which, when executed by a processor, implements the video title and tail identification method described in any one of the above.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be physically included alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to perform some steps of the transceiving method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a portable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other media capable of storing program codes.
While the preferred embodiments of the present invention have been described, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the following claims.

Claims (10)

1. A video title and tail identification method is characterized by comprising the following steps:
acquiring a first time range of scene switching of a target video according to identification information of the target video; wherein the identification information comprises at least one of sound, image and character identification information;
matching a plurality of frames of first images in a first time length before the first time in the first time range with a plurality of frames of second images in the first time length after the first time to obtain a first matching result;
matching first sound data in a first time length before a first time in the first time range with second sound data in the first time length after the first time to obtain a second matching result;
and acquiring the title end time or the tail start time of the target video according to the first matching result and the second matching result.
2. The method according to claim 1, wherein the obtaining a first time range of scene switching of the target video according to the identification information of the target video comprises:
dividing third sound data within a second duration at the beginning of the target video into a plurality of sub sound data, or dividing third sound data within a second duration at the end of the target video into a plurality of sub sound data;
acquiring, at a preset interval frame number, multiple frames of images from the video segment within the second duration at the beginning of the target video, or from the video segment within the second duration at the end of the target video;
acquiring a first time range of scene switching of the target video based on at least one of the following items:
a degree of scene matching between the sound identification information of the plurality of sub-sound data;
scene matching degree among the image identification information of the multi-frame images;
and the scene matching degree among the character recognition information of the multi-frame images.
3. The method according to claim 1, wherein the matching a plurality of first images in a first time period before a first time in the first time range with a plurality of second images in the first time period after the first time to obtain a first matching result comprises:
carrying out target segmentation on the multi-frame first image, and acquiring positioning information and image characteristic information of the segmented first target;
establishing a motion trail model of the first target according to the positioning information and the image characteristic information of the first target in the multiple frames of first images;
acquiring the predicted position of the first target in the plurality of frames of second images according to the motion trail model;
and matching the position of the first target in the plurality of frames of second images with the predicted position to obtain the first matching result.
4. The method according to claim 3, wherein after the performing of target segmentation on the plurality of frames of first images, the method further comprises:
acquiring, by using preset shape feature information, a target in the plurality of frames of first images whose color difference from the background is smaller than a preset threshold.
5. The method of claim 1, wherein the matching first sound data in a first time period before a first time in the first time range with second sound data in a first time period after the first time to obtain a second matching result comprises:
acquiring a first scene corresponding to the first sound data and a second scene corresponding to the second sound data by identifying acoustic information and semantic information in the first sound data and the second sound data;
and matching the first scene with the second scene to obtain a second matching result.
6. The method according to claim 1, wherein the acquiring of the title end time or the tail start time of the target video according to the first matching result and the second matching result comprises:
acquiring a second time range corresponding to at least one frame of second image of which the first matching result is smaller than a first matching threshold;
obtaining a third time range corresponding to at least one second sound data of which the second matching result is smaller than a second matching threshold;
and acquiring the title end time or the tail start time based on the second time range and the third time range.
7. The method according to claim 6, wherein the acquiring of the title end time or the tail start time based on the second time range and the third time range comprises:
acquiring the title end time or the tail start time according to the overlap time of the second time range and the third time range.
8. A video title and tail identification apparatus, comprising:
the first acquisition module is used for acquiring a first time range of scene switching of a target video according to the identification information of the target video; wherein the identification information comprises at least one of sound, image and character identification information;
the second obtaining module is used for matching a plurality of frames of first images in a first time length before the first time in the first time range with a plurality of frames of second images in the first time length after the first time to obtain a first matching result;
a third obtaining module, configured to match first sound data in a first duration before a first time in the first time range with second sound data in the first duration after the first time, and obtain a second matching result;
and the fourth obtaining module is used for obtaining the title end time or the tail start time of the target video according to the first matching result and the second matching result.
9. A video title and tail identification device, comprising a transceiver, a processor, a memory, and a program or instructions stored on the memory and executable on the processor; wherein the processor, when executing the program or instructions, implements the steps in the video title and tail identification method according to any one of claims 1 to 7.
10. A readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the steps in the video title and tail identification method according to any one of claims 1 to 7.
CN202210692408.4A 2022-06-17 2022-06-17 Video title and tail identification method, device and equipment and readable storage medium Pending CN115035453A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210692408.4A CN115035453A (en) 2022-06-17 2022-06-17 Video title and tail identification method, device and equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210692408.4A CN115035453A (en) 2022-06-17 2022-06-17 Video title and tail identification method, device and equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN115035453A 2022-09-09

Family

ID=83124790

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210692408.4A Pending CN115035453A (en) 2022-06-17 2022-06-17 Video title and tail identification method, device and equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN115035453A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116017048A (en) * 2022-12-28 2023-04-25 北京奇艺世纪科技有限公司 Method and device for identifying start position of tail, electronic equipment and storage medium
CN116017048B (en) * 2022-12-28 2024-06-04 北京奇艺世纪科技有限公司 Method and device for identifying start position of tail, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109117777B (en) Method and device for generating information
US11276407B2 (en) Metadata-based diarization of teleconferences
CN105677735B (en) Video searching method and device
CN110234037B (en) Video clip generation method and device, computer equipment and readable medium
US10108709B1 (en) Systems and methods for queryable graph representations of videos
CN110364146B (en) Speech recognition method, speech recognition device, speech recognition apparatus, and storage medium
US20220375225A1 (en) Video Segmentation Method and Apparatus, Device, and Medium
CN113365147B (en) Video editing method, device, equipment and storage medium based on music card point
CN113613065B (en) Video editing method and device, electronic equipment and storage medium
CN110650374A (en) Clipping method, electronic device, and computer-readable storage medium
CN110751224A (en) Training method of video classification model, video classification method, device and equipment
CN110347866B (en) Information processing method, information processing device, storage medium and electronic equipment
CN110619035B (en) Method, device, equipment and storage medium for identifying keywords in interview video
CN109785846B (en) Role recognition method and device for mono voice data
CN113806500B (en) Information processing method, device and computer equipment
CN111222397A (en) Drawing book identification method and device and robot
CN113642536B (en) Data processing method, computer device and readable storage medium
CN112488222B (en) Crowdsourcing data labeling method, system, server and storage medium
CN112163074A (en) User intention identification method and device, readable storage medium and electronic equipment
US20220157322A1 (en) Metadata-based diarization of teleconferences
CN115035453A (en) Video title and tail identification method, device and equipment and readable storage medium
CN114051154A (en) News video strip splitting method and system
CN113949828A (en) Video editing method and device, electronic equipment and storage medium
CN112992148A (en) Method and device for recognizing voice in video
CN113194333B (en) Video editing method, device, equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination