CN115035453A - Video title and tail identification method, device and equipment and readable storage medium - Google Patents

Video title and tail identification method, device and equipment and readable storage medium

Info

Publication number
CN115035453A
Authority
CN
China
Prior art keywords
time
matching
sound data
target
matching result
Prior art date
Legal status
Pending
Application number
CN202210692408.4A
Other languages
Chinese (zh)
Inventor
张楠
冯海洋
李征
张晓迪
孙方明
***
Current Assignee
Beijing Shida Technology Co ltd
Original Assignee
Beijing Shida Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Shida Technology Co ltd filed Critical Beijing Shida Technology Co ltd
Priority to CN202210692408.4A
Publication of CN115035453A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V10/422 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation for representing the structure of the pattern or shape of an object therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/56 Extraction of image or video features relating to colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/48 Matching video sequences
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The invention provides a video title and tail identification method, apparatus, device and readable storage medium, relating to the technical field of video identification. The method comprises the following steps: acquiring a first time range of scene switching of a target video according to identification information of the target video, wherein the identification information comprises at least one of sound, image and character identification information; matching multiple frames of first images within a first duration before a first time in the first time range with multiple frames of second images within the first duration after the first time to obtain a first matching result; matching first sound data within the first duration before the first time with second sound data within the first duration after the first time to obtain a second matching result; and acquiring the title end time or the tail start time of the target video according to the first matching result and the second matching result. The scheme of the invention can identify the title and tail of a video efficiently and accurately.

Description

Video title and tail identification method, device and equipment and readable storage medium
Technical Field
The invention belongs to the technical field of video identification, and in particular relates to a video title and tail identification method, device, equipment and readable storage medium.
Background
Existing video playing software can offer users the option to skip the title and tail of a video, which improves the user experience and saves watching time. However, current video title and tail identification methods generally rely on manual labeling or video fingerprint technology.
Manual labeling consumes a large amount of manpower: annotators must watch each video and manually mark the time points of the title and the tail, so identification is inefficient and costly. Video fingerprint technology detects video fingerprints only within a preset detection time window and then determines the title and tail time points from the detection result; because it depends entirely on the video fingerprints, the fingerprint detection result strongly affects identification accuracy.
Disclosure of Invention
The embodiment of the invention aims to provide a video title and tail identification method, apparatus, device and readable storage medium, so as to solve the problems of low efficiency and low accuracy in identifying video titles and tails in the prior art.
In order to achieve the above object, an embodiment of the present invention provides a video title and tail identification method, including:
acquiring a first time range of scene switching of a target video according to identification information of the target video; wherein the identification information comprises at least one of sound, image and character identification information;
matching a plurality of frames of first images in a first time length before the first time in the first time range with a plurality of frames of second images in the first time length after the first time to obtain a first matching result;
matching first sound data in a first time length before a first time in the first time range with second sound data in the first time length after the first time to obtain a second matching result;
and acquiring the title end time or the tail start time of the target video according to the first matching result and the second matching result.
Optionally, in the video title and tail identification method, the acquiring of the first time range of scene switching of the target video according to the identification information of the target video includes:
dividing third sound data within a second duration at the beginning of the target video into a plurality of sub sound data, or dividing third sound data within a second duration at the end of the target video into a plurality of sub sound data;
acquiring, at a preset interval frame number, multiple frames of images from the video segment within the second duration at the beginning of the target video, or from the video segment within the second duration at the end of the target video;
acquiring a first time range of scene switching of the target video based on at least one of the following items:
a degree of scene matching between the sound identification information of the plurality of sub-sound data;
scene matching degree among the image identification information of the multi-frame images;
and the scene matching degree among the character recognition information of the multi-frame images.
Optionally, in the video title and tail identification method, the matching of multiple frames of first images within a first duration before the first time in the first time range with multiple frames of second images within the first duration after the first time to obtain a first matching result includes:
carrying out target segmentation on the multi-frame first image, and acquiring positioning information and image characteristic information of the segmented first target;
establishing a motion trail model of the first target according to the positioning information and the image characteristic information of the first target in the multiple frames of first images;
acquiring the predicted position of the first target in the plurality of frames of second images according to the motion trail model;
and matching the position of the first target in the multiple frames of second images with the predicted position to obtain the first matching result.
Optionally, after the target segmentation is performed on the multiple frames of first images, the video title and tail identification method further includes:
acquiring, by using preset shape feature information, a target in the multiple frames of first images whose color difference from the background is smaller than a preset threshold.
Optionally, in the video title and tail identification method, the matching of first sound data within a first duration before a first time in the first time range with second sound data within the first duration after the first time to obtain a second matching result includes:
acquiring a first scene corresponding to the first sound data and a second scene corresponding to the second sound data by identifying acoustic information and semantic information in the first sound data and the second sound data;
and matching the first scene with the second scene to obtain a second matching result.
Optionally, in the video title and tail identification method, the acquiring of the title end time or the tail start time of the target video according to the first matching result and the second matching result includes:
acquiring a second time range corresponding to at least one frame of second image of which the first matching result is smaller than a first matching threshold;
obtaining a third time range corresponding to at least one second sound data of which the second matching result is smaller than a second matching threshold;
and acquiring the title end time or the tail start time based on the second time range and the third time range.
Optionally, in the video title and tail identification method, the acquiring of the title end time or the tail start time based on the second time range and the third time range includes:
acquiring the title end time or the tail start time according to the overlap time of the second time range and the third time range.
In order to achieve the above object, an embodiment of the present invention further provides a video title and tail identification apparatus, including:
the first acquisition module is used for acquiring a first time range of scene switching of a target video according to the identification information of the target video; wherein the identification information comprises at least one of sound, image and character identification information;
the second obtaining module is used for matching a plurality of frames of first images in a first time length before the first time in the first time range with a plurality of frames of second images in the first time length after the first time to obtain a first matching result;
a third obtaining module, configured to match first sound data in a first duration before a first time in the first time range with second sound data in the first duration after the first time, and obtain a second matching result;
and the fourth acquisition module is used for acquiring the title end time or the tail start time of the target video according to the first matching result and the second matching result.
In order to achieve the above object, an embodiment of the present invention further provides a video title and tail identification device, including a transceiver, a processor, a memory, and a program or instructions stored in the memory and executable on the processor; when executing the program or the instructions, the processor implements the steps in the video title and tail identification method described in any one of the above.
In order to achieve the above object, an embodiment of the present invention further provides a readable storage medium for storing a computer program which, when executed by a processor, implements the steps in the video title and tail identification method described above.
The technical scheme of the invention at least has the following beneficial effects:
In the above scheme of the embodiment of the present invention, the first time range in which a scene switch occurs in the target video is acquired according to the identification information of the target video; multiple frames of first images within a first duration before a first time in the first time range are matched with multiple frames of second images within the first duration after the first time to obtain a first matching result; first sound data within the first duration before the first time are matched with second sound data within the first duration after the first time to obtain a second matching result; and the title end time or the tail start time of the target video is acquired according to the first matching result and the second matching result. That is, the title end time or the tail start time is analyzed and identified through both image identification and sound identification technologies, which gives the method high identification efficiency and high identification accuracy.
Drawings
Fig. 1 is a flowchart of a video title and tail identification method according to an embodiment of the present invention;
fig. 2 is a schematic application diagram of a video title and tail identification method according to an embodiment of the present invention;
fig. 3 is a structural diagram of a video title and tail identification apparatus according to an embodiment of the present invention;
fig. 4 is a structural diagram of a video title and tail identification device according to an embodiment of the present invention.
Detailed Description
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
In various embodiments of the present invention, it should be understood that the sequence numbers of the following processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In addition, the terms "system" and "network" are often used interchangeably herein.
In the embodiments provided herein, it should be understood that "B corresponding to A" means that B is associated with A, and that B can be determined from A. It should also be understood that determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information.
Referring to fig. 1, fig. 1 is a flowchart of a video title and tail identification method according to an embodiment of the present invention. As shown in fig. 1, the method includes the following steps:
step 101, acquiring a first time range of scene switching of a target video according to identification information of the target video; wherein the identification information comprises at least one of sound, image and character identification information;
In this step, a feature vector can be formed from the input sound identification information, image identification information and character identification information; the feature vector is then input into a supervised deep learning model to determine the scene of the target video and to output the first time range in which a scene switch occurs. The sound identification information may include music, human dialogue, voice-over and the like; the image identification information may include people, places, environments and the like. The first time range of scene switching of the target video can thus be identified automatically by the deep learning model, which improves identification efficiency. The deep learning model is trained in advance.
It should be noted that this step may yield two first time ranges of scene switching of the target video, corresponding to the title and the tail, respectively.
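As an illustration of this step, the following is a minimal sketch, assuming a pre-trained classifier (here a hypothetical `scene_model` with an sklearn-style `predict_proba` interface) that scores whether two adjacent analysis windows belong to the same scene; the feature vectors stand in for the sound, image and character identification information described above, and all names are illustrative rather than taken from the patent.

```python
import numpy as np

def find_scene_switch_range(windows, scene_model, threshold=0.5):
    """Return the candidate time range whose adjacent windows fail to match.

    `windows` is a list of (start_time, end_time, feature_vector) tuples built
    from the sound, image and character identification information of the video;
    `scene_model` is assumed to expose an sklearn-style predict_proba().
    """
    boundaries = []
    for (s0, e0, f0), (s1, e1, f1) in zip(windows, windows[1:]):
        pair = np.concatenate([f0, f1]).reshape(1, -1)
        same_scene = scene_model.predict_proba(pair)[0, 1]  # P(same scene)
        if same_scene < threshold:          # a scene switch lies between the windows
            boundaries.append((e0, s1))
    if not boundaries:
        return None
    # merge all low-match boundaries into one candidate "first time range"
    return (min(t for t, _ in boundaries), max(t for _, t in boundaries))
```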
102, matching multiple frames of first images in a first time length before a first time in the first time range with multiple frames of second images in the first time length after the first time to obtain a first matching result;
In this step, image processing is performed within the first time range obtained in step 101 to obtain a matching result between the frame images before and after the first time, so that the time corresponding to a frame image whose preceding and following images do not match can be located.
103, matching first sound data in a first time length before a first time in the first time range with second sound data in the first time length after the first time to obtain a second matching result;
In this step, sound processing is performed within the first time range obtained in step 101 to obtain a matching result between the sound data before and after the first time, so that the time corresponding to sound data whose preceding and following sounds do not match can be located.
104, acquiring the title end time or the tail start time of the target video according to the first matching result and the second matching result.
It should be noted that, in the embodiment of the present invention, based on the first matching result and the second matching result, the scene recognition result obtained by image processing of the multiple frames of first images and second images and the scene recognition result obtained by sound processing of the first sound data and the second sound data are acquired; these two scene recognition results are then input into a comparison model, which outputs the title end time or the tail start time.
In the embodiment of the invention, the first time range of scene switching of the target video is acquired according to the identification information of the target video; multiple frames of first images within a first duration before the first time in the first time range are matched with multiple frames of second images within the first duration after the first time to obtain a first matching result; first sound data within the first duration before the first time are matched with second sound data within the first duration after the first time to obtain a second matching result; and the title end time or the tail start time of the target video is acquired according to the two matching results. The method analyzes and identifies the title end time or the tail start time through both image identification and sound identification, and therefore achieves high identification efficiency and high identification accuracy.
In an optional embodiment of the present invention, step 101 includes:
dividing third sound data within a second duration at the beginning of the target video into a plurality of sub sound data, or dividing third sound data within a second duration at the end of the target video into a plurality of sub sound data;
acquiring, at a preset interval frame number, multiple frames of images from the video clip within the second duration at the beginning of the target video, or from the video clip within the second duration at the end of the target video;
acquiring a first time range of scene switching of the target video based on at least one of the following items:
a degree of scene matching between the sound identification information of the plurality of sub-sound data;
scene matching degree among the image identification information of the multi-frame images;
and the scene matching degree among the character recognition information of the multi-frame images.
The second duration is generally determined according to the typical duration of a video title or tail. The third sound data and the video clip within the second duration at the beginning of the target video are both used for identifying the first time range of the title of the target video; the third sound data and the video clip within the second duration at the end of the target video are both used for identifying the first time range of the tail of the target video.
The plurality of sub sound data or the multiple frames of images are divided into comparison groups, each group containing two sub sound data or two frames of images; each comparison group is then input into the deep learning model, which outputs a scene matching degree.
Based on the scene matching degree of at least one of the sound identification information, the image identification information and the character identification information, the time range corresponding to at least two sub sound data or at least two frames of images whose scene matching degree is smaller than a preset matching threshold is determined as the first time range.
The character recognition information is obtained based on character recognition of a frame image.
That is, this step 101 may be understood as acquiring the first time range by a video fingerprinting technique.
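For concreteness, below is a minimal sketch of the sound splitting and frame sampling described above, using OpenCV and NumPy; the second duration and interval frame number are illustrative values, not ones specified by the patent.

```python
import cv2
import numpy as np

def sample_frames(video_path, second_duration_s=120.0, interval_frames=25, from_start=True):
    """Grab every `interval_frames`-th frame inside the leading (or trailing)
    `second_duration_s` seconds of the video (the preset interval frame number)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    span = int(second_duration_s * fps)
    start = 0 if from_start else max(0, total - span)
    frames = []
    for idx in range(start, min(start + span, total), interval_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append((idx / fps, frame))  # (timestamp in seconds, image)
    cap.release()
    return frames

def split_sound(samples, n_segments):
    """Divide the third sound data into `n_segments` equal sub sound data."""
    return np.array_split(np.asarray(samples), n_segments)
```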
In an alternative embodiment of the present invention, step 102 includes:
carrying out target segmentation on the multi-frame first image, and acquiring positioning information and image characteristic information of the segmented first target;
establishing a motion trail model of the first target according to the positioning information and the image characteristic information of the first target in the multiple frames of first images;
according to the motion trail model, the predicted position of the first target in the multiple frames of second images is obtained;
and matching the position of the first target in the plurality of frames of second images with the predicted position to obtain the first matching result.
It should be noted that target segmentation extracts the subset of pixels in an image that represents a known target. Because the title or tail of the video is being identified here, the target that may appear in the title or tail, that is, the part of the pixels corresponding to the first target, is segmented from each frame of the multiple frames of first images.
Then, position detection is performed on the first target in the segmented first images, and the position and size of the first target in the first images are determined; that is, the positioning information and the image characteristic information of the first target are determined, so that the motion trail model of the first target can be established.
Further, the position of the first target in the multiple frames of second images is predicted according to the motion trail model, the actual position of the first target in the second images is identified, and the identified position is matched with the predicted position to obtain a matching value, namely the first matching result.
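A minimal sketch of this trajectory-based matching follows, under the assumption that the first target has already been segmented and its center located in every frame; a simple constant-velocity fit stands in for the motion trail model, which the patent does not specify further.

```python
import numpy as np

def fit_trajectory(times, centers):
    """Fit x(t) and y(t) as straight lines: a constant-velocity motion trail model."""
    t = np.asarray(times, dtype=float)
    c = np.asarray(centers, dtype=float)       # shape (n, 2): (x, y) target centers
    kx, bx = np.polyfit(t, c[:, 0], 1)
    ky, by = np.polyfit(t, c[:, 1], 1)
    return lambda tt: np.array([kx * tt + bx, ky * tt + by])

def first_matching_result(model, second_times, second_centers, tol_px=30.0):
    """Fraction of second images whose detected target lies within `tol_px`
    pixels of the predicted position: one way to express the first matching result."""
    hits = [np.linalg.norm(model(t) - np.asarray(c)) <= tol_px
            for t, c in zip(second_times, second_centers)]
    return sum(hits) / len(hits)
```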
In an optional embodiment of the present invention, after performing the target segmentation on the multiple frames of the first image, the method further includes:
and acquiring, by using preset shape feature information, a target in the multiple frames of first images whose color difference from the background is smaller than a preset threshold.
It should be noted that the segmented target may not be the target of interest in the title or tail (that is, not the first target), or the segmentation may be inaccurate because the color difference between a target and the background is smaller than the preset threshold. It is therefore necessary to use preset shape feature information to additionally locate, in the multiple frames of first images after target segmentation, targets whose color difference from the background is smaller than the preset threshold, thereby refining the segmentation result and selecting the first target from the multiple segmented targets.
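One possible realisation of this refinement is contour-based shape matching, sketched below with OpenCV; the Canny thresholds and the match threshold are assumptions for illustration. The idea is that edges can survive even where the color contrast with the background is small.

```python
import cv2

def find_low_contrast_target(frame_gray, shape_template, match_thresh=0.2):
    """Locate a target whose color barely differs from the background by
    matching contour shape (the preset shape feature information) instead of color."""
    edges = cv2.Canny(frame_gray, 30, 90)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    best = None
    for contour in contours:
        score = cv2.matchShapes(shape_template, contour, cv2.CONTOURS_MATCH_I1, 0.0)
        if score < match_thresh and (best is None or score < best[0]):
            best = (score, cv2.boundingRect(contour))
    return best  # (shape match score, bounding box) or None if nothing matches
```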
In an alternative embodiment of the present invention, step 103 includes:
acquiring a first scene corresponding to the first sound data and a second scene corresponding to the second sound data by identifying acoustic information and semantic information in the first sound data and the second sound data;
and matching the first scene with the second scene to obtain a second matching result.
It should be noted that, by recognizing the acoustic information and semantic information in the sound data, the human voice and the accompaniment are converted into text whose context is analyzed, so as to acquire the scene corresponding to the sound data.
The first scene corresponding to the first sound data is then matched with the second scene corresponding to the second sound data to obtain a matching value, namely the second matching result.
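A sketch of this sound-side comparison, assuming two off-the-shelf helpers are available (`transcribe(audio) -> str` wrapping any ASR engine and `embed(text) -> np.ndarray` wrapping any text-embedding model; both names are placeholders): the cosine similarity of the two scene descriptions serves as the second matching result.

```python
import numpy as np

def second_matching_result(first_audio, second_audio, transcribe, embed):
    """Compare the scenes of two sound clips via their transcripts."""
    # human voice and accompaniment -> text, per the acoustic/semantic recognition step
    text_a, text_b = transcribe(first_audio), transcribe(second_audio)
    va, vb = embed(text_a), embed(text_b)
    cos = np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-9)
    return float(cos)  # higher means the two scenes match better
```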
In an alternative embodiment of the present invention, step 104 includes:
acquiring a second time range corresponding to at least one frame of second image of which the first matching result is smaller than a first matching threshold;
obtaining a third time range corresponding to at least one second sound data of which the second matching result is smaller than a second matching threshold;
and acquiring the title end time or the tail start time based on the second time range and the third time range.
Wherein the first matching threshold and the second matching threshold are determined based on empirical or measured values.
The overlap time of the second time range and the third time range is analyzed and compared to obtain the title end time or the tail start time. Since the title start time is the start time of the target video and the tail end time is the end time of the target video, the time range of the title and the time range of the tail can then be determined.
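The fusion of the two ranges can be sketched as an interval intersection; reading the title end time off the end of the overlap (and the tail start time off its start) is one plausible interpretation of the overlap rule above.

```python
def fuse_time_ranges(second_range, third_range, is_title=True):
    """Intersect the low-match image time range and the low-match sound time range."""
    start = max(second_range[0], third_range[0])
    end = min(second_range[1], third_range[1])
    if start > end:
        return None                      # no overlap time: detection is inconclusive
    # the title ends where the overlap ends; the tail starts where the overlap starts
    return end if is_title else start
```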
It should be noted that, in the embodiment of the present invention, based on the first matching result and the second matching result, the scene recognition result obtained by image processing of the multiple frames of first images and second images and the scene recognition result obtained by sound processing of the first sound data and the second sound data may also be acquired, and these two scene recognition results may then be input into the comparison model to output the title end time or the tail start time.
Fig. 2 is a schematic application diagram of the video title and tail identification method according to an embodiment of the present invention. The application flow of the method is described below with reference to fig. 2.
Step 201, inputting a target video.
Step 202, performing video retrieval by using a video fingerprint technology, and determining the time range of the title or the tail, namely the first time range.
Step 203, extracting a plurality of frames of first images and a plurality of frames of second images in a first time range of the target video.
Step 204, extracting first sound data and second sound data in a first time range of the target video.
Step 205, target segmentation: segmenting the target in each frame of the first images and each frame of the second images.
Step 206, target detection: detecting the position and size of the target in the first images and the second images respectively, and determining the positioning information and image characteristic information of the target in the first images and the second images.
Step 207, target identification: qualitatively identifying the target in the second images based on the image characteristic information, and associating the target in the first images with the target in the second images.
Step 208, target tracking: constructing the motion trail model based on the positioning information of the first target in the first images, predicting the position of the first target in the second images according to the model, and matching the predicted position against the identified position of the first target in the second images.
Step 209, obtain the first matching result.
Step 210, respectively performing acoustic information identification on the first sound data and the second sound data.
In step 211, semantic information recognition is performed on the first sound data and the second sound data, respectively.
Step 212, based on the identified scene, a second matching result is obtained.
Step 213, inputting the first matching result and the second matching result into the comparison model.
Step 214, outputting the title end time or the tail start time.
In summary, the video title and tail identification method provided by the embodiment of the invention combines image identification and sound identification technologies, improves the efficiency and accuracy of identifying the title and tail of a video, saves a large amount of manual labeling work, and reduces cost.
As shown in fig. 3, an embodiment of the present invention further provides a video title and tail identification apparatus, including:
a first obtaining module 301, configured to obtain a first time range of scene switching of a target video according to identification information of the target video; wherein the identification information comprises at least one of sound, image and character identification information;
a second obtaining module 302, configured to match multiple frames of first images in a first duration before a first time in the first time range with multiple frames of second images in the first duration after the first time, and obtain a first matching result;
a third obtaining module 303, configured to match the first sound data in the first duration before the first time in the first time range with the second sound data in the first duration after the first time, and obtain a second matching result;
a fourth obtaining module 304, configured to obtain the title end time or the tail start time of the target video according to the first matching result and the second matching result.
In the embodiment of the invention, the first time range of scene switching of the target video is acquired according to the identification information of the target video; multiple frames of first images within a first duration before the first time in the first time range are matched with multiple frames of second images within the first duration after the first time to obtain a first matching result; first sound data within the first duration before the first time are matched with second sound data within the first duration after the first time to obtain a second matching result; and the title end time or the tail start time of the target video is acquired according to the two matching results. The apparatus analyzes and identifies the title end time or the tail start time through both image identification and sound identification, and therefore achieves high identification efficiency and high identification accuracy.
Optionally, in the video title and tail identification apparatus, the first obtaining module 301 is specifically configured to:
dividing third sound data within a second duration at the beginning of the target video into a plurality of sub sound data, or dividing third sound data within a second duration at the end of the target video into a plurality of sub sound data;
acquiring, at a preset interval frame number, multiple frames of images from the video segment within the second duration at the beginning of the target video, or from the video segment within the second duration at the end of the target video;
acquiring a first time range of scene switching of the target video based on at least one of the following items:
a degree of scene matching between the sound identification information of the plurality of sub-sound data;
scene matching degree among the image identification information of the multi-frame images;
and the scene matching degree among the character recognition information of the multi-frame images.
Optionally, in the video title and tail identification apparatus, the second obtaining module 302 is specifically configured to:
carrying out target segmentation on the multi-frame first image, and acquiring positioning information and image characteristic information of the segmented first target;
establishing a motion trail model of the first target according to the positioning information and the image characteristic information of the first target in the multiple frames of first images;
acquiring the predicted position of the first target in the plurality of frames of second images according to the motion trail model;
and matching the position of the first target in the multiple frames of second images with the predicted position to obtain the first matching result.
Optionally, the video title and tail identification apparatus further includes:
a fifth obtaining module, configured to acquire, by using preset shape feature information, a target in the multiple frames of first images whose color difference from the background is smaller than a preset threshold.
Optionally, in the video title and tail identification apparatus, the third obtaining module 303 is specifically configured to:
acquiring a first scene corresponding to the first sound data and a second scene corresponding to the second sound data by identifying acoustic information and semantic information in the first sound data and the second sound data;
and matching the first scene with the second scene to obtain a second matching result.
Optionally, in the video title and tail identification apparatus, the fourth obtaining module 304 includes:
the first obtaining unit is used for obtaining a second time range corresponding to at least one frame of second image of which the first matching result is smaller than a first matching threshold;
the second obtaining unit is used for obtaining a third time range corresponding to at least one second sound data of which the second matching result is smaller than a second matching threshold;
a third obtaining unit, configured to obtain the title end time or the tail start time based on the second time range and the third time range.
Optionally, in the video title and tail identification apparatus, the third obtaining unit is specifically configured to:
acquire the title end time or the tail start time according to the overlap time of the second time range and the third time range.
It should be noted that the apparatus provided in the embodiment of the present invention can implement all the method steps of the video title and tail identification method embodiment and achieve the same technical effects; the parts and beneficial effects identical to those of the method embodiment are not described herein again.
An embodiment of the present invention further provides a video title and tail identification device, as shown in fig. 4, including: a processor 401; and a memory 402 connected to the processor 401 through a bus interface, where the memory 402 is used for storing programs and data used by the processor 401 when executing operations, and the processor 401 calls and executes the programs and data stored in the memory 402.
The processor 401 is used for reading the program in the memory 402 and executing the following processes:
acquiring a first time range of scene switching of a target video according to identification information of the target video; wherein the identification information comprises at least one of sound, image and character identification information;
matching a plurality of frames of first images in a first time length before the first time in the first time range with a plurality of frames of second images in the first time length after the first time to obtain a first matching result;
matching first sound data in a first time length before a first time in the first time range with second sound data in the first time length after the first time to obtain a second matching result;
and acquiring the title end time or the tail start time of the target video according to the first matching result and the second matching result.
A transceiver 403 is coupled to the bus interface for receiving and transmitting data under the control of the processor 401.
In fig. 4, the bus architecture may include any number of interconnected buses and bridges, linking together one or more processors, represented by processor 401, and various memory circuits, represented by memory 402. The bus architecture may also link together various other circuits, such as peripherals, voltage regulators and power management circuits, which are well known in the art and therefore are not described further herein. The bus interface provides an interface. The transceiver 403 may comprise a number of elements, including a transmitter and a receiver, and provides a means for communicating with various other apparatus over a transmission medium. Depending on the user device, the user interface 404 may also be an interface capable of connecting to desired external devices, including but not limited to a keypad, a display, a speaker, a microphone, a joystick and the like.
The processor 401 is responsible for managing the bus architecture and general processing, and the memory 402 may store data used by the processor 401 in performing operations.
Optionally, the processor 401 is further configured to read the computer program and execute the following steps:
dividing third sound data within a second duration at the beginning of the target video into a plurality of sub sound data, or dividing third sound data within a second duration at the end of the target video into a plurality of sub sound data;
acquiring, at a preset interval frame number, multiple frames of images from the video clip within the second duration at the beginning of the target video, or from the video clip within the second duration at the end of the target video;
acquiring a first time range of scene switching of the target video based on at least one of the following items:
a degree of scene matching between the sound identification information of the plurality of sub-sound data;
scene matching degree among the image identification information of the multi-frame images;
and the scene matching degree among the character recognition information of the multi-frame images.
Optionally, the processor 401 is further configured to read the computer program and execute the following steps:
performing target segmentation on the multi-frame first image, and acquiring positioning information and image characteristic information of the segmented first target;
establishing a motion trail model of the first target according to the positioning information and the image characteristic information of the first target in the multiple frames of first images;
acquiring the predicted position of the first target in the plurality of frames of second images according to the motion trail model;
and matching the position of the first target in the plurality of frames of second images with the predicted position to obtain the first matching result.
Optionally, the processor 401 is further configured to read the computer program and execute the following steps:
and acquiring, by using preset shape feature information, a target in the multiple frames of first images whose color difference from the background is smaller than a preset threshold.
Optionally, the processor 401 is further configured to read the computer program and execute the following steps:
acquiring a first scene corresponding to the first sound data and a second scene corresponding to the second sound data by identifying acoustic information and semantic information in the first sound data and the second sound data;
and matching the first scene with the second scene to obtain a second matching result.
Optionally, the processor 401 is further configured to read the computer program and execute the following steps:
acquiring a second time range corresponding to at least one frame of second image of which the first matching result is smaller than a first matching threshold;
obtaining a third time range corresponding to at least one second sound data of which the second matching result is smaller than a second matching threshold;
and acquiring the title end time or the tail start time based on the second time range and the third time range.
Optionally, the processor 401 is further configured to read the computer program and execute the following steps:
and acquiring the title end time or the tail start time according to the overlap time of the second time range and the third time range.
It should be noted that the device provided in the embodiment of the present invention can implement all the method steps of the video title and tail identification method embodiment and achieve the same technical effects; the parts and beneficial effects identical to those of the method embodiment are not described herein again.
Those skilled in the art will understand that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program includes instructions for executing part or all of the steps of the above methods; and the program may be stored in a readable storage medium, which may be any form of storage medium.
An embodiment of the present invention further provides a readable storage medium storing a program which, when executed by a processor, implements the video title and tail identification method described in any one of the above.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be physically included alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to perform some steps of the transceiving method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a portable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other media capable of storing program codes.
While the preferred embodiments of the present invention have been described, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the following claims.

Claims (10)

1. A video title and tail identification method is characterized by comprising the following steps:
acquiring a first time range of scene switching of a target video according to identification information of the target video; wherein the identification information comprises at least one of sound, image and character identification information;
matching a plurality of frames of first images in a first time length before the first time in the first time range with a plurality of frames of second images in the first time length after the first time to obtain a first matching result;
matching first sound data in a first time length before a first time in the first time range with second sound data in the first time length after the first time to obtain a second matching result;
and acquiring the title end time or the tail start time of the target video according to the first matching result and the second matching result.
2. The method according to claim 1, wherein the obtaining a first time range of scene switching of the target video according to the identification information of the target video comprises:
dividing third sound data within a second duration at the beginning of the target video into a plurality of sub sound data, or dividing third sound data within a second duration at the end of the target video into a plurality of sub sound data;
acquiring, at a preset interval frame number, multiple frames of images from the video segment within the second duration at the beginning of the target video, or from the video segment within the second duration at the end of the target video;
acquiring a first time range of scene switching of the target video based on at least one of the following items:
a degree of scene matching between the sound identification information of the plurality of sub-sound data;
scene matching degree among the image identification information of the multi-frame images;
and the scene matching degree among the character recognition information of the multi-frame images.
3. The method according to claim 1, wherein the matching a plurality of first images in a first time period before a first time in the first time range with a plurality of second images in the first time period after the first time to obtain a first matching result comprises:
carrying out target segmentation on the multi-frame first image, and acquiring positioning information and image characteristic information of the segmented first target;
establishing a motion trail model of the first target according to the positioning information and the image characteristic information of the first target in the multiple frames of first images;
acquiring the predicted position of the first target in the plurality of frames of second images according to the motion trail model;
and matching the position of the first target in the plurality of frames of second images with the predicted position to obtain the first matching result.
4. The method according to claim 3, wherein after the performing of target segmentation on the plurality of frames of first images, the method further comprises:
acquiring, by using preset shape feature information, a target in the plurality of frames of first images whose color difference from the background is smaller than a preset threshold.
5. The method of claim 1, wherein the matching first sound data in a first time period before a first time in the first time range with second sound data in a first time period after the first time to obtain a second matching result comprises:
acquiring a first scene corresponding to the first sound data and a second scene corresponding to the second sound data by identifying acoustic information and semantic information in the first sound data and the second sound data;
and matching the first scene with the second scene to obtain a second matching result.
6. The method according to claim 1, wherein the acquiring of the title end time or the tail start time of the target video according to the first matching result and the second matching result comprises:
acquiring a second time range corresponding to at least one frame of second image of which the first matching result is smaller than a first matching threshold;
obtaining a third time range corresponding to at least one second sound data of which the second matching result is smaller than a second matching threshold;
and acquiring the title end time or the tail start time based on the second time range and the third time range.
7. The method according to claim 6, wherein the acquiring of the title end time or the tail start time based on the second time range and the third time range comprises:
acquiring the title end time or the tail start time according to the overlap time of the second time range and the third time range.
8. A video title and tail identification apparatus, comprising:
the first acquisition module is used for acquiring a first time range of scene switching of a target video according to the identification information of the target video; wherein the identification information comprises at least one of sound, image and character identification information;
the second obtaining module is used for matching a plurality of frames of first images in a first time length before the first time in the first time range with a plurality of frames of second images in the first time length after the first time to obtain a first matching result;
a third obtaining module, configured to match first sound data in a first duration before a first time in the first time range with second sound data in the first duration after the first time, and obtain a second matching result;
and the fourth obtaining module is used for obtaining the title end time or the tail start time of the target video according to the first matching result and the second matching result.
9. A video title and tail identification device, comprising a transceiver, a processor, a memory, and a program or instructions stored on the memory and executable on the processor; wherein the processor, when executing the program or instructions, implements the steps in the video title and tail identification method according to any one of claims 1 to 7.
10. A readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the steps in the video title and tail identification method according to any one of claims 1 to 7.
CN202210692408.4A 2022-06-17 2022-06-17 Video title and tail identification method, device and equipment and readable storage medium Pending CN115035453A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210692408.4A CN115035453A (en) 2022-06-17 2022-06-17 Video title and tail identification method, device and equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210692408.4A CN115035453A (en) 2022-06-17 2022-06-17 Video title and tail identification method, device and equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN115035453A 2022-09-09

Family

ID=83124790

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210692408.4A Pending CN115035453A (en) 2022-06-17 2022-06-17 Video title and tail identification method, device and equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN115035453A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116017048A (en) * 2022-12-28 2023-04-25 北京奇艺世纪科技有限公司 Method and device for identifying start position of tail, electronic equipment and storage medium
CN116017048B (en) * 2022-12-28 2024-06-04 北京奇艺世纪科技有限公司 Method and device for identifying start position of tail, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109117777B (en) Method and device for generating information
US11276407B2 (en) Metadata-based diarization of teleconferences
CN105677735B (en) Video searching method and device
CN110234037B (en) Video clip generation method and device, computer equipment and readable medium
US10108709B1 (en) Systems and methods for queryable graph representations of videos
CN110364146B (en) Speech recognition method, speech recognition device, speech recognition apparatus, and storage medium
US20220375225A1 (en) Video Segmentation Method and Apparatus, Device, and Medium
CN113365147B (en) Video editing method, device, equipment and storage medium based on music card point
CN113613065B (en) Video editing method and device, electronic equipment and storage medium
CN110650374A (en) Clipping method, electronic device, and computer-readable storage medium
CN110751224A (en) Training method of video classification model, video classification method, device and equipment
CN110347866B (en) Information processing method, information processing device, storage medium and electronic equipment
CN110619035B (en) Method, device, equipment and storage medium for identifying keywords in interview video
CN109785846B (en) Role recognition method and device for mono voice data
CN113806500B (en) Information processing method, device and computer equipment
CN111222397A (en) Drawing book identification method and device and robot
CN113642536B (en) Data processing method, computer device and readable storage medium
CN112488222B (en) Crowdsourcing data labeling method, system, server and storage medium
CN112163074A (en) User intention identification method and device, readable storage medium and electronic equipment
US20220157322A1 (en) Metadata-based diarization of teleconferences
CN115035453A (en) Video title and tail identification method, device and equipment and readable storage medium
CN114051154A (en) News video strip splitting method and system
CN113949828A (en) Video editing method and device, electronic equipment and storage medium
CN112992148A (en) Method and device for recognizing voice in video
CN113194333B (en) Video editing method, device, equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination