WO2019144840A1 - Method and apparatus for acquiring video semantic information

Method and apparatus for acquiring video semantic information

Info

Publication number: WO2019144840A1
Authority: WO (WIPO, PCT)
Prior art keywords: video, determining, semantic information, information, frames
Prior art date: 2018-01-25
Application number: PCT/CN2019/072219
Other languages: French (fr), Chinese (zh)
Inventors: 罗江春, 陈锡岩
Original assignee: 北京一览科技有限公司
Priority date: 2018-01-25
Filing date
Publication date
Application filed by: 北京一览科技有限公司
Publication of: WO2019144840A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata automatically derived from the content
    • G06F16/7837 Retrieval characterised by using metadata automatically derived from the content, using objects detected or recognised in the video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata automatically derived from the content
    • G06F16/7844 Retrieval characterised by using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data

Definitions

  • the present invention relates to the field of video technologies, and in particular, to a technique for acquiring video semantic information.
  • Currently, methods for obtaining video content mainly include obtaining it from a video introduction, or obtaining it by analyzing the video content.
  • The former relies on the video synopsis, which covers limited content and cannot reflect the specific details of the video; the latter mainly performs character recognition and emotion recognition on the video pictures, so the recovered video information is limited and cannot fully restore the specific semantic information corresponding to the video.
  • a method for acquiring video semantic information comprises the following steps:
  • the step of extracting one or more video frames in the video comprises:
  • a plurality of video frames in the video are extracted, wherein the plurality of video frames are contiguous.
  • the method further includes:
  • the step of extracting multiple video frames in the video includes:
  • the step of extracting one or more video frames in the video comprises:
  • the triggering condition includes at least one of the following:
  • triggering according to the play duration of the video;
  • triggering according to the play time point of the video;
  • triggering according to one or more play contents of the video.
  • the step of determining the visual object included in the video frame comprises:
  • performing target extraction on the video frame, in combination with video-related information of the video, to determine the visual objects contained in the video frame.
  • the step of determining scenario information corresponding to the one or more video frames includes:
  • the step of determining an object feature corresponding to each of the visual objects includes:
  • the step of determining video semantic information corresponding to the video frame of the video includes:
  • the method further includes:
  • the step of determining video semantic information corresponding to the video frame of the video includes:
  • the method further includes:
  • the video retrieval sequence is matched with the video semantic information to determine a target video corresponding to the video retrieval sequence.
  • a processing device for acquiring video semantic information includes:
  • means for extracting one or more video frames in the video is used to:
  • a plurality of video frames in the video are extracted, wherein the plurality of video frames are contiguous.
  • the processing device further includes:
  • the means for extracting a plurality of video frames in the video is used to:
  • means for extracting one or more video frames in the video is used to:
  • the triggering condition includes at least one of the following:
  • triggering according to the play duration of the video;
  • triggering according to the play time point of the video;
  • triggering according to one or more play contents of the video.
  • means for determining a visual object contained in the video frame is for:
  • performing target extraction on the video frame, in combination with video-related information of the video, to determine the visual objects contained in the video frame.
  • the apparatus for determining scenario information corresponding to the one or more video frames includes:
  • a unit for determining an object feature corresponding to each of the visual objects is used to:
  • means for determining video semantic information corresponding to the video frame of the video is used to:
  • the processing device further includes:
  • the means for determining video semantic information corresponding to the video frame of the video is used for:
  • the processing device further includes:
  • a computer-readable storage medium is provided, wherein the storage medium stores computer-readable instructions which, when executed by one or more devices, cause the devices to perform the method described above.
  • a system for acquiring video semantic information is provided, the system comprising a memory and a processor, wherein the memory stores computer-readable instructions, and when the computer-readable instructions are executed by the processor, the processor performs the method described above.
  • Compared with the prior art, the present invention performs target extraction on video frames in a video to determine the visual objects contained in the video frames, then determines the scene information corresponding to the one or more video frames according to the object features corresponding to the visual objects, and finally determines, according to the scene information, the video semantic information corresponding to the video frames of the video; thus the invention can automatically obtain detailed and complete video semantic information from the video itself, saving the large amount of human effort otherwise needed to acquire video semantics, and the acquired video semantic information facilitates subsequent analysis and retrieval of the video.
  • the present invention is also capable of analyzing consecutive video frames, or consecutive video frames corresponding to the same scene, thereby making the acquired video semantic information more complete and accurate.
  • the present invention can also extract one or more video frames from the video based on different trigger conditions, enabling targeted extraction of video frames; this supports fully automatic video semantic analysis, facilitates targeted analysis and acquisition of video semantic information, improves processing efficiency, and saves substantial human resources.
  • the present invention is also capable of combining the video related information of the video to determine the visual object contained in the video frame, thereby making the determined visual object more accurate, and further improving the accuracy of the obtained video semantic information.
  • the present invention is also capable of determining the scene information corresponding to the one or more video frames according to the association information between the object features corresponding to each visual object; further, the object feature corresponding to each visual object can be determined according to the object attributes of that visual object.
  • the present invention improves the accuracy of the acquired scene information and further improves the accuracy of the obtained video semantic information.
  • the present invention is further capable of determining video semantic information corresponding to a video frame of the video according to the scene information, in combination with the voice and/or subtitle information, thereby improving accuracy of the acquired video semantic information.
  • FIG. 1 shows a schematic diagram of a processing device for acquiring video semantic information, in accordance with an aspect of the present invention
  • FIG. 2 shows a flow chart of a method for acquiring video semantic information in accordance with another aspect of the present invention.
  • processing device refers to a smart electronic device that can perform predetermined processing such as numerical calculations and/or logical calculations by running a predetermined program or instruction.
  • such a device may include a processor and a memory, where the processor executes instructions pre-stored in the memory to perform a predetermined process, or the predetermined process is performed by hardware such as an ASIC, FPGA, or DSP, or by a combination of the two.
  • the computer device includes a user device and/or a network device.
  • the user equipment includes, but is not limited to, a computer, a smart phone, a PDA, etc.
  • the network device includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud of servers based on cloud computing, where cloud computing is a type of distributed computing: a virtual supercomputer composed of a group of loosely coupled computers.
  • the computer device can be operated separately to implement the present invention, and can also access the network and implement the present invention by interacting with other computer devices in the network.
  • the network in which the computer device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a VPN network, and the like.
  • the "processing device" described in the present invention may be only a network device, that is, a network device performs a corresponding operation; in a special case, it may also be a user device and The network device or the server is integrated, that is, the user equipment cooperates with the network device to perform a corresponding operation, for example, the user equipment sends an instruction to the network device to instruct the network device to start performing the corresponding operation of “acquiring the video semantic information”.
  • the user device, network device, network, and the like mentioned above are merely examples; other existing or future computer devices or networks, where applicable to the present invention, also fall within the scope of the present invention and are incorporated herein by reference.
  • FIG. 1 shows a schematic diagram of a processing device for acquiring video semantic information in accordance with an aspect of the present invention
  • the processing device includes: a means for extracting one or more video frames from a video (hereinafter referred to as the "first device 1"); a means for performing target extraction on the video frames to determine the visual objects contained in the video frames (hereinafter referred to as the "second device 2"); a means for determining, according to the object features corresponding to the visual objects, the scene information corresponding to the one or more video frames (hereinafter referred to as the "third device 3"); and a means for determining, according to the scene information, the video semantic information corresponding to the video frames of the video (hereinafter referred to as the "fourth device 4").
  • the first device 1 extracts one or more video frames in the video.
  • Specifically, the first device 1 extracts one or more video frames from the video to be analyzed, by means of automatic extraction or manual designation.
  • The extracted video frames may be one or more, and may be continuous or discontinuous.
  • the subsequent device performs video semantic information analysis on the extracted video frame.
  • Optionally, the first device 1 extracts a plurality of video frames from the video, wherein the plurality of video frames are continuous. That is, the first device 1 can extract multiple consecutive video frames from the same video; here, "consecutive video frames" means video frames whose adjacent playing times fall within a certain threshold of one another. After the first device 1 extracts the consecutive video frames, the subsequent devices perform video semantic analysis on them.
  • The analysis can treat the consecutive video frames as a whole, obtaining one piece of video semantic information for the multiple consecutive frames; or it can analyze the consecutive video frames separately, obtaining multiple pieces of video semantic information for them.
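By way of illustration only, the sketch below shows one way such consecutive-frame extraction could look in practice; it is not the patent's own implementation, and the `start_s`, `count`, and `max_gap_s` parameters are assumed values.

```python
import cv2  # OpenCV for video decoding

def extract_consecutive_frames(video_path, start_s=0.0, count=10, max_gap_s=0.2):
    """Grab `count` frames whose adjacent timestamps are at most
    `max_gap_s` seconds apart, i.e. "consecutive" in the sense above."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(1, int(fps * max_gap_s))          # keep every `step`-th frame
    cap.set(cv2.CAP_PROP_POS_MSEC, start_s * 1000.0)
    frames, i = [], 0
    while len(frames) < count:
        ok, frame = cap.read()
        if not ok:                               # end of video
            break
        if i % step == 0:
            frames.append(frame)
        i += 1
    cap.release()
    return frames
```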
  • the processing device further includes means for scene segmentation of the video (hereinafter referred to as "fifth device", not shown).
  • the fifth device performs scene segmentation on the video.
  • the first device 1 then extracts, according to the scenes corresponding to the video, multiple video frames from the video, where the multiple video frames are continuous and correspond to the same scene.
  • the fifth device performs scene segmentation on the video according to its time, color, person changes, and the like.
  • For example, by default the video may be divided into one scene per minute, so that each minute of video constitutes one scene; or scene segmentation may be based on color changes in the video, such as a switch from cool tones to warm tones; or on person changes in the video, e.g. when the picture changes from two people to three, the scene is considered to have changed, and segmentation is performed accordingly.
  • Then the first device 1 performs video frame extraction on the segmented video scene by scene; in other words, the first device 1 may extract video frames from at least one of the multiple scenes, and the extracted video frames are continuous and correspond to that scene.
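The following is a minimal sketch of the color-change variant of scene segmentation described above, using HSV histogram correlation between sampled frames; the sampling interval and cut threshold are assumptions, and time- or person-based segmentation would use the same loop with a different change test.

```python
import cv2

def segment_scenes_by_color(video_path, sample_every=25, cut_threshold=0.5):
    """Split a video into (start_frame, end_frame) scenes wherever the
    colour histogram of sampled frames changes sharply, e.g. a switch
    from cool tones to warm tones."""
    cap = cv2.VideoCapture(video_path)
    scenes, scene_start, prev_hist, idx = [], 0, None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:
            hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
            hist = cv2.calcHist([hsv], [0, 1], None, [16, 16],
                                [0, 180, 0, 256])
            cv2.normalize(hist, hist)
            if prev_hist is not None:
                sim = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
                if sim < cut_threshold:      # large colour change: new scene
                    scenes.append((scene_start, idx - 1))
                    scene_start = idx
            prev_hist = hist
        idx += 1
    cap.release()
    scenes.append((scene_start, max(idx - 1, 0)))
    return scenes
```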
  • the first device 1 extracts one or more video frames in the video when one or more trigger conditions are met.
  • Specifically, the first device 1 may monitor the video in real time or based on events, and determine whether the current video meets a trigger condition; when a trigger condition is met, the first device 1 extracts one or more video frames from the video.
  • the triggering condition includes at least one of the following:
  • Triggering according to the play duration of the video: for example, frame extraction is performed when the video has played for five minutes. If a user starts playing from the beginning of the video (0 minutes 0 seconds), frames are extracted at the fifth minute of playback (5 minutes 0 seconds); if a user starts playing from the tenth minute of the video (10 minutes 0 seconds), frames are extracted at the fifteenth minute (15 minutes 0 seconds).
  • Triggering according to the play time point of the video: for example, frame extraction is triggered at the third, fifth, and seventh minutes of the video.
  • Triggering according to one or more play contents of the video: the play contents include, but are not limited to, voice, persons, items, patterns, and the like.
  • For example, when a voice appears in the video, frame extraction is triggered; when the character "Cao Cao" appears in the picture, frame extraction is triggered; when the "Coca-Cola" pattern appears in the video, frame extraction is triggered; when the "CCTV" logo appears in the video, frame extraction is triggered; and so on.
  • When frame extraction is triggered, the first device 1 may determine parameters such as how many video frames to extract and the video length covered by the extraction according to default settings; it may also determine the extraction parameters according to the trigger condition, for example the extraction frequency (how many frames per second), the extraction count (how many frames in total), the video length covered (e.g., after extraction starts at some time point, at which time point it stops), which video frames are extracted, and the like.
  • Different trigger conditions may correspond to different parameter settings; for example, different playback contents that trigger frame extraction may correspond to different extraction parameter settings.
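As a sketch of the "different trigger conditions, different parameter settings" idea, the table below maps each trigger type to its own extraction parameters; every name and value here is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ExtractionParams:
    freq_fps: float   # extraction frequency: frames taken per second
    count: int        # extraction count: how many frames in total
    span_s: float     # video length covered from the trigger point

# Hypothetical per-trigger settings; a deployment would tune these.
TRIGGER_PARAMS = {
    "play_duration":   ExtractionParams(freq_fps=1.0, count=10, span_s=10.0),
    "play_time_point": ExtractionParams(freq_fps=2.0, count=20, span_s=10.0),
    "play_content":    ExtractionParams(freq_fps=5.0, count=50, span_s=10.0),
}

DEFAULT_PARAMS = ExtractionParams(freq_fps=1.0, count=10, span_s=10.0)

def params_for(trigger: str) -> ExtractionParams:
    """Look up the settings for a trigger, falling back to the defaults."""
    return TRIGGER_PARAMS.get(trigger, DEFAULT_PARAMS)
```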
  • the second device 2 performs target extraction on the video frame to determine a visual object included in the video frame.
  • Specifically, the second device 2 performs image processing on the video frame based on various image-processing algorithms, such as texture extraction and color analysis, to extract the targets contained in the video frame; it then compares each extracted target with one or more object models to determine the visual object corresponding to the target, thereby determining the visual objects contained in the video frame.
  • the visual objects include, but are not limited to, people, objects, backgrounds, icons, and the like.
  • the object model may be pre-set, or may be continuously learned and acquired based on a machine learning method.
  • Optionally, the second device 2 may perform target extraction on the video frame in combination with the video-related information of the video, to determine the visual objects contained in the video frame.
  • the video related information includes, but is not limited to, a video content introduction, a video main character profile, a video related search result, a video producer/author, and the like.
  • Specifically, the second device 2 performs image processing on the video frame based on various image-processing algorithms, such as texture extraction and color analysis, to extract the targets contained in the video frame. Then, it either matches each extracted target against one or more object models associated with the video-related information, to determine the visual object corresponding to the target; or it first matches the extracted target against one or more general object models and, once the target's category is determined (such as person, item, background, or icon), matches the target against the specific models in that category associated with the video-related information, thereby determining the visual objects contained in the video frame.
  • For example, the extracted target (typically a person-like target) is matched with the image model of each actor in the cast list, to determine whether the extracted target is that actor; alternatively, the extracted target is first matched with a general model to determine that its category is "person", and the person target is then matched with the image models of the video's actors to determine whether it is a given actor.
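A sketch of the two-stage matching just described (generic category first, then cast-specific models); `detector`, `generic_model`, and `cast_models` stand in for whatever detector and classifiers an implementation actually uses, and the 0.8 acceptance score is an assumption.

```python
def identify_visual_objects(frame, detector, generic_model, cast_models):
    """Determine the visual objects in one frame using video-related
    information (here: per-actor models built from the cast list)."""
    objects = []
    for target in detector(frame):                 # raw target crops
        category = generic_model.classify(target)  # "person", "item", ...
        label = category
        if category == "person":
            # Second stage: match the person target against each actor
            # model derived from the video's cast list.
            best_actor, best_score = None, 0.0
            for actor, model in cast_models.items():
                score = model.match(target)        # similarity in [0, 1]
                if score > best_score:
                    best_actor, best_score = actor, score
            if best_score > 0.8:                   # assumed threshold
                label = best_actor
        objects.append({"label": label, "category": category})
    return objects
```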
  • the third device 3 determines the scene information corresponding to the one or more video frames according to the object features corresponding to the visual object.
  • the third device 3 determines an object feature corresponding to the visual object by analyzing an image feature of the visual object.
  • the object features include, but are not limited to, actions, emotions, colors, positions, and the like; further, for multiple visual objects within the same video frame, the object features also include the interactions or associations between those visual objects.
  • Then the third device 3 analyzes the interrelationships among the object features to determine the scene information corresponding to the one or more video frames.
  • the determination of the scene information may be performed based on a preset object feature model, where the object feature model includes a mapping relationship between multiple object features or a combination thereof and different scene information.
  • the scene information includes, but is not limited to, information such as humor, horror, hilarity, pleasure, and the like that express a scene state or a scene atmosphere.
  • For example, suppose the second device 2 determines that the visual objects are "sofa" and "Wang Leehom"; through the third device 3's analysis of the visual objects, the object feature of the "sofa" is "color is beige", and the object features of "Wang Leehom" are "sitting" and "smiling". Based on the mapping relationships of "sofa", "beige", "sitting", "smiling", and "Wang Leehom (person)" in the object feature model, and on analyses such as "color emotion" and "expression emotion", the third device 3 determines that the scene information is "pleasant".
  • When the third device 3 analyzes the object features, it may first analyze them pairwise, or match pairwise combinations of object features against the mapping relationships in the object feature model; in this way, after multiple rounds of analysis or matching, the final scene information is obtained. Alternatively, the third device 3 can analyze and match all object features directly, thereby obtaining the scene information in one step.
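A toy sketch of the round-by-round pairwise matching against the object feature model; the mapping table below is illustrative only, not the patent's actual model.

```python
from collections import Counter
from itertools import combinations

# Illustrative object feature model: pairs of object features mapped to
# the scene information they suggest.
FEATURE_PAIR_TO_SCENE = {
    frozenset({"beige", "smiling"}): "pleasant",
    frozenset({"sitting", "smiling"}): "pleasant",
    frozenset({"dark", "screaming"}): "horror",
}

def infer_scene_info(object_features):
    """Match every pair of object features against the model, then keep
    the scene label with the most votes (multiple rounds of matching)."""
    votes = Counter()
    for a, b in combinations(object_features, 2):
        scene = FEATURE_PAIR_TO_SCENE.get(frozenset({a, b}))
        if scene:
            votes[scene] += 1
    return votes.most_common(1)[0][0] if votes else None

# Example: infer_scene_info(["beige", "sitting", "smiling"]) -> "pleasant"
```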
  • Optionally, the third device 3 includes a unit for determining the object feature corresponding to each visual object (hereinafter referred to as the "three-one unit", not shown), and a unit for determining, according to the association information between the object features corresponding to each visual object, the scene information corresponding to the one or more video frames (hereinafter referred to as the "three-two unit", not shown).
  • Specifically, the three-one unit analyzes the image features of each visual object separately, thereby determining the object features corresponding to each visual object.
  • Optionally, the three-one unit may determine the object feature corresponding to each visual object according to the object attributes of that visual object.
  • Specifically, the three-one unit first determines the object attribute corresponding to each visual object, for example based on the name of the visual object; the object attributes include, but are not limited to, the various categories of the visual object.
  • For example, for a "sofa", the corresponding object attribute is "furniture", and accordingly the object features of "furniture" should include "color, shape, size", etc.; for "Wang Leehom", the corresponding object attributes are "person" and/or "entertainment star", and accordingly the object features of a "person" include "gender, expression, action, costume", etc., while the object features of an "entertainment star" include "name" and the like.
  • Then, according to the determined object attributes, the three-one unit analyzes the specific visual object for the corresponding features, thereby obtaining the object features of that visual object.
  • For example, the object features of the "sofa" are "color: beige; shape: L-shaped; size: large", and the object features of "Wang Leehom" are "gender: male; expression: smiling; action: sitting; costume: shirt; name: Wang Leehom".
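A sketch of how object attributes could drive which features get extracted, per the "sofa"/"Wang Leehom" example above; the schema table is an assumption, not the patent's own data.

```python
# Hypothetical attribute-to-feature schema: the object attribute decides
# which features are worth extracting for a visual object.
ATTRIBUTE_FEATURES = {
    "furniture":          ["color", "shape", "size"],
    "person":             ["gender", "expression", "action", "costume"],
    "entertainment star": ["name"],
}

def features_to_extract(object_attributes):
    """Order-preserving union of the feature names implied by all of an
    object's attributes, e.g. ["person", "entertainment star"] yields
    gender, expression, action, costume, name."""
    wanted = []
    for attr in object_attributes:
        for feat in ATTRIBUTE_FEATURES.get(attr, []):
            if feat not in wanted:
                wanted.append(feat)
    return wanted
```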
  • Then the three-two unit determines the scene information corresponding to the one or more video frames according to the association information between the object features corresponding to each visual object.
  • Here, the association information may be derived from a preset association model; the association model contains association information between two or more different object features, and this association information may be obtained through analysis and training over large amounts of data. For example, "sofa" and "sitting" are associated, "smiling" and "sitting" are associated, and "beige" and "smiling" are emotionally associated (both express pleasure).
  • Accordingly, the determined scene information is "pleasant".
  • the fourth device 4 determines video semantic information corresponding to the video frame of the video according to the scene information.
  • Specifically, the fourth device 4 converts the determined scene information into information in text or voice form, to serve as the video semantic information corresponding to the video frames of the video.
  • the video semantic information includes, but is not limited to, a scene profile, a scene state, and/or scene detail information, such as time, place, person, action, mood, and the like.
  • the emotions include happiness, anger, fear, calmness, etc.
  • the scene information includes humor, horror, hilarity, pleasure, and the like.
  • The fourth device 4 may use the scene information directly as the video semantic information, or may organize the scene information into video semantic information that conforms to natural language usage. For example, the video semantic information may be "a pleasant scene", or specific information such as "Wang Leehom sits on the sofa smiling", "someone is doing something somewhere", "someone is doing something with some emotion", or "something is in some state relative to something else".
  • Optionally, the fourth device 4 may semantically combine the visual objects and the object features corresponding to the visual objects to generate candidate video semantic information, and then, according to the scene information and in combination with the candidate video semantic information, determine the video semantic information corresponding to the video frames of the video.
  • Specifically, the fourth device 4 may semantically combine the visual objects and their corresponding object features. For example, in the example above, when the second device 2 determines that the visual objects are "sofa" and "Wang Leehom", with the "sofa" feature "color is beige" and the "Wang Leehom" features "sitting" and "smiling", the fourth device 4 can semantically combine these object features into, e.g., "sofa + beige", "sofa + Wang Leehom + sitting", "smiling + sitting", and so on.
  • The generated semantic combinations are used as candidate video semantic information.
  • A semantic combination may be pairwise or may combine more than two object features; it may combine multiple features of the same visual object, or features of different visual objects.
  • Then the fourth device 4 recombines or selects among the candidate video semantic information according to the scene information; further, the candidate video semantic information may be polished according to the scene information, thereby determining the video semantic information corresponding to the video frames of the video.
  • For example, the fourth device 4 takes "beige" and "smiling", which express "pleasure", as part of the video semantic information, thereby determining the video semantic information as "sits on the beige sofa and smiles".
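The sketch below illustrates candidate generation and scene-based selection; `scene_vocab`, which maps a scene label to feature words expressing it, is an assumed stand-in for the selection/polishing step.

```python
from itertools import combinations

def candidate_semantics(objects):
    """Pairwise semantic combinations of visual objects and their
    features, e.g. "sofa + beige", "smiling + sitting"."""
    atoms = []
    for obj in objects:          # obj: {"label": str, "features": [str]}
        atoms.append(obj["label"])
        atoms.extend(obj["features"])
    return [f"{a} + {b}" for a, b in combinations(atoms, 2)]

def select_by_scene(candidates, scene_info, scene_vocab):
    """Keep only the candidates consistent with the scene information."""
    words = scene_vocab.get(scene_info, set())
    return [c for c in candidates if any(w in c for w in words)]

# Example:
# objs = [{"label": "sofa", "features": ["beige"]},
#         {"label": "Wang Leehom", "features": ["sitting", "smiling"]}]
# select_by_scene(candidate_semantics(objs), "pleasant",
#                 {"pleasant": {"beige", "smiling"}})
```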
  • Optionally, the processing device further comprises a means for acquiring the voice and/or subtitle information corresponding to the one or more video frames (hereinafter referred to as the "sixth device", not shown); the fourth device 4 then determines, according to the scene information and in combination with the voice and/or subtitle information, the video semantic information corresponding to the video frames of the video.
  • For example, the sixth device may obtain the voice file or subtitle file of the video and take from it the portion corresponding to the video frames, or it may obtain the voice and/or subtitle information corresponding to the one or more video frames by performing speech extraction or subtitle extraction on the video.
  • Then, according to the scene information and in combination with the voice and/or subtitle information, the fourth device 4 may use the voice and/or subtitles to combine the scene information and the object feature information into video semantic information; or use the voice and/or subtitle information directly as part of the video semantic information; or use it to filter the generated candidate video semantic information, thereby determining the video semantic information.
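One simple fusion policy, sketched under the assumption that subtitles are available as timed (start, end, text) tuples, e.g. parsed from an .srt file:

```python
def subtitles_for_span(subtitles, start_s, end_s):
    """Pick the subtitle lines overlapping the frames' time span.
    `subtitles` is a list of (start_s, end_s, text) tuples."""
    return [text for s, e, text in subtitles if s < end_s and e > start_s]

def fuse_semantics(scene_semantics, subtitles, start_s, end_s):
    """Use overlapping subtitle text directly as part of the video
    semantic information, one of the policies described above."""
    lines = subtitles_for_span(subtitles, start_s, end_s)
    return scene_semantics + (" | " + " ".join(lines) if lines else "")
```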
  • Optionally, the processing device further includes a means for acquiring one or more video retrieval sequences (hereinafter referred to as the "seventh device", not shown) and a means for matching the video retrieval sequences with the video semantic information to determine the target videos corresponding to the video retrieval sequences (hereinafter referred to as the "eighth device", not shown).
  • Specifically, the seventh device acquires one or more video retrieval sequences by interacting directly with users or with other devices capable of providing video retrieval sequences; then the eighth device matches each video retrieval sequence against the determined video semantic information of each frame or consecutive-frame group of each video. If the video retrieval sequence matches certain video semantic information, the video corresponding to that semantic information is taken as the target video.
  • The processing device may further provide the target video to the user who sent the video retrieval sequence.
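A toy sketch of retrieval-sequence matching by token overlap; a production system would more likely use an inverted index or embedding similarity, so treat this purely as an illustration.

```python
def match_retrieval(query, video_semantics):
    """Rank videos by how many query tokens their per-frame semantic
    strings cover. `video_semantics` maps video_id to a list of semantic
    strings (one per frame or consecutive-frame group)."""
    q = set(query.lower().split())
    scores = {}
    for vid, sems in video_semantics.items():
        tokens = set(" ".join(sems).lower().split())
        overlap = len(q & tokens)
        if overlap:
            scores[vid] = overlap
    return sorted(scores, key=scores.get, reverse=True)

# Example:
# match_retrieval("smiling sofa",
#                 {"v1": ["sits on the beige sofa and smiles"]})  # -> ["v1"]
```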
  • Referring to FIG. 2: in step S1, the processing device extracts one or more video frames from the video; in step S2, the processing device performs target extraction on the video frames to determine the visual objects contained in the video frames; in step S3, the processing device determines, according to the object features corresponding to the visual objects, the scene information corresponding to the one or more video frames; in step S4, the processing device determines, according to the scene information, the video semantic information corresponding to the video frames of the video (an end-to-end sketch follows below).
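Tying steps S1 to S4 together, the sketch below wires up the illustrative helpers sketched earlier in the device description; all of them are assumptions rather than the patent's own code, and the feature-extraction step that would populate each object's `features` list is left as a placeholder.

```python
def video_semantics_pipeline(video_path, detector, generic_model,
                             cast_models, scene_vocab):
    """Illustrative end-to-end flow of steps S1-S4 using the sketches above."""
    frames = extract_consecutive_frames(video_path)               # S1
    objects = []
    for frame in frames:                                          # S2
        objects.extend(identify_visual_objects(
            frame, detector, generic_model, cast_models))
    # A real system would extract per-object features here (see the
    # attribute-to-feature schema sketch); we default to empty lists.
    for obj in objects:
        obj.setdefault("features", [])
    features = [f for obj in objects for f in obj["features"]]
    scene = infer_scene_info(features)                            # S3
    candidates = candidate_semantics(objects)                     # S4
    return scene, select_by_scene(candidates, scene, scene_vocab)
```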
  • In step S1, the processing device extracts one or more video frames from the video.
  • Specifically, in step S1, the processing device extracts one or more video frames from the video to be analyzed, by means of automatic extraction or manual designation.
  • The extracted video frames may be one or more, and may be continuous or discontinuous.
  • Subsequent steps then perform video semantic analysis on the extracted video frames.
  • Optionally, in step S1, the processing device extracts a plurality of video frames from the video, wherein the plurality of video frames are continuous. That is, the processing device can extract multiple consecutive video frames from the same video; here, "consecutive video frames" means video frames whose adjacent playback times fall within a certain threshold of one another.
  • Subsequent steps then perform video semantic analysis on the extracted consecutive video frames.
  • The analysis can treat the consecutive video frames as a whole, obtaining one piece of video semantic information for the multiple consecutive frames; or it can analyze the consecutive video frames separately, obtaining multiple pieces of video semantic information for them.
  • the method further comprises a step S5.
  • In step S5, the processing device performs scene segmentation on the video.
  • Then, in step S1, the processing device extracts, according to the scenes corresponding to the video, multiple video frames from the video, where the multiple video frames are continuous and correspond to the same scene.
  • Specifically, in step S5, the processing device performs scene segmentation on the video according to its time, color, person changes, and the like.
  • For example, by default the video may be divided into one scene per minute, so that each minute of video constitutes one scene; or scene segmentation may be based on color changes in the video, such as a switch from cool tones to warm tones; or on person changes in the video, e.g. when the picture changes from two people to three, the scene is considered to have changed, and segmentation is performed accordingly.
  • Then, in step S1, the processing device performs video frame extraction on the segmented video scene by scene; in other words, the processing device may extract video frames from at least one of the multiple scenes, and the extracted video frames are continuous and correspond to that scene.
  • Optionally, in step S1, the processing device extracts one or more video frames from the video when one or more trigger conditions are met.
  • Specifically, the processing device may monitor the video in real time or based on events, and determine whether the current video meets a trigger condition; when a trigger condition is met, the processing device extracts one or more video frames from the video.
  • the triggering condition includes at least one of the following:
  • Triggering according to the play duration of the video: for example, frame extraction is performed when the video has played for five minutes. If a user starts playing from the beginning of the video (0 minutes 0 seconds), frames are extracted at the fifth minute of playback (5 minutes 0 seconds); if a user starts playing from the tenth minute of the video (10 minutes 0 seconds), frames are extracted at the fifteenth minute (15 minutes 0 seconds).
  • Triggering according to the play time point of the video: for example, frame extraction is triggered at the third, fifth, and seventh minutes of the video.
  • Triggering according to one or more play contents of the video: the play contents include, but are not limited to, voice, persons, items, patterns, and the like.
  • For example, when a voice appears in the video, frame extraction is triggered; when the character "Cao Cao" appears in the picture, frame extraction is triggered; when the "Coca-Cola" pattern appears in the video, frame extraction is triggered; when the "CCTV" logo appears in the video, frame extraction is triggered; and so on.
  • When frame extraction is triggered, the processing device may determine parameters such as how many video frames to extract and the video length covered by the extraction according to default settings; it may also determine the extraction parameters according to the trigger condition, for example the extraction frequency (how many frames per second), the extraction count (how many frames in total), the video length covered (e.g., after extraction starts at some time point, at which time point it stops), which video frames are extracted, and the like.
  • Different trigger conditions may correspond to different parameter settings; for example, different playback contents that trigger frame extraction may correspond to different extraction parameter settings.
  • In step S2, the processing device performs target extraction on the video frames to determine the visual objects contained in the video frames.
  • Specifically, in step S2, the processing device performs image processing on the video frame based on various image-processing algorithms, such as texture extraction and color analysis, to extract the targets contained in the video frame; it then compares each extracted target with one or more object models to determine the visual object corresponding to the target, thereby determining the visual objects contained in the video frame.
  • the visual objects include, but are not limited to, people, objects, backgrounds, icons, and the like.
  • the object model may be pre-set, or may be continuously learned and acquired based on a machine learning method.
  • Optionally, the processing device may perform target extraction on the video frame in combination with the video-related information of the video, to determine the visual objects contained in the video frame.
  • the video related information includes, but is not limited to, a video content introduction, a video main character profile, a video related search result, a video producer/author, and the like.
  • Specifically, in step S2, the processing device performs image processing on the video frame based on various image-processing algorithms, such as texture extraction and color analysis, to extract the targets contained in the video frame. Then, it either matches each extracted target against one or more object models associated with the video-related information, to determine the visual object corresponding to the target; or it first matches the extracted target against one or more general object models and, once the target's category is determined (such as person, item, background, or icon), matches the target against the specific models in that category associated with the video-related information, thereby determining the visual objects contained in the video frame.
  • For example, the extracted target (typically a person-like target) is matched with the image model of each actor in the cast list, to determine whether the extracted target is that actor; alternatively, the extracted target is first matched with a general model to determine that its category is "person", and the person target is then matched with the image models of the video's actors to determine whether it is a given actor.
  • In step S3, the processing device determines, according to the object features corresponding to the visual objects, the scene information corresponding to the one or more video frames.
  • Specifically, in step S3, the processing device determines the object features corresponding to the visual objects by analyzing the image features of the visual objects.
  • the object features include, but are not limited to, actions, emotions, colors, positions, and the like; further, for multiple visual objects within the same video frame, the object features also include the interactions or associations between those visual objects.
  • Then the processing device analyzes the interrelationships among the object features to determine the scene information corresponding to the one or more video frames.
  • the determination of the scene information may be performed based on a preset object feature model, where the object feature model includes a mapping relationship between multiple object features or a combination thereof and different scene information.
  • the scene information includes, but is not limited to, information such as humor, horror, hilarity, pleasure, and the like that express a scene state or a scene atmosphere.
  • For example, suppose the processing device determines that the visual objects are "sofa" and "Wang Leehom"; through analysis, the object feature of the "sofa" is "color is beige" and the object features of "Wang Leehom" are "sitting" and "smiling". Then, in step S3, based on the mapping relationships of "sofa", "beige", "sitting", "smiling", and "Wang Leehom (person)" in the object feature model, and on analyses such as "color emotion" and "expression emotion", the processing device determines that the scene information is "pleasant".
  • When the processing device analyzes the object features, it may first analyze them pairwise, or match pairwise combinations of object features against the mapping relationships in the object feature model; in this way, after multiple rounds of analysis or matching, the final scene information is obtained.
  • Alternatively, the processing device can analyze and match all object features directly, thereby obtaining the scene information in one step.
  • Optionally, step S3 includes a step S31 (not shown) and a step S32 (not shown); in step S31, the processing device determines the object feature corresponding to each visual object; in step S32, the processing device determines, according to the association information between the object features corresponding to each visual object, the scene information corresponding to the one or more video frames.
  • Specifically, in step S31, the processing device analyzes the image features of each visual object separately, thereby determining the object features corresponding to each visual object.
  • the processing device may determine an object feature corresponding to each of the visual objects according to an object attribute of each of the visual objects.
  • Specifically, in step S31, the processing device first determines the object attribute corresponding to each visual object, for example based on the name of the visual object; the object attributes include, but are not limited to, the various categories of the visual object.
  • For example, for a "sofa", the corresponding object attribute is "furniture", and accordingly the object features of "furniture" should include "color, shape, size", etc.; for "Wang Leehom", the corresponding object attributes are "person" and/or "entertainment star", and accordingly the object features of a "person" include "gender, expression, action, costume", etc., while the object features of an "entertainment star" include "name" and the like.
  • Then, in step S31, according to the determined object attributes, the processing device analyzes the specific visual object for the corresponding features, thereby obtaining the object features of that visual object.
  • For example, the object features of the "sofa" are "color: beige; shape: L-shaped; size: large", and the object features of "Wang Leehom" are "gender: male; expression: smiling; action: sitting; costume: shirt; name: Wang Leehom".
  • the processing device determines the scene information corresponding to the one or more video frames according to the association information between the object features corresponding to each of the visual objects.
  • Here, the association information may be derived from a preset association model; the association model contains association information between two or more different object features, and this association information may be obtained through analysis and training over large amounts of data. For example, "sofa" and "sitting" are associated, "smiling" and "sitting" are associated, and "beige" and "smiling" are emotionally associated (both express pleasure).
  • Accordingly, the determined scene information is "pleasant".
  • step S4 the processing device determines video semantic information corresponding to the video frame of the video according to the scene information.
  • Specifically, the processing device converts the determined scene information into information in text or voice form, to serve as the video semantic information corresponding to the video frames of the video.
  • the video semantic information includes, but is not limited to, a scene profile, a scene state, and/or scene detail information, such as time, place, person, action, mood, and the like.
  • the emotions include happiness, anger, fear, calmness, etc.
  • the scene information includes humor, horror, hilarity, pleasure, and the like.
  • The processing device may use the scene information directly as the video semantic information, or may organize the scene information into video semantic information that conforms to natural language usage. For example, the video semantic information may be "a pleasant scene", or specific information such as "Wang Leehom sits on the sofa smiling", "someone is doing something somewhere", "someone is doing something with some emotion", or "something is in some state relative to something else".
  • Optionally, the processing device may semantically combine the visual objects and their corresponding object features to generate candidate video semantic information, and then, according to the scene information and in combination with the candidate video semantic information, determine the video semantic information corresponding to the video frames of the video.
  • Specifically, the processing device may semantically combine the visual objects and their corresponding object features. For example, in the example above, when the processing device determines that the visual objects are "sofa" and "Wang Leehom", with the "sofa" feature "color is beige" and the "Wang Leehom" features "sitting" and "smiling", the processing device can semantically combine these object features into, e.g., "sofa + beige", "sofa + Wang Leehom + sitting", "smiling + sitting", and so on.
  • The generated semantic combinations are used as candidate video semantic information.
  • A semantic combination may be pairwise or may combine more than two object features; it may combine multiple features of the same visual object, or features of different visual objects.
  • Then the processing device recombines or selects among the candidate video semantic information according to the scene information; further, the candidate video semantic information may be polished according to the scene information, thereby determining the video semantic information corresponding to the video frames of the video.
  • For example, the processing device takes "beige" and "smiling", which express "pleasure", as part of the video semantic information, thereby determining the video semantic information as "sits on the beige sofa and smiles".
  • Optionally, the method further comprises a step S6 (not shown), wherein in step S6 the processing device acquires the voice and/or subtitle information corresponding to the one or more video frames; in step S4, the processing device determines, according to the scene information and in combination with the voice and/or subtitle information, the video semantic information corresponding to the video frames of the video.
  • Specifically, the processing device may obtain the voice file or subtitle file of the video and take from it the portion corresponding to the video frames, or it may obtain the voice and/or subtitle information corresponding to the one or more video frames by performing speech extraction or subtitle extraction on the video.
  • Then, in step S4, according to the scene information and in combination with the voice and/or subtitle information, the processing device may use the voice and/or subtitles to combine the scene information and the object feature information into video semantic information; or use the voice and/or subtitle information directly as part of the video semantic information; or use it to filter the generated candidate video semantic information, thereby determining the video semantic information.
  • the method further comprises a step S7 (not shown) and a step S8 (not shown); wherein, in step S7, the processing device acquires one or more video retrieval sequences; in step S8, The processing device matches the video retrieval sequence with the video semantic information to determine a target video corresponding to the video retrieval sequence.
  • Specifically, in step S7, the processing device acquires one or more video retrieval sequences by interacting directly with users or with other devices capable of providing video retrieval sequences; then, in step S8, the processing device matches each video retrieval sequence against the determined video semantic information of each frame or consecutive-frame group of each video, and if the video retrieval sequence matches certain video semantic information, the video corresponding to that semantic information is taken as the target video.
  • The processing device may further provide the target video to the user who sent the video retrieval sequence.
  • the present invention can be implemented in software and/or a combination of software and hardware, for example, using an application specific integrated circuit (ASIC), a general purpose computer, or any other similar hardware device.
  • the software program of the present invention may be executed by a processor to implement the steps or functions described above.
  • the software program (including related data structures) of the present invention can be stored in a computer readable recording medium such as a RAM memory, a magnetic or optical drive or a floppy disk and the like.
  • some of the steps or functions of the present invention may be implemented in hardware, for example, as a circuit that cooperates with a processor to perform various steps or functions.
  • a portion of the invention can be applied as a computer program product, such as computer program instructions, which, when executed by a computer, can invoke or provide a method and/or solution in accordance with the present invention.
  • the program instructions for invoking the method of the present invention may be stored in a fixed or removable recording medium, and/or transmitted by a data stream in a broadcast or other signal-bearing medium, and/or stored in the working memory of the computer device that runs the program instructions.
  • an embodiment in accordance with the present invention includes a device comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein, when the computer program instructions are executed by the processor, the device is triggered to operate based on the methods and/or technical solutions according to the various embodiments described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method and device for acquiring video semantic information. The method comprises: performing target extraction on video frames in a video to determine the visual objects contained in the video frames (S2); determining, according to the object features corresponding to the visual objects, the scene information corresponding to the one or more video frames (S3); and determining, according to the scene information, the video semantic information corresponding to the video frames of the video (S4). Compared with the prior art, the present method and apparatus can automatically acquire detailed and complete video semantic information on the basis of a video, saving a large amount of human effort otherwise needed to acquire video semantic information; furthermore, the acquired video semantic information facilitates subsequent analysis and retrieval of the video.

Description

一种用于获取视频语义信息的方法与装置Method and device for acquiring video semantic information
相关申请的交叉引用Cross-reference to related applications
本申请享有2018年1月25日提交的专利申请号为201810074371.2、名称为“一种用于获取视频语义信息的方法与装置”的中国专利申请的优先权,该在先申请的内容以引用方式合并于此。The present application is entitled to the priority of the Chinese Patent Application No. 201810074371.2 filed on Jan. 25, 20, the entire disclosure of which is incorporated herein by reference. Merge here.
技术领域Technical field
本发明涉及视频技术领域,尤其涉及一种用于获取视频语义信息的技术。The present invention relates to the field of video technologies, and in particular, to a technique for acquiring video semantic information.
背景技术Background technique
当前,获取视频内容的方法主要包括:根据视频介绍来获取,或是通过对视频内容进行分析后获取。前者主要是基于视频简介,而视频简介所覆盖的视频内容有限,无法反应视频内容的具体细节;后者则主要是对视频画面进行人物识别以及情绪识别,因此,所还原的视频信息有限,无法完整地还原视频所对应的具体语义信息。Currently, the method for obtaining video content mainly includes: obtaining according to a video introduction, or obtaining an analysis by analyzing the video content. The former is mainly based on video introduction, and the video content covered by the video introduction is limited, and cannot reflect the specific details of the video content; the latter mainly performs character recognition and emotion recognition on the video screen. Therefore, the restored video information is limited and cannot be Completely restore the specific semantic information corresponding to the video.
因此,如何能够获取详细的视频语义信息,进一步支持视频的应用,成为了本领域技术人员亟待解决的问题之一。Therefore, how to obtain detailed video semantic information and further support the application of video has become one of the problems to be solved by those skilled in the art.
发明内容Summary of the invention
本发明的目的是提供一种用于获取视频语义信息的方法与设备。It is an object of the present invention to provide a method and apparatus for acquiring video semantic information.
According to an embodiment of the present invention, a method for acquiring video semantic information is provided, wherein the method comprises the following steps:
extracting one or more video frames from a video;
performing target extraction on the video frames to determine the visual objects contained in the video frames;
determining, according to the object features corresponding to the visual objects, the scene information corresponding to the one or more video frames;
determining, according to the scene information, the video semantic information corresponding to the video frames of the video.
Optionally, the step of extracting one or more video frames from the video comprises:
extracting a plurality of video frames from the video, wherein the plurality of video frames are continuous.
Optionally, the method further comprises:
performing scene segmentation on the video;
wherein the step of extracting a plurality of video frames from the video comprises:
extracting, according to the scenes corresponding to the video, a plurality of video frames from the video, wherein the plurality of video frames are continuous and correspond to the same scene.
Optionally, the step of extracting one or more video frames from the video comprises:
extracting one or more video frames from the video when one or more trigger conditions are met;
wherein the trigger conditions include at least any one of the following:
triggering according to the playback duration of the video;
triggering according to a playback time point of the video;
triggering according to one or more pieces of playback content of the video.
Optionally, the step of determining the visual objects contained in the video frames comprises:
performing target extraction on the video frames in combination with video-related information of the video, to determine the visual objects contained in the video frames.
Optionally, the step of determining the scene information corresponding to the one or more video frames comprises:
determining the object features corresponding to each of the visual objects;
determining, according to the association information between the object features corresponding to each of the visual objects, the scene information corresponding to the one or more video frames.
Optionally, the step of determining the object features corresponding to each of the visual objects comprises:
determining, according to the object attributes of each visual object, the object features corresponding to that visual object.
Optionally, the step of determining the video semantic information corresponding to the video frames of the video comprises:
semantically combining the visual objects and the object features corresponding to the visual objects to generate candidate video semantic information;
determining, according to the scene information and in combination with the candidate video semantic information, the video semantic information corresponding to the video frames of the video.
Optionally, the method further comprises:
acquiring speech and/or subtitle information corresponding to the one or more video frames;
wherein the step of determining the video semantic information corresponding to the video frames of the video comprises:
determining, according to the scene information and in combination with the speech and/or subtitle information, the video semantic information corresponding to the video frames of the video.
Optionally, the method further comprises:
acquiring one or more video retrieval sequences;
matching the video retrieval sequences against the video semantic information to determine the target videos corresponding to the video retrieval sequences.
According to another embodiment of the present invention, a processing device for acquiring video semantic information is further provided, wherein the processing device comprises:
means for extracting one or more video frames from a video;
means for performing target extraction on the video frames to determine the visual objects contained in the video frames;
means for determining, according to the object features corresponding to the visual objects, the scene information corresponding to the one or more video frames;
means for determining, according to the scene information, the video semantic information corresponding to the video frames of the video.
Optionally, the means for extracting one or more video frames from the video is configured to:
extract a plurality of video frames from the video, wherein the plurality of video frames are continuous.
Optionally, the processing device further comprises:
means for performing scene segmentation on the video;
wherein the means for extracting a plurality of video frames from the video is configured to:
extract, according to the scenes corresponding to the video, a plurality of video frames from the video, wherein the plurality of video frames are continuous and correspond to the same scene.
Optionally, the means for extracting one or more video frames from the video is configured to:
extract one or more video frames from the video when one or more trigger conditions are met;
wherein the trigger conditions include at least any one of the following:
triggering according to the playback duration of the video;
triggering according to a playback time point of the video;
triggering according to one or more pieces of playback content of the video.
Optionally, the means for determining the visual objects contained in the video frames is configured to:
perform target extraction on the video frames in combination with video-related information of the video, to determine the visual objects contained in the video frames.
Optionally, the means for determining the scene information corresponding to the one or more video frames comprises:
a unit for determining the object features corresponding to each of the visual objects;
a unit for determining, according to the association information between the object features corresponding to each of the visual objects, the scene information corresponding to the one or more video frames.
Optionally, the unit for determining the object features corresponding to each of the visual objects is configured to:
determine, according to the object attributes of each visual object, the object features corresponding to that visual object.
Optionally, the means for determining the video semantic information corresponding to the video frames of the video is configured to:
semantically combine the visual objects and the object features corresponding to the visual objects to generate candidate video semantic information;
determine, according to the scene information and in combination with the candidate video semantic information, the video semantic information corresponding to the video frames of the video.
Optionally, the processing device further comprises:
means for acquiring speech and/or subtitle information corresponding to the one or more video frames;
wherein the means for determining the video semantic information corresponding to the video frames of the video is configured to:
determine, according to the scene information and in combination with the speech and/or subtitle information, the video semantic information corresponding to the video frames of the video.
Optionally, the processing device further comprises:
means for acquiring one or more video retrieval sequences;
means for matching the video retrieval sequences against the video semantic information to determine the target videos corresponding to the video retrieval sequences.
According to another embodiment of the present invention, a computer-readable storage medium is further provided, wherein the computer-readable storage medium stores computer-readable instructions which, when executed by one or more devices, cause the devices to perform the method described above.
According to another embodiment of the present invention, a system for acquiring video semantic information is further provided, wherein the system comprises a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the method described above.
Compared with the prior art, the present invention performs target extraction on the video frames of a video to determine the visual objects contained in those frames, then determines, according to the object features corresponding to the visual objects, the scene information corresponding to the one or more video frames, and finally determines, according to the scene information, the video semantic information corresponding to the video frames of the video. The present invention can thus automatically acquire detailed and complete video semantic information on the basis of a video, saving a large amount of human resources otherwise spent on acquiring video semantics; moreover, the acquired video semantic information facilitates subsequent analysis or searching of the video.
Moreover, the present invention can analyze continuous video frames, or continuous video frames corresponding to the same scene, so that the acquired video semantic information is more complete and accurate.
Moreover, the present invention can extract one or more video frames from the video on the basis of different trigger conditions, thereby achieving targeted extraction of video frames and fully automatic video semantic analysis, facilitating the targeted analysis and acquisition of video semantic information, improving processing efficiency, and saving a large amount of human resources.
Moreover, the present invention can determine the visual objects contained in the video frames in combination with the video-related information of the video, so that the determined visual objects are more accurate, which further improves the accuracy of the obtained video semantic information.
Moreover, the present invention can determine the scene information corresponding to the one or more video frames according to the association information between the object features corresponding to each visual object; further, it can determine the object features corresponding to each visual object according to the object attributes of that visual object. The present invention thereby improves the accuracy of the acquired scene information and further improves the accuracy of the obtained video semantic information.
Moreover, the present invention can determine, according to the scene information and in combination with the speech and/or subtitle information, the video semantic information corresponding to the video frames of the video, thereby improving the accuracy of the acquired video semantic information.
Brief description of the drawings
Other features, objects and advantages of the present invention will become more apparent by reading the following detailed description of non-limiting embodiments made with reference to the accompanying drawings:
Fig. 1 shows a schematic diagram of a processing device for acquiring video semantic information according to one aspect of the present invention;
Fig. 2 shows a flowchart of a method for acquiring video semantic information according to another aspect of the present invention.
The same or similar reference numerals in the drawings denote the same or similar components.
Detailed description of the embodiments
Before discussing the exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart describes the operations as a sequential process, many of the operations may be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be rearranged. A process may be terminated when its operations are completed, but may also have additional steps not included in the figures. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, and the like.
The "processing device" referred to in this context, i.e. a "computer device", also called a "computer", refers to an intelligent electronic device that can execute predetermined processes such as numerical computation and/or logical computation by running predetermined programs or instructions. It may comprise a processor and a memory, with the processor executing instructions pre-stored in the memory to perform the predetermined process; or the predetermined process may be performed by hardware such as an ASIC, an FPGA or a DSP, or by a combination of the two.
The computer device includes user equipment and/or network devices. The user equipment includes, but is not limited to, computers, smartphones, PDAs and the like; the network devices include, but are not limited to, a single network server, a server group consisting of multiple network servers, or a cloud consisting of a large number of computers or network servers based on cloud computing, where cloud computing is a kind of distributed computing: a super virtual computer composed of a group of loosely coupled computers. The computer device may operate alone to implement the present invention, or may access a network and implement the present invention by interacting with other computer devices in the network. The network in which the computer device is located includes, but is not limited to, the Internet, wide area networks, metropolitan area networks, local area networks, VPN networks, and the like.
Those skilled in the art should understand that, in general, the "processing device" described in the present invention may be only a network device, i.e. the network device performs the corresponding operations; in special cases, it may also be formed by integrating user equipment with a network device or server, i.e. the user equipment cooperates with the network device to perform the corresponding operations, for example, the user equipment sends an instruction to the network device to instruct the network device to start performing the corresponding operation of "acquiring video semantic information".
It should be noted that the user equipment, network devices and networks are merely examples, and other existing or future computer devices or networks, if applicable to the present invention, should also be included within the protection scope of the present invention and are incorporated herein by reference.
The specific structural and functional details disclosed here are merely representative and are for the purpose of describing exemplary embodiments of the present invention. The present invention may, however, be embodied in many alternative forms and should not be construed as being limited only to the embodiments set forth herein.
It should be understood that although the terms "first", "second" and the like may be used here to describe various units, these units should not be limited by these terms. These terms are used only to distinguish one unit from another. For example, a first unit could be termed a second unit, and similarly a second unit could be termed a first unit, without departing from the scope of the exemplary embodiments. The term "and/or" as used here includes any and all combinations of one or more of the associated listed items.
The terminology used here is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments. Unless the context clearly indicates otherwise, the singular forms "a" and "an" as used here are intended to include the plural as well. It should also be understood that the terms "comprise" and/or "include" as used here specify the presence of the stated features, integers, steps, operations, units and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, units, components and/or combinations thereof.
It should also be mentioned that, in some alternative implementations, the functions/actions mentioned may occur in an order different from that indicated in the figures. For example, two figures shown in succession may in fact be executed substantially concurrently, or may sometimes be executed in the reverse order, depending on the functions/actions involved.
The present invention is described in further detail below with reference to the accompanying drawings.
Fig. 1 shows a schematic diagram of a processing device for acquiring video semantic information according to one aspect of the present invention, wherein the processing device comprises: means for extracting one or more video frames from a video (hereinafter referred to as the "first device 1"); means for performing target extraction on the video frames to determine the visual objects contained in the video frames (hereinafter referred to as the "second device 2"); means for determining, according to the object features corresponding to the visual objects, the scene information corresponding to the one or more video frames (hereinafter referred to as the "third device 3"); and means for determining, according to the scene information, the video semantic information corresponding to the video frames of the video (hereinafter referred to as the "fourth device 4").
The first device 1 extracts one or more video frames from the video.
Specifically, the first device 1 extracts one or more video frames from the video to be analyzed, by way of automatic extraction or manually specified extraction. Those skilled in the art should understand that the extracted video frames may be one or more, and may be continuous or discontinuous. After the first device 1 has extracted the video frames, the subsequent devices analyze the extracted video frames for video semantic information.
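By way of a non-limiting sketch only, such frame extraction might look as follows in Python; the use of OpenCV and the fixed sampling stride are assumptions introduced for the illustration rather than part of the disclosure:

```python
import cv2  # assumed dependency for this illustration only

def extract_frames(video_path, every_n_frames=25):
    """Extract one frame out of every `every_n_frames` frames of a video.

    The fixed stride is an illustrative assumption; the method itself also
    allows manually specified extraction, which would replace the stride
    test with a check against a list of requested frame indices.
    """
    capture = cv2.VideoCapture(video_path)
    frames = []
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:  # end of stream or read failure
            break
        if index % every_n_frames == 0:
            frames.append(frame)
        index += 1
    capture.release()
    return frames
```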
Preferably, the first device 1 extracts a plurality of video frames from the video, wherein the plurality of video frames are continuous. That is, the first device 1 may simultaneously extract multiple continuous video frames from the same video; here, continuous video frames are multiple video frames in which the playback times of any two adjacent frames differ by no more than a certain threshold. After the first device 1 has extracted the continuous video frames, the subsequent devices analyze the extracted continuous video frames for video semantic information.
Here, the analysis performed may treat the continuous video frames as a whole, yielding a single piece of video semantic information for the multiple continuous frames; alternatively, the continuous video frames may be analyzed separately, yielding multiple pieces of video semantic information for the multiple continuous frames.
More preferably, the processing device further comprises means for performing scene segmentation on the video (hereinafter referred to as the "fifth device", not shown). The fifth device performs scene segmentation on the video; then, the first device 1 extracts, according to the scenes corresponding to the video, a plurality of video frames from the video, wherein the plurality of video frames are continuous and correspond to the same scene.
Specifically, the fifth device performs scene segmentation on the video according to its time, color, character changes and the like. For example, the video may be segmented under a default setting of one scene per minute, with each minute of the video treated as one scene; or, scene segmentation may be performed according to color changes in the video, e.g. when the picture switches from cool tones to warm tones the scene is considered to have changed; or, scene segmentation may be performed according to character changes in the video, e.g. when the picture changes from two people to three people the scene is considered to have changed.
Those skilled in the art should understand that other scene segmentation methods are also applicable to the present invention for performing scene segmentation on the video.
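As one hedged illustration of the color-based segmentation mentioned above, a scene boundary could be declared wherever the color histograms of adjacent frames diverge; the histogram parameters and the 0.5 threshold below are assumed values, not prescribed ones:

```python
import cv2  # assumed dependency for this illustration only

def split_scenes_by_color(frames, threshold=0.5):
    """Return the frame indices at which a new scene is assumed to start.

    A boundary is declared when the correlation between the color
    histograms of two adjacent frames falls below `threshold`.
    """
    boundaries = [0]
    prev_hist = None
    for i, frame in enumerate(frames):
        # 8x8x8-bin histogram over the three color channels (assumed setup).
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            similarity = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if similarity < threshold:  # large color change => new scene
                boundaries.append(i)
        prev_hist = hist
    return boundaries
```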
Then, the first device 1 extracts video frames from the segmented video on a per-scene basis; in other words, the first device 1 may extract video frames for at least one of the multiple scenes, the extracted video frames being continuous and corresponding to that scene.
Preferably, the first device 1 extracts one or more video frames from the video when one or more trigger conditions are met.
Specifically, the first device 1 may inspect the video in real time or upon an event trigger, to judge whether the current video meets a trigger condition; when the trigger condition is met, the first device 1 extracts one or more video frames from the video.
The trigger conditions include at least any one of the following:
Triggering according to the playback duration of the video: for example, when the playback duration of the video reaches the fifth minute, frame extraction is performed. Here, if a user starts playback from the beginning of the video (i.e. 0 min 0 s), frames are extracted at the fifth minute (i.e. 5 min 0 s); if a user starts playback from the tenth minute of the video (i.e. 10 min 0 s), frames are extracted at the fifteenth minute (i.e. 15 min 0 s).
Triggering according to a playback time point of the video: for example, frame extraction is triggered at the third, fifth and seventh minutes of the video.
Triggering according to one or more pieces of playback content of the video: the playback content includes, but is not limited to, speech, people, objects, patterns and the like. For example, frame extraction is triggered when speech occurs in the video, when the character "Cao Cao" appears in the video, when the item "Coca-Cola" appears in the video, or when the "CCTV" logo appears in the video.
Here, the first device 1 may decide, based on default settings, how many video frames to extract when frame extraction is triggered, the video length to which the extracted frames correspond, and so on; it may also determine the parameter settings for frame extraction based on the trigger condition, for example, the extraction frequency (how many frames are extracted per second), the extraction quantity (how many frames are extracted in total), the video length to which the extracted frames correspond (e.g. when extraction starts from a certain time point, at which time point it stops), which video frames are extracted, and the like. For example, different trigger conditions may correspond to different parameter settings; if the playback content that triggers frame extraction differs, the corresponding parameter settings for frame extraction also differ.
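A minimal sketch of the three trigger types above, under the assumption that playback positions are tracked in seconds and that detected playback content is available as a list of labels; every name and value below is invented for the illustration:

```python
def should_trigger(elapsed_seconds, playhead_seconds, detected_contents,
                   duration_triggers=(300,),           # e.g. after 5 min of playback
                   timepoint_triggers=(180, 300, 420),  # e.g. minutes 3, 5 and 7
                   content_triggers=("CCTV",)):         # e.g. a logo appearing
    """Return True if any configured trigger condition is met.

    A real implementation would detect the moment a threshold is crossed
    rather than test exact membership; this is a simplification.
    """
    if elapsed_seconds in duration_triggers:        # play-duration trigger
        return True
    if playhead_seconds in timepoint_triggers:      # play-time-point trigger
        return True
    if any(c in content_triggers for c in detected_contents):  # content trigger
        return True
    return False
```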
The second device 2 performs target extraction on the video frames to determine the visual objects contained in the video frames.
Specifically, the second device 2 performs image processing on the video frames and, based on various image processing algorithms, e.g. by performing texture extraction, color analysis and the like on the video frames, extracts the targets contained in the video frames; then, by comparing the extracted targets with one or more object models, it determines the visual objects to which the targets correspond, thereby determining the visual objects contained in the video frames.
Here, the visual objects include, but are not limited to, people, objects, backgrounds, icons and the like.
For example, if target extraction is performed on a video frame of a TV series, it may be determined that the frame contains four targets; then, by matching these four targets against existing object models, they may be determined to be "two people, one sofa, one coffee table". Here, the object models may be preset, or may be continuously learned and acquired by means of machine learning.
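The target extraction and model matching described above might be sketched as follows; `detector` and `object_models` stand in for any concrete detection backend and model library, and their interfaces (`extract_targets`, `similarity`) are assumptions of this example:

```python
def detect_visual_objects(frame, detector, object_models):
    """Extract candidate targets from a frame and label each one by
    comparing it against known object models.

    `object_models` is assumed to map a label (e.g. "sofa") to a model
    object exposing a similarity score against an extracted target.
    """
    visual_objects = []
    for target in detector.extract_targets(frame):  # e.g. texture/color based
        best_label, best_score = None, 0.0
        for label, model in object_models.items():
            score = model.similarity(target)  # assumed model interface
            if score > best_score:
                best_label, best_score = label, score
        if best_label is not None:
            visual_objects.append((best_label, target))
    return visual_objects
```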
Preferably, the second device 2 may perform target extraction on the video frames in combination with video-related information of the video, to determine the visual objects contained in the video frames.
Specifically, the video-related information includes, but is not limited to, a synopsis of the video content, profiles of the main characters, search results related to the video, the producer/author of the video, and the like.
The second device 2 performs image processing on the video frames and, based on various image processing algorithms, e.g. by performing texture extraction, color analysis and the like, extracts the targets contained in the video frames. Then, it matches the extracted targets against one or more object models associated with the video-related information, to determine the visual objects to which the targets correspond and thus the visual objects contained in the video frames. Alternatively, it first matches the extracted targets against one or more general object models; once the category of a target (e.g. person, object, background, icon) has been determined, it matches the target against the specific models of that category associated with the video-related information, to determine the visual object to which the target corresponds and thus the visual objects contained in the video frames.
For example, if the video-related information is the cast list of the video, the extracted targets (mainly person-type targets) are matched against the image models of the individual actors in the cast list, to determine whether an extracted target is a given actor. Alternatively, an extracted target is first matched against the general models to determine that its category is "person", and the person target is then matched against the image models of the individual actors of the video, to determine whether the extracted target is a given actor.
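Building on the previous sketch, the two-stage matching against video-related information (here, a cast list) could look roughly like this; the "person" category label and the model interface are again assumptions:

```python
def label_with_cast(target, generic_models, cast_models):
    """Two-stage matching: first decide the broad category of a target,
    then, if it is a person, narrow the match to the video's cast models.

    Both arguments are assumed dictionaries mapping a label to a model
    exposing a `similarity` method, as in the earlier sketch.
    """
    category = max(generic_models,
                   key=lambda c: generic_models[c].similarity(target))
    if category != "person":
        return category
    # Narrow the search to the actors listed for this particular video.
    actor = max(cast_models, key=lambda a: cast_models[a].similarity(target))
    return actor
```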
The third device 3 determines, according to the object features corresponding to the visual objects, the scene information corresponding to the one or more video frames.
Specifically, the third device 3 analyzes the image features of the visual objects to determine the object features corresponding to the visual objects.
The object features include, but are not limited to, actions, emotions, colors, positions and the like; further, for multiple visual objects within the same video frame, the object features also include the interactions or associations between the multiple visual objects.
Then, the third device 3 analyzes, according to the object features, the interrelations among the multiple object features, to determine the scene information corresponding to the one or more video frames. Here, the scene information may be determined by analysis based on a preset object feature model, which contains mappings from various object features, or combinations thereof, to different pieces of scene information.
The scene information includes, but is not limited to, information expressing the state or atmosphere of a scene, such as witty, horrific, funny or pleasant.
For example, suppose the second device 2 determines that the visual objects are a "sofa" and "Wang Leehom", and after the third device 3 analyzes these visual objects, the object feature of the "sofa" is "beige in color" and the object features of "Wang Leehom" are "sitting" and "smiling". The third device 3 then determines, based on the mappings in the object feature model for "sofa", "beige", "sitting", "smiling" and "Wang Leehom (person)", and through analysis of the "color emotion", "expression emotion" and so on therein, that the scene information is "pleasant".
Here, those skilled in the art should understand that when the third device 3 analyzes the object features, it may first analyze pairs of object features, or match pairs of object features against the mappings in the object feature model, so that after multiple rounds of analysis or matching the final scene information is obtained. Alternatively, the third device 3 may analyze and match all the object features directly, thereby obtaining the scene information directly.
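A hedged sketch of the pairwise matching strategy just described; the contents of the feature model and the simple voting scheme are illustrative assumptions, not the claimed implementation:

```python
from collections import Counter
from itertools import combinations

# An assumed mapping from feature combinations to scene labels; in practice
# such an object feature model would be preset or learned from data.
SCENE_MODEL = {
    frozenset({"beige", "smiling"}): "pleasant",
    frozenset({"sitting", "smiling"}): "pleasant",
}

def infer_scene(object_features):
    """Match pairs of object features against the model and let the pairs
    vote on the scene label (multiple rounds of pairwise matching); matching
    all features at once is the alternative strategy described above.
    """
    votes = Counter()
    for a, b in combinations(object_features, 2):
        scene = SCENE_MODEL.get(frozenset({a, b}))
        if scene is not None:
            votes[scene] += 1
    return votes.most_common(1)[0][0] if votes else None
```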
Preferably, the third device 3 comprises a unit for determining the object features corresponding to each visual object (hereinafter referred to as "unit 31", not shown), and a unit for determining, according to the association information between the object features corresponding to each visual object, the scene information corresponding to the one or more video frames (hereinafter referred to as "unit 32", not shown).
Specifically, unit 31 analyzes the image features of each visual object separately, thereby determining the object features corresponding to each visual object.
More preferably, unit 31 may determine, according to the object attributes of each visual object, the object features corresponding to that visual object.
Specifically, unit 31 first determines, based on the name of the visual object and the like, the object attributes corresponding to each visual object; the object attributes include, but are not limited to, the various categories of the visual object.
For example, when the visual object is a "sofa", its corresponding object attribute is "furniture", and accordingly the object features of "furniture" should include "color, shape, size" and the like; when the visual object is "Wang Leehom", its corresponding object attributes are "person" and/or "entertainment star", and accordingly the object features of a "person" are "gender, expression, action, clothing" and the like, while the object feature of an "entertainment star" is "name" and the like.
Then, unit 31 analyzes the specific visual object in terms of the object features required by the determined object attributes, to obtain the object features corresponding to that visual object.
For example, the object features of the "sofa" are "beige in color, L-shaped sofa, large", and the object features of "Wang Leehom" are "male, smiling, sitting down, wearing a shirt, named Wang Leehom".
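The attribute-driven feature determination performed by unit 31 might be sketched as follows; the attribute schema and the `analyzers` mapping are assumptions mirroring the furniture/person example above:

```python
# Assumed attribute schema: which features to analyze for each object
# attribute (category), mirroring the "furniture"/"person" example.
ATTRIBUTE_FEATURES = {
    "furniture": ["color", "shape", "size"],
    "person": ["gender", "expression", "action", "clothing"],
    "entertainment_star": ["name"],
}

def features_for(visual_object, attributes, analyzers):
    """Analyze only the features required by the object's attributes.

    `analyzers` is an assumed mapping from a feature name to a callable
    that extracts that feature (as a string) from the visual object.
    """
    result = {}
    for attribute in attributes:
        for feature in ATTRIBUTE_FEATURES.get(attribute, []):
            result[feature] = analyzers[feature](visual_object)
    return result
```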
Then, unit 32 determines, according to the association information between the object features corresponding to each visual object, the scene information corresponding to the one or more video frames. Here, the association information may be derived from a preset association model; the association model contains association information between two or more different object features, and such association information may be obtained by analyzing and training on a large amount of data. For example, "sofa" and "sitting down" are associated, "smiling" and "sitting down" are associated, and "beige" and "smiling" are also emotionally associated (i.e. both express pleasure); thus, following the example above, the determined scene information is "pleasant".
The fourth device 4 determines, according to the scene information, the video semantic information corresponding to the video frames of the video.
Specifically, the fourth device 4 converts the determined scene information into information in the form of text or speech, to serve as the video semantic information corresponding to the video frames of the video. The video semantic information includes, but is not limited to, a scene overview, scene states and/or scene detail information, such as time, place, people, actions, emotions and the like. The emotions include happiness, anger, fear, calm and the like, and the scene information includes witty, horrific, funny, pleasant and the like.
Here, the fourth device 4 may use the scene information directly as the video semantic information, or may organize the scene information into the video semantic information so that it conforms to habits of linguistic expression. For example, following the example above, the video semantic information may be "a pleasant scene", or specific information expressing "someone doing something somewhere", "someone doing something with a certain emotion" or "what state something and something else are in", such as "Wang Leehom is sitting on the sofa, smiling".
Preferably, the fourth device 4 may semantically combine the visual objects and the object features corresponding to the visual objects to generate candidate video semantic information, and then determine, according to the scene information and in combination with the candidate video semantic information, the video semantic information corresponding to the video frames of the video.
Specifically, the fourth device 4 may semantically combine the visual objects and the object features corresponding to the visual objects. For example, following the example above, when the second device 2 determines that the visual objects are a "sofa" and "Wang Leehom", with the object feature of the "sofa" being "beige in color" and the object features of "Wang Leehom" being "sitting" and "smiling", the fourth device 4 may semantically combine these object features, e.g. "sofa + beige", "sofa + Wang Leehom + sitting", "smiling + sitting", and so on. The semantic combinations thus generated serve as the candidate video semantic information.
Here, those skilled in the art should understand that the above semantic combinations may be pairwise combinations or combinations of multiple object features, and may combine multiple object features of the same visual object or of different visual objects.
Then, the fourth device 4 recombines or selects among the candidate video semantic information according to the scene information; further, it may also polish the candidate video semantic information according to the scene information, thereby determining the video semantic information corresponding to the video frames of the video.
For example, following the example above, if the scene information is "pleasant", the fourth device 4 takes the "beige" and "smiling" that convey "pleasure" as part of the video semantic information, so that the determined video semantic information is "sitting on the beige sofa, smiling" or the like.
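A minimal sketch of generating candidate semantic combinations and then selecting among them by scene information; the "+" join format and the `scene_vocabulary` structure are illustrative assumptions:

```python
from itertools import combinations

def candidate_semantics(features):
    """Generate pairwise semantic combinations of object features, e.g.
    "sofa+beige", "smiling+sitting"; pairwise combination is one of the
    options described above (multi-feature combinations are also possible)."""
    return ["+".join(pair) for pair in combinations(features, 2)]

def select_semantics(candidates, scene_info, scene_vocabulary):
    """Keep the candidates whose features are associated with the scene.

    `scene_vocabulary` is an assumed mapping from a scene label to the
    features that express it, e.g. "pleasant" -> {"beige", "smiling"}.
    """
    related = scene_vocabulary.get(scene_info, set())
    return [c for c in candidates if related & set(c.split("+"))]
```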
Preferably, the processing device further comprises means for acquiring speech and/or subtitle information corresponding to the one or more video frames (hereinafter referred to as the "sixth device", not shown), and the fourth device 4 determines, according to the scene information and in combination with the speech and/or subtitle information, the video semantic information corresponding to the video frames of the video.
Specifically, the sixth device may acquire the speech and/or subtitle information corresponding to the one or more video frames by directly obtaining the audio file or subtitle file of the video to which the frames belong, or by performing speech extraction or subtitle extraction on the video.
Then, according to the scene information and in combination with the speech and/or subtitle information, the fourth device 4 determines the video semantic information by using the speech and/or subtitles to combine the scene information with the object feature information so as to generate the video semantic information; or by taking the speech and/or subtitle information directly as part of the video semantic information; or by using the speech and/or subtitle information to filter the generated candidate video semantic information; and so on.
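One hedged way to fold subtitle information into the semantic result, assuming subtitles are available as (start, end, text) tuples aligned with the frame's timestamp; the concatenation format is purely illustrative:

```python
def semantics_with_subtitles(scene_info, candidates, subtitles, timestamp):
    """Combine scene information, candidate semantics, and the subtitle
    shown at the frame's timestamp into one semantic string.

    `subtitles` is an assumed list of (start, end, text) tuples in seconds.
    """
    line = next((text for start, end, text in subtitles
                 if start <= timestamp <= end), None)
    parts = [scene_info] + list(candidates)
    if line:
        parts.append(line)  # subtitle text used directly as part of the semantics
    return "; ".join(p for p in parts if p)
```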
Preferably, the processing device further comprises means for acquiring one or more video retrieval sequences (hereinafter referred to as the "seventh device", not shown) and means for matching the video retrieval sequences against the video semantic information to determine the target videos corresponding to the video retrieval sequences (hereinafter referred to as the "eighth device", not shown).
Specifically, the seventh device acquires one or more video retrieval sequences by interacting directly with the user or with other devices capable of providing video retrieval sequences; then, the eighth device matches the video retrieval sequences against the video semantic information determined for the individual frames/continuous frames of each video, and if a video retrieval sequence matches a piece of video semantic information, the video corresponding to that video semantic information is taken as the target video.
Further, the processing device may also provide the target video to the user who sent the video retrieval sequence.
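The retrieval matching performed by the seventh and eighth devices might be sketched as follows, with simple substring matching standing in for whatever matching strategy an implementation would actually adopt; `video_index` is an assumed structure:

```python
def search_videos(query, video_index):
    """Match a retrieval sequence against the per-frame semantic strings of
    each indexed video and return the identifiers of matching videos.

    `video_index` is an assumed mapping from a video id to the list of
    semantic strings determined for its frames/continuous frames.
    """
    return [video_id for video_id, semantics in video_index.items()
            if any(query in s for s in semantics)]
```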
Fig. 2 shows a flowchart of a method for acquiring video semantic information according to another aspect of the present invention. In step S1, the processing device extracts one or more video frames from a video; in step S2, the processing device performs target extraction on the video frames to determine the visual objects contained in the video frames; in step S3, the processing device determines, according to the object features corresponding to the visual objects, the scene information corresponding to the one or more video frames; in step S4, the processing device determines, according to the scene information, the video semantic information corresponding to the video frames of the video.
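Tying steps S1 to S4 together, a skeleton of the overall flow of Fig. 2 might look as follows; each step is supplied as a callable (for instance the sketches given earlier), and all parameter names are assumptions used only to connect the steps:

```python
def acquire_video_semantics(video_path, extract_frames, detect_objects,
                            extract_features, infer_scene, compose_semantics):
    """Skeleton of steps S1-S4 with each step injected as a callable."""
    semantics = []
    for frame in extract_frames(video_path):                  # step S1
        objects = detect_objects(frame)                       # step S2
        features = extract_features(objects)
        scene = infer_scene(features)                         # step S3
        semantics.append(compose_semantics(scene, features))  # step S4
    return semantics
```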
In step S1, the processing device extracts one or more video frames from the video.
Specifically, in step S1, the processing device extracts one or more video frames from the video to be analyzed, by way of automatic extraction or manually specified extraction. Those skilled in the art should understand that the extracted video frames may be one or more, and may be continuous or discontinuous. After the processing device has extracted the video frames, the subsequent steps analyze the extracted video frames for video semantic information.
Preferably, in step S1, the processing device extracts a plurality of video frames from the video, wherein the plurality of video frames are continuous. That is, the processing device may simultaneously extract multiple continuous video frames from the same video; here, continuous video frames are multiple video frames in which the playback times of any two adjacent frames differ by no more than a certain threshold. After the processing device has extracted the continuous video frames, the subsequent steps analyze the extracted continuous video frames for video semantic information.
Here, the analysis performed may treat the continuous video frames as a whole, yielding a single piece of video semantic information for the multiple continuous frames; alternatively, the continuous video frames may be analyzed separately, yielding multiple pieces of video semantic information for the multiple continuous frames.
More preferably, the method further comprises a step S5, in which the processing device performs scene segmentation on the video; then, in step S1, the processing device extracts, according to the scenes corresponding to the video, a plurality of video frames from the video, wherein the plurality of video frames are continuous and correspond to the same scene.
Specifically, in step S5, the processing device performs scene segmentation on the video according to its time, color, character changes and the like. For example, the video may be segmented under a default setting of one scene per minute, with each minute of the video treated as one scene; or, scene segmentation may be performed according to color changes in the video, e.g. when the picture switches from cool tones to warm tones the scene is considered to have changed; or, scene segmentation may be performed according to character changes in the video, e.g. when the picture changes from two people to three people the scene is considered to have changed.
Those skilled in the art should understand that other scene segmentation methods are also applicable to the present invention for performing scene segmentation on the video.
Then, in step S1, the processing device extracts video frames from the segmented video on a per-scene basis; in other words, the processing device may extract video frames for at least one of the multiple scenes, the extracted video frames being continuous and corresponding to that scene.
Preferably, when one or more trigger conditions are met, the processing device extracts, in step S1, one or more video frames from the video.
Specifically, in step S1, the processing device may inspect the video in real time or upon an event trigger, to judge whether the current video meets a trigger condition; when the trigger condition is met, the processing device extracts one or more video frames from the video.
The trigger conditions include at least any one of the following:
Triggering according to the playback duration of the video: for example, when the playback duration of the video reaches the fifth minute, frame extraction is performed. Here, if a user starts playback from the beginning of the video (i.e. 0 min 0 s), frames are extracted at the fifth minute (i.e. 5 min 0 s); if a user starts playback from the tenth minute of the video (i.e. 10 min 0 s), frames are extracted at the fifteenth minute (i.e. 15 min 0 s).
Triggering according to a playback time point of the video: for example, frame extraction is triggered at the third, fifth and seventh minutes of the video.
Triggering according to one or more pieces of playback content of the video: the playback content includes, but is not limited to, speech, people, objects, patterns and the like. For example, frame extraction is triggered when speech occurs in the video, when the character "Cao Cao" appears in the video, when the item "Coca-Cola" appears in the video, or when the "CCTV" logo appears in the video.
Here, the processing device may decide, based on default settings, how many video frames to extract when frame extraction is triggered, the video length to which the extracted frames correspond, and so on; it may also determine the parameter settings for frame extraction based on the trigger condition, for example, the extraction frequency (how many frames are extracted per second), the extraction quantity (how many frames are extracted in total), the video length to which the extracted frames correspond (e.g. when extraction starts from a certain time point, at which time point it stops), which video frames are extracted, and the like. For example, different trigger conditions may correspond to different parameter settings; if the playback content that triggers frame extraction differs, the corresponding parameter settings for frame extraction also differ.
在步骤S2中,所述处理设备对所述视频帧进行目标提取,以确定所述视频帧中所包含的视觉对象。In step S2, the processing device performs target extraction on the video frame to determine a visual object included in the video frame.
具体地,在步骤S2中,所述处理设备通过对所述视频帧进行图像处理,基于各类图像处理算法,如通过对所述视频帧进行纹理提取、色彩分析等方式,提取所述视频帧中所包含的目标;然后,通过将所提取的目标与一个或多个对象模型进行比较,以确定所述目标所对应的视觉对象,进而确定所述视频帧中所包含的视觉对象。Specifically, in step S2, the processing device extracts the video frame by performing image processing on the video frame, based on various image processing algorithms, such as performing texture extraction, color analysis, and the like on the video frame. The target included in the video frame is then determined by comparing the extracted target with one or more object models to determine a visual object corresponding to the target, thereby determining a visual object included in the video frame.
在此,所述视觉对象包括但不限于人物、物品、背景、图标等。Here, the visual objects include, but are not limited to, people, objects, backgrounds, icons, and the like.
例如,若对某个电视剧的视频帧进行目标提取,则可以确定该视频帧中包含四个目标,然后,将这四个目标与现有的对象模型进行匹配,可以确定这四个目标分别是:“两个人物、一个沙发、一个茶几”。在此,所述的对象模型可以是预设置的,也可以是基于机器学习的方式不断学习获取的。For example, if a video frame of a TV drama is subjected to target extraction, it can be determined that the video frame contains four targets, and then the four targets are matched with the existing object model, and it can be determined that the four targets are respectively : "Two characters, a sofa, a coffee table." Here, the object model may be pre-set, or may be continuously learned and acquired based on a machine learning method.
优选地,在步骤S2中,所述处理设备可以对所述视频帧进行目标提取,结合所述视频的视频相关信息,以确定所述视频帧中所包含的视觉对象。Preferably, in step S2, the processing device may perform target extraction on the video frame, and combine video related information of the video to determine a visual object included in the video frame.
具体地,所述视频相关信息包括但不限于视频内容简介、视频主要人物简介、视频相关搜索结果、视频出品方/作者等。Specifically, the video related information includes, but is not limited to, a video content introduction, a video main character profile, a video related search result, a video producer/author, and the like.
在步骤S2中,所述处理设备通过对所述视频帧进行图像处理,基于各类图像处理算法,如通过对所述视频帧进行纹理提取、色彩分析等方式,提取所述视频帧中所包含的目标。然后,通过将所提取的目标与一个或多个与所述视频相关信息关联的对象模型进行匹配,以确定所述目标所对应的视觉对象,进而确定所述视频帧中所包含的视觉对象;或者,先将将所提取的目标与一个或多个通用对象模型进行匹配,当确定了所述目标的所述类别(如人物、物品、背景、图标等类别)后,将所述目标与该视频信息相关联的该类别中的具体模型进行匹配,以确定所述目标所对应的视觉对象,进而确定所述视频帧中所包含的视觉对象。In step S2, the processing device extracts the video frame by performing image processing on the video frame, based on various image processing algorithms, such as performing texture extraction, color analysis, and the like on the video frame. The goal. Then, by matching the extracted target with one or more object models associated with the video related information to determine a visual object corresponding to the target, thereby determining a visual object included in the video frame; Alternatively, the extracted target is first matched with one or more common object models, and when the category of the target (such as a character, an item, a background, an icon, etc.) is determined, the target is A specific model in the category associated with the video information is matched to determine a visual object corresponding to the target, thereby determining a visual object contained in the video frame.
For example, if the video-related information is the video's cast list, the extracted targets (mainly person-type targets) are matched against the image models of the actors in the cast list, to determine whether an extracted target is a given actor; alternatively, an extracted target is first matched against generic models to determine that its category is "person", and the person target is then matched against the image models of the video's actors to determine whether it is a given actor.
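The two-stage variant can be sketched on top of the matcher above: classify the target against generic category models first, and only for person-type targets continue to models built from the video's cast list. The function and model names are hypothetical.

    def identify_target(feat, generic_models, cast_models):
        """Match against generic categories, then video-specific cast models."""
        category = match_objects([feat], generic_models)[0]
        if category != "person":
            return category                  # e.g. "sofa": stop at the category
        actor = match_objects([feat], cast_models)[0]
        return actor if actor is not None else "person"  # unidentified person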
In step S3, the processing device determines the scene information corresponding to the one or more video frames according to the object features corresponding to the visual objects.
Specifically, in step S3, the processing device analyzes the image features of each visual object to determine the object features corresponding to that visual object.
The object features include, but are not limited to, actions, emotions, colors, and positions; further, for multiple visual objects within the same video frame, the object features also include interactions or associations among those visual objects.
The processing device then analyzes the relationships among the object features to determine the scene information corresponding to the one or more video frames. Here, the scene information may be determined on the basis of a preset object feature model, which contains mappings from individual object features, or combinations of them, to different scene information.
The scene information includes, but is not limited to, information describing a scene state or scene atmosphere, such as witty, horrific, funny, or pleasant.
For example, suppose the processing device determines that the visual objects are a sofa and Wang Leehom, and its analysis yields the object feature "beige in color" for the sofa and the features "sitting" and "smiling" for Wang Leehom. In step S3, the processing device, using the object feature model's mappings for "sofa", "beige", "sitting", "smiling", and "Wang Leehom (person)", and analyzing aspects such as color mood and facial-expression mood, determines that the scene information is "pleasant".
Here, those skilled in the art will appreciate that, when analyzing the object features, the processing device may first analyze features pairwise, or match pairs of features against the mappings in the object feature model, so that the final scene information is obtained after multiple rounds of analysis or matching. Alternatively, the processing device may analyze and match all object features at once, obtaining the scene information directly.
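One possible reading of the object feature model is a table mapping combinations of object features to scene information, with the scene chosen by greatest overlap with the observed features; the entries below are illustrative only.

    # Hypothetical object feature model: feature combinations -> scene info.
    FEATURE_MODEL = [
        ({"beige", "smiling", "sitting"}, "pleasant"),
        ({"dark", "screaming", "running"}, "horrific"),
    ]

    def scene_from_features(observed):
        """Return the scene whose feature combination best overlaps `observed`."""
        best, best_overlap = None, 0
        for combo, scene in FEATURE_MODEL:
            overlap = len(combo & observed)
            if overlap > best_overlap:
                best, best_overlap = scene, overlap
        return best

    # scene_from_features({"sofa", "beige", "sitting", "smiling"}) -> "pleasant"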
Preferably, step S3 includes step S31 (not shown) and step S32 (not shown). In step S31, the processing device determines the object features corresponding to each visual object; in step S32, the processing device determines the scene information corresponding to the one or more video frames according to the association information among the object features corresponding to the visual objects.
Specifically, in step S31, the processing device analyzes the image features of each visual object separately, thereby determining the object features corresponding to each visual object.
More preferably, in step S31, the processing device may determine the object features corresponding to each visual object according to that object's attributes.
Specifically, in step S31, the processing device first determines the object attributes corresponding to each visual object, based for example on the object's name; the object attributes include, but are not limited to, the various categories to which the visual object belongs.
For example, when the visual object is a sofa, its object attribute is "furniture", and the object features of furniture accordingly include color, shape, and size; when the visual object is Wang Leehom, its object attributes are "person" and/or "entertainment star", so the features of a person are gender, expression, action, and clothing, while the features of an entertainment star include the name.
Then, in step S31, the processing device analyzes the specific visual object in terms of the object features required by the determined object attributes, to obtain the object features corresponding to that visual object.
For example, the object features of the sofa are "beige, L-shaped, large", and those of Wang Leehom are "male, smiling, sitting down, wearing a shirt, named Wang Leehom".
Then, in step S32, the processing device determines the scene information corresponding to the one or more video frames according to the association information among the object features of the visual objects. Here, the association information may be derived from a preset association model, which contains association information between two or more different object features and may itself be obtained by analyzing and training on large amounts of data. For example, "sofa" and "sitting down" are associated, "smiling" and "sitting down" are associated, and "beige" and "smiling" are emotionally associated (both indicating pleasure); continuing the example above, the determined scene information is therefore "pleasant".
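The pairwise association model can likewise be sketched as a lookup over feature pairs that accumulates evidence for a scene across multiple rounds; the pairs and weights below are illustrative stand-ins for values that would be learned from data.

    from itertools import combinations

    # Hypothetical association model: feature pair -> (scene, weight).
    ASSOCIATIONS = {
        frozenset({"sofa", "sitting down"}):    ("pleasant", 0.6),
        frozenset({"smiling", "sitting down"}): ("pleasant", 0.7),
        frozenset({"beige", "smiling"}):        ("pleasant", 0.8),
    }

    def scene_by_association(features):
        """Accumulate pairwise association evidence and return the top scene."""
        votes = {}
        for a, b in combinations(features, 2):
            hit = ASSOCIATIONS.get(frozenset({a, b}))
            if hit:
                scene, weight = hit
                votes[scene] = votes.get(scene, 0.0) + weight
        return max(votes, key=votes.get) if votes else None

    # scene_by_association(["sofa", "beige", "sitting down", "smiling"])
    # -> "pleasant"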
In step S4, the processing device determines the video semantic information corresponding to the video frames of the video according to the scene information.
Specifically, in step S4, the processing device converts the determined scene information into information in a form such as text or speech, to serve as the video semantic information corresponding to the video frames. The video semantic information includes, but is not limited to, a scene overview, scene state, and/or scene details, such as time, place, people, actions, and emotions; the emotions include happy, angry, fearful, and calm, and the scene information includes witty, horrific, funny, and pleasant.
Here, in step S4, the processing device may use the scene information directly as the video semantic information, or may rearrange it into video semantic information that conforms to natural language usage. Continuing the example above, the video semantic information may be "a pleasant scene", or may be more specific information of the form "someone is doing something somewhere", "someone is doing something with a certain emotion", or "something and something else are in a certain state", such as "Wang Leehom is sitting on the sofa, smiling".
Preferably, in step S4, the processing device may semantically combine the visual objects with their corresponding object features to generate candidate video semantic information, and then determine the video semantic information corresponding to the video frames according to the scene information in combination with the candidate video semantic information.
Specifically, in step S4, the processing device may semantically combine the visual objects and their corresponding object features. Continuing the example above, when the processing device determines that the visual objects are a sofa and Wang Leehom, that the sofa's feature is "beige in color", and that Wang Leehom's features are "sitting" and "smiling", it may form semantic combinations such as "sofa + beige", "sofa + Wang Leehom + sitting", and "smiling + sitting". The combinations so generated serve as the candidate video semantic information.
Here, those skilled in the art will appreciate that such semantic combinations may be pairwise or may involve more than two object features, and may combine multiple features of the same visual object or features of different visual objects.
The processing device then recombines or selects among the candidate video semantic information according to the scene information; further, it may also polish the candidates according to the scene information, thereby determining the video semantic information corresponding to the video frames.
For example, continuing the example above, if the scene information is "pleasant", the processing device takes "beige" and "smiling", which carry the "pleasant" connotation, as part of the video semantic information, so that the determined video semantic information is, for example, "sitting on a beige sofa, smiling".
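Candidate generation and scene-based selection can be sketched as below, restricted to pairwise combinations for brevity (the description also allows larger combinations); the "+" notation and the term-based filter are illustrative choices.

    def candidate_semantics(objects):
        """Pairwise semantic combinations of visual objects and their features.

        objects: dict mapping each visual object to its feature list, e.g.
        {"sofa": ["beige"], "Wang Leehom": ["sitting", "smiling"]}.
        """
        terms = list(objects)
        for feats in objects.values():
            terms.extend(feats)
        return [f"{a}+{b}" for i, a in enumerate(terms) for b in terms[i + 1:]]

    def select_by_scene(candidates, scene_terms):
        """Keep candidates carrying at least one scene-bearing term, e.g.
        scene_terms = {"beige", "smiling"} for a "pleasant" scene."""
        return [c for c in candidates
                if any(t in c.split("+") for t in scene_terms)]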
Preferably, the method further includes step S6 (not shown), in which the processing device acquires speech and/or subtitle information corresponding to the one or more video frames; in step S4, the processing device then determines the video semantic information corresponding to the video frames according to the scene information, in combination with the speech and/or subtitle information.
Specifically, in step S6, the processing device may acquire the speech and/or subtitle information corresponding to the one or more video frames either by directly obtaining the speech file or subtitle file of the video to which the frames belong, or by performing speech extraction or subtitle extraction on the video.
Then, in step S4, the processing device, according to the scene information and in combination with the speech and/or subtitle information, uses the speech and/or subtitles to combine the scene information with the object feature information to generate the video semantic information; or it takes the speech and/or subtitle information directly as part of the video semantic information; or it uses the speech and/or subtitle information to filter the generated candidate video semantic information, thereby determining the video semantic information.
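Folding speech or subtitle text into the result can be as simple as the sketch below; the sentence template is purely illustrative, since the description leaves the exact composition open.

    def build_semantics(scene, object_phrase, transcript=None):
        """Compose video semantic information from the scene information,
        an object/feature phrase, and optional speech/subtitle text."""
        semantics = f"{object_phrase} in a {scene} scene"
        if transcript:
            semantics += f', saying "{transcript}"'
        return semantics

    # build_semantics("pleasant", "Wang Leehom smiling on a beige sofa",
    #                 "welcome home")
    # -> 'Wang Leehom smiling on a beige sofa in a pleasant scene,
    #    saying "welcome home"'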
Preferably, the method further includes step S7 (not shown) and step S8 (not shown). In step S7, the processing device acquires one or more video retrieval sequences; in step S8, the processing device matches the video retrieval sequences against the video semantic information to determine the target videos corresponding to the video retrieval sequences.
Specifically, in step S7, the processing device acquires one or more video retrieval sequences by interacting directly with a user or with another device capable of providing video retrieval sequences; then, in step S8, the processing device matches each video retrieval sequence against the video semantic information determined for the individual frames or consecutive frames of each video, and if a retrieval sequence matches certain video semantic information, the video corresponding to that semantic information is taken as the target video.
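Matching a retrieval sequence against the stored semantic information can be sketched with simple token overlap, a stand-in for whatever matching scheme the implementation adopts.

    def find_target_videos(query, index):
        """Return the ids of videos whose frame-level semantic information
        matches the retrieval sequence.

        index: dict mapping a video id to the semantic strings determined
        for its individual frames or consecutive-frame groups.
        """
        q = set(query.lower().split())
        return [vid for vid, semantics in index.items()
                if any(q & set(s.lower().split()) for s in semantics)]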
Further, the processing device may also provide the target video to the user who submitted the video retrieval sequence.
It should be noted that the present invention may be implemented in software and/or a combination of software and hardware, for example using an application-specific integrated circuit (ASIC), a general-purpose computer, or any other similar hardware device. In one embodiment, the software program of the present invention may be executed by a processor to implement the steps or functions described above. Likewise, the software program of the present invention (including related data structures) may be stored in a computer-readable recording medium, for example a RAM memory, a magnetic or optical drive, a floppy disk, or similar devices. In addition, some steps or functions of the present invention may be implemented in hardware, for example as circuitry that cooperates with a processor to perform the individual steps or functions.
In addition, part of the present invention may be embodied as a computer program product, for example computer program instructions which, when executed by a computer, may invoke or provide the method and/or technical solution according to the present invention through the operation of that computer. The program instructions invoking the method of the present invention may be stored in a fixed or removable recording medium, and/or transmitted via a data stream in a broadcast or other signal-bearing medium, and/or stored in the working memory of a computer device that runs according to the program instructions. Here, an embodiment according to the present invention includes an apparatus that includes a memory for storing computer program instructions and a processor for executing the program instructions, wherein, when the computer program instructions are executed by the processor, the apparatus is triggered to perform the methods and/or technical solutions according to the multiple embodiments of the present invention described above.
It is evident to those skilled in the art that the present invention is not limited to the details of the above exemplary embodiments, and that it may be embodied in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in all respects as exemplary and non-limiting, and the scope of the invention is defined by the appended claims rather than by the foregoing description; all changes falling within the meaning and range of equivalents of the claims are therefore intended to be embraced by the invention. No reference sign in the claims should be construed as limiting the claim concerned. Moreover, the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. Multiple units or devices recited in an apparatus claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" denote names and do not denote any particular order.

Claims (23)

1. A method for acquiring video semantic information, wherein the method comprises the following steps:
    extracting one or more video frames from a video;
    performing target extraction on the video frames to determine the visual objects contained in the video frames;
    determining, according to the object features corresponding to the visual objects, the scene information corresponding to the one or more video frames;
    determining, according to the scene information, the video semantic information corresponding to the video frames of the video.
2. The method according to claim 1, wherein the step of extracting one or more video frames from a video comprises:
    extracting a plurality of video frames from the video, wherein the plurality of video frames are consecutive.
3. The method according to claim 2, wherein the method further comprises:
    performing scene segmentation on the video;
    wherein the step of extracting a plurality of video frames from the video comprises:
    extracting a plurality of video frames from the video according to the scenes in the video, wherein the plurality of video frames are consecutive and correspond to the same scene.
4. The method according to any one of claims 1 to 3, wherein the step of extracting one or more video frames from a video comprises:
    extracting one or more video frames from the video when one or more trigger conditions are met;
    wherein the trigger conditions include at least any one of the following:
    triggering according to the playing duration of the video;
    triggering according to a playing time point of the video;
    triggering according to one or more items of playback content of the video.
5. The method according to any one of claims 1 to 4, wherein the step of determining the visual objects contained in the video frames comprises:
    performing target extraction on the video frames in combination with video-related information of the video, to determine the visual objects contained in the video frames.
6. The method according to any one of claims 1 to 5, wherein the step of determining the scene information corresponding to the one or more video frames comprises:
    determining the object features corresponding to each of the visual objects;
    determining the scene information corresponding to the one or more video frames according to the association information among the object features corresponding to the visual objects.
7. The method according to claim 6, wherein the step of determining the object features corresponding to each of the visual objects comprises:
    determining the object features corresponding to each visual object according to the object attributes of that visual object.
8. The method according to any one of claims 1 to 7, wherein the step of determining the video semantic information corresponding to the video frames of the video comprises:
    semantically combining the visual objects with the object features corresponding to the visual objects, to generate candidate video semantic information;
    determining the video semantic information corresponding to the video frames of the video according to the scene information, in combination with the candidate video semantic information.
9. The method according to any one of claims 1 to 8, wherein the method further comprises:
    acquiring speech and/or subtitle information corresponding to the one or more video frames;
    wherein the step of determining the video semantic information corresponding to the video frames of the video comprises:
    determining the video semantic information corresponding to the video frames of the video according to the scene information, in combination with the speech and/or subtitle information.
10. The method according to any one of claims 1 to 9, wherein the method further comprises:
    acquiring one or more video retrieval sequences;
    matching the video retrieval sequences against the video semantic information, to determine the target videos corresponding to the video retrieval sequences.
11. A processing device for acquiring video semantic information, wherein the processing device comprises:
    means for extracting one or more video frames from a video;
    means for performing target extraction on the video frames to determine the visual objects contained in the video frames;
    means for determining, according to the object features corresponding to the visual objects, the scene information corresponding to the one or more video frames;
    means for determining, according to the scene information, the video semantic information corresponding to the video frames of the video.
12. The processing device according to claim 11, wherein the means for extracting one or more video frames from a video is configured to:
    extract a plurality of video frames from the video, wherein the plurality of video frames are consecutive.
13. The processing device according to claim 12, wherein the processing device further comprises:
    means for performing scene segmentation on the video;
    wherein the means for extracting a plurality of video frames from the video is configured to:
    extract a plurality of video frames from the video according to the scenes in the video, wherein the plurality of video frames are consecutive and correspond to the same scene.
14. The processing device according to any one of claims 11 to 13, wherein the means for extracting one or more video frames from a video is configured to:
    extract one or more video frames from the video when one or more trigger conditions are met;
    wherein the trigger conditions include at least any one of the following:
    triggering according to the playing duration of the video;
    triggering according to a playing time point of the video;
    triggering according to one or more items of playback content of the video.
15. The processing device according to any one of claims 11 to 14, wherein the means for determining the visual objects contained in the video frames is configured to:
    perform target extraction on the video frames in combination with video-related information of the video, to determine the visual objects contained in the video frames.
16. The processing device according to any one of claims 11 to 15, wherein the means for determining the scene information corresponding to the one or more video frames comprises:
    a unit for determining the object features corresponding to each of the visual objects;
    a unit for determining the scene information corresponding to the one or more video frames according to the association information among the object features corresponding to the visual objects.
17. The processing device according to claim 16, wherein the unit for determining the object features corresponding to each of the visual objects is configured to:
    determine the object features corresponding to each visual object according to the object attributes of that visual object.
18. The processing device according to any one of claims 11 to 17, wherein the means for determining the video semantic information corresponding to the video frames of the video is configured to:
    semantically combine the visual objects with the object features corresponding to the visual objects, to generate candidate video semantic information;
    determine the video semantic information corresponding to the video frames of the video according to the scene information, in combination with the candidate video semantic information.
19. The processing device according to any one of claims 11 to 18, wherein the processing device further comprises:
    means for acquiring speech and/or subtitle information corresponding to the one or more video frames;
    wherein the means for determining the video semantic information corresponding to the video frames of the video is configured to:
    determine the video semantic information corresponding to the video frames of the video according to the scene information, in combination with the speech and/or subtitle information.
20. The processing device according to any one of claims 11 to 19, wherein the processing device further comprises:
    means for acquiring one or more video retrieval sequences;
    means for matching the video retrieval sequences against the video semantic information, to determine the target videos corresponding to the video retrieval sequences.
21. A computer-readable medium having a computer program stored thereon, the computer program being executable by a processor to perform the method according to any one of claims 1 to 10.
22. A computer program product, wherein, when the computer program product is executed by a computer device, the method according to any one of claims 1 to 10 is performed.
23. A computer device, comprising:
    a memory for storing one or more computer programs;
    one or more processors connected to the memory,
    wherein, when the one or more computer programs are executed by the one or more processors, the one or more processors are caused to perform the method according to any one of claims 1 to 10.
PCT/CN2019/072219 2018-01-25 2019-01-17 Method and apparatus for acquiring video semantic information WO2019144840A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810074371.2A CN108388836B (en) 2018-01-25 2018-01-25 Method and device for acquiring video semantic information
CN201810074371.2 2018-01-25

Publications (1)

Publication Number Publication Date
WO2019144840A1 true WO2019144840A1 (en) 2019-08-01

Family

ID=63077199

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/072219 WO2019144840A1 (en) 2018-01-25 2019-01-17 Method and apparatus for acquiring video semantic information

Country Status (2)

Country Link
CN (1) CN108388836B (en)
WO (1) WO2019144840A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388836B (en) * 2018-01-25 2022-02-11 北京一览科技有限公司 Method and device for acquiring video semantic information
CN109189986B (en) * 2018-08-29 2020-07-28 百度在线网络技术(北京)有限公司 Information recommendation method and device, electronic equipment and readable storage medium
CN113361313A (en) * 2021-02-20 2021-09-07 温州大学 Video retrieval method based on multi-label relation of correlation analysis

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102842033A (en) * 2012-08-17 2012-12-26 苏州两江科技有限公司 Human expression emotion semantic recognizing method based on face recognition
CN105740777B (en) * 2016-01-25 2019-06-25 联想(北京)有限公司 Information processing method and device
CN107590442A (en) * 2017-08-22 2018-01-16 华中科技大学 A kind of video semanteme Scene Segmentation based on convolutional neural networks

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101430689A (en) * 2008-11-12 2009-05-13 哈尔滨工业大学 Detection method for figure action in video
CN102663015A (en) * 2012-03-21 2012-09-12 上海大学 Video semantic labeling method based on characteristics bag models and supervised learning
CN104778224A (en) * 2015-03-26 2015-07-15 南京邮电大学 Target object social relation identification method based on video semantics
CN108388836A (en) * 2018-01-25 2018-08-10 北京一览科技有限公司 A kind of method and apparatus for obtaining video semanteme information

Also Published As

Publication number Publication date
CN108388836B (en) 2022-02-11
CN108388836A (en) 2018-08-10

Similar Documents

Publication Publication Date Title
US11830241B2 (en) Auto-curation and personalization of sports highlights
US10679063B2 (en) Recognizing salient video events through learning-based multimodal analysis of visual features and audio-based analytics
Dhall et al. Emotion recognition in the wild challenge 2013
US20190289359A1 (en) Intelligent video interaction method
KR102290419B1 (en) Method and Appratus For Creating Photo Story based on Visual Context Analysis of Digital Contents
EP2877254B1 (en) Method and apparatus for controlling augmented reality
US9244924B2 (en) Classification, search, and retrieval of complex video events
JP2020536455A5 (en)
EP3021233A1 (en) Electronic apparatus for generating summary content and method thereof
WO2019144840A1 (en) Method and apparatus for acquiring video semantic information
US20170065888A1 (en) Identifying And Extracting Video Game Highlights
CN111241340B (en) Video tag determining method, device, terminal and storage medium
JP2007281858A (en) Animation editing device
CN114342353A (en) Method and system for video segmentation
CN112423133B (en) Video switching method and device, computer readable storage medium and computer equipment
CN110740389A (en) Video positioning method and device, computer readable medium and electronic equipment
US9286710B2 (en) Generating photo animations
CN112312215B (en) Startup content recommendation method based on user identification, smart television and storage medium
CN109286848B (en) Terminal video information interaction method and device and storage medium
KR102144978B1 (en) Customized image recommendation system using shot classification of images
CN106851395A (en) Video broadcasting method and player
WO2017143951A1 (en) Expression feedback method and smart robot
US9697632B2 (en) Information processing apparatus, information processing method, and program
US10417356B1 (en) Physics modeling for interactive content
CN112165626B (en) Image processing method, resource acquisition method, related equipment and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19743758

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12/11/2020)

122 Ep: pct application non-entry in european phase

Ref document number: 19743758

Country of ref document: EP

Kind code of ref document: A1