CN112822506A - Method and apparatus for analyzing video stream - Google Patents

Method and apparatus for analyzing video stream

Info

Publication number
CN112822506A
Authority
CN
China
Prior art keywords
text
subject
sub
feature representation
video stream
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110089228.2A
Other languages
Chinese (zh)
Inventor
宋颖鑫
廖玺举
李远杭
关云鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority claimed from application CN202110089228.2A
Publication of CN112822506A
Legal status: Pending

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21Server components or server architectures
    • H04N21/218Source of audio or video content, e.g. local disk arrays
    • H04N21/2187Live feed
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • G06F40/169Annotation, e.g. comment data or footnotes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/10Machine learning using kernel methods, e.g. support vector machines [SVM]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/22Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure discloses a method and an apparatus for analyzing a video stream, relating to the field of artificial intelligence, and in particular to the fields of computer video technology, knowledge-based live broadcast, and deep learning. A specific implementation according to one embodiment is as follows: acquiring image data and audio data of a video stream; determining, from a first text corresponding to the audio data, a first subject text capable of identifying a subject of the first text; determining, from a second text corresponding to the image data, a second subject text capable of identifying a subject of the second text, using an image feature representation of the image data; and determining a topic of the video stream based on the first subject text and the second subject text. In this way, analysis of the video stream can be achieved, and the topic corresponding to the video stream can be accurately identified.

Description

Method and apparatus for analyzing video stream
Technical Field
The present disclosure relates to the field of artificial intelligence technology, in particular to the fields of computer video technology, knowledge-based live broadcast, and deep learning, and more particularly to methods, apparatuses, electronic devices, computer-readable storage media, and computer program products for analyzing video streams.
Background
In the information age, video, as one information medium, is increasingly popular. For example, live video streaming has become one of the best carriers for delivering knowledge. Since a video stream generally includes rich information expressed in images and audio, a conventional scheme that determines the topic of a video stream only from the title of the video or its description text may have difficulty accurately representing the topic of the video stream, and is therefore insufficient for understanding the content of the video stream, particularly the content of a knowledge-based live video stream. Thus, there is a need for a scheme capable of accurately identifying the topic of a video stream.
Disclosure of Invention
The present disclosure provides a method, apparatus, device, storage medium and computer program for analyzing a video stream.
According to a first aspect of the present disclosure, there is provided a method for analyzing a video stream, the method comprising acquiring image data and audio data of the video stream. The method also includes determining, from a first text corresponding to the audio data, a first subject text capable of identifying a subject of the first text. The method also includes determining, from a second text corresponding to the image data, a second subject text that is capable of identifying a subject of the second text using the image feature representation of the image data. The method also includes determining a topic of the video stream based on the first topic text and the second topic text.
According to a second aspect of the present disclosure, there is provided an apparatus for analyzing a video stream, the apparatus comprising a data acquisition module configured to acquire image data and audio data of the video stream. The apparatus also includes a first subject text determination module configured to determine, from a first text corresponding to the audio data, a first subject text capable of identifying a subject of the first text. The apparatus includes a second subject text determination module that determines a second subject text capable of identifying a subject of the second text from the second text corresponding to the image data using the image feature representation of the image data. The apparatus also includes a topic determination module configured to determine a topic of the video stream based on the first topic text and the second topic text.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method according to the first aspect of the disclosure.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method according to the first aspect of the present disclosure.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to the first aspect of the present disclosure.
According to the scheme of the present disclosure, analysis of the video stream can be realized, and the topic corresponding to the video stream can be accurately identified.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. In the drawings, like or similar reference characters designate like or similar elements, and wherein:
FIG. 1 is a schematic diagram illustrating an example environment in which embodiments of the present disclosure can be implemented.
Fig. 2 illustrates a flow diagram of a method for analyzing a video stream, according to some embodiments of the present disclosure.
Fig. 3 illustrates a flow diagram of a method for determining a first subject text, according to some embodiments of the present disclosure.
Fig. 4 illustrates a flow diagram of a method for determining a second subject text, according to some embodiments of the present disclosure.
Fig. 5 shows a schematic diagram of a method for determining a second subject text, according to some embodiments of the present disclosure.
Fig. 6 shows a schematic block diagram of an apparatus for analyzing a video stream according to an embodiment of the present disclosure.
FIG. 7 illustrates a block diagram of an electronic device capable of implementing various embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In describing embodiments of the present disclosure, the terms "include" and its derivatives should be interpreted as being inclusive, i.e., "including but not limited to. The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first," "second," and the like may refer to different or the same object. Other explicit and implicit definitions are also possible below.
In the description of embodiments of the present disclosure, the term "model" refers to a construct that can learn, from training data, the associations between respective inputs and outputs, such that after training is completed a given input is processed based on the trained set of parameters to generate a corresponding output. A "model" may also sometimes be referred to as a "neural network", "learning model", "learning network", or "network". These terms are used interchangeably herein.
As discussed above, in understanding a video stream, such as a knowledge-based live broadcast, that includes rich information for various dimensions (e.g., features of the image, text included in the image, and audio), conventional approaches fail to accurately understand it from multiple dimensions and thus fail to efficiently determine the topic of the video stream.
To address, at least in part, one or more of the above issues and other potential issues, embodiments of the present disclosure propose a solution for determining the topic of a video stream (e.g., of different time periods thereof) from information of multiple dimensions included in the video stream, such as audio, text in images, and features of the images. In this scheme, by performing speech recognition on audio data of the video stream and performing character recognition on image data of the video stream, an audio recognition text and an image recognition text corresponding to the video stream can be acquired. Based on the audio recognition text, and on the image recognition text together with its corresponding image feature information, two subject texts indicating the topic of the video stream can be extracted, respectively. From these two subject texts, at least one key phrase can be extracted as a tag to represent the topic of the video stream.
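To make the data flow of this scheme concrete, the following is a minimal, purely illustrative Python sketch; every function body is a placeholder stub standing in for the components described in the rest of this disclosure (ASR, OCR, subject-text extraction, key-phrase extraction), and none of the names are defined by the disclosure itself.

```python
from typing import List, Tuple

def speech_to_text(audio: bytes) -> str:
    # placeholder for automatic speech recognition on the audio data
    return "today we will talk about factorization"

def recognize_image_text(frame) -> List[Tuple[str, tuple]]:
    # placeholder for OCR plus text-box detection: (sub_text, box) pairs
    return [("conversion of kinetic and gravitational potential energy", (100, 40, 900, 120))]

def extract_audio_subject(first_text: str) -> str:
    # placeholder for the BIO sequence-labeling step (FIG. 3)
    return "factorization"

def extract_image_subject(sub_texts: List[Tuple[str, tuple]], frame) -> str:
    # placeholder for the feature-fusion classification step (FIGS. 4-5)
    return sub_texts[0][0]

def extract_keywords(text: str) -> List[str]:
    # placeholder for key-phrase extraction over the combined subject text
    return text.split()[:3]

def analyze_segment(audio: bytes, frame) -> List[str]:
    first_subject = extract_audio_subject(speech_to_text(audio))
    second_subject = extract_image_subject(recognize_image_text(frame), frame)
    return extract_keywords(first_subject + " " + second_subject)
```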
In this manner, the topic of the video stream may be more accurately identified based on various information included in the video stream, such as the audio-recognized text, the image-recognized text, and the corresponding image feature information, which, for example, facilitates a user in subsequently retrieving the video stream or locating a particular position in the video stream based on a tag associated with the topic.
FIG. 1 is a schematic diagram illustrating an example environment 100 in which various embodiments of the present disclosure can be implemented.
As shown in FIG. 1, the environment 100 includes a computing device 120 for processing an input video stream 110 to determine a topic 130 associated with the video stream 110. The video stream 110 may be any type of video in which a series of still images is captured, recorded, processed, stored, transmitted, and reproduced as electrical signals. In some embodiments, a live video stream, e.g., a knowledge-based live broadcast, may include multiple video segments associated with multiple knowledge points. Accordingly, the computing device may determine, from the plurality of video segments of the video stream, a plurality of topics respectively associated with the plurality of video segments of the video stream 110.
Based on the video stream 110, the computing device may extract data therefrom, such as audio data 104 and image data 114. It is to be understood that the image data 114 may be one or more frames of images extracted from a predetermined period of the video stream, and the image data 114 may further include text data (e.g., text and subtitles in the images). The audio data 104 may be audio data extracted from the video stream for the same predetermined period of time.
The computing device 120 can be configured to process the audio data 104, for example by automatic speech recognition (ASR), to determine a first text 106 corresponding to the audio data. The computing device 120 may also be configured to process the image data 114, for example by character recognition (such as optical character recognition, OCR), to determine a second text 116 corresponding to the image data 114. It will be appreciated that the first text may generally be divided into a plurality of natural sentences or phrases, and the second text may likewise be divided into a plurality of natural sentences or phrases.
Based on the first text 106 corresponding to the audio data 104 and the second text 116 corresponding to the image data 114, the computing device 120 may determine a first subject text 108 and a second subject text 118, respectively, for determining a subject 130 of the video stream 110. It is to be understood that the first subject text 108 may be composed of one or more sentences or one or more phrases in the first text 106. Similarly, the second subject text 118 may be composed of one or more sentences or one or more phrases in the second text 116.
Based on the first subject text 108 and the second subject text 118, the computing device may identify at least one subject 130. This may be accomplished, for example, by natural semantic understanding of the first subject text 108 and the second subject text 118. It is to be appreciated that the subject can generally be one or more key phrases in the first subject text 108 and the second subject text 118.
Based on the topic 130, the computing device can associate a tag related to the topic 130 with the video stream 110 for subsequent retrieval of the video stream, or for location by the tag to a particular point in time of the video stream, and so forth.
Computing device 120 may be any device with computing capabilities. By way of non-limiting example, the computing device 120 may be any type of stationary, mobile, or portable computing device, including but not limited to a desktop computer, laptop computer, notebook computer, netbook computer, tablet computer, multimedia computer, mobile phone, or the like; all or a portion of the components of computing device 120 may be distributed in the cloud. Computing device 120 contains at least a processor, memory, and other components typically found in a general purpose computer to implement computing, storage, communication, control, and the like functions.
In some embodiments, various pre-trained neural network models may be included in the computing device 120. The pre-trained neural network models include, but are not limited to, a natural semantic understanding model such as BERT (Bidirectional Encoder Representations from Transformers), an image feature extraction model such as ResNet-50 (a deep residual neural network), a sequence labeling model such as a conditional random field (CRF), and the like. These models may be used to recognize and process text or image features in the video stream. The use of the models will be described in detail below in conjunction with FIGS. 2 to 5. In some embodiments, different models may also be combined to form a combined model; e.g., BERT and CRF may be combined for labeling the audio recognition text.
Alternatively, in some embodiments, the computing device 120 may also select a suitable initial model and train it to obtain the various pre-trained models described above. The initial models include, but are not limited to, support vector machine (SVM) models, Bayesian models, random forest models, and various deep learning/neural network models such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and the like.
It should be understood that the architecture and functionality in environment 100 are described for exemplary purposes only and are not meant to imply any limitation on the scope of the disclosure. Embodiments of the present disclosure may also be applied to environments having different structures and/or functions.
A method according to an embodiment of the present disclosure will be described in detail below with reference to fig. 2 to 5. For ease of understanding, specific data mentioned in the following description are exemplary and are not intended to limit the scope of the present disclosure. For ease of description, a method according to an embodiment of the present disclosure is described below in conjunction with the exemplary environment 100 shown in FIG. 1. The methods according to embodiments of the present disclosure may be implemented in the computing device 120 shown in fig. 1 or other suitable devices, respectively. It is to be understood that methods in accordance with embodiments of the present disclosure may also include additional acts not shown and/or may omit acts shown, as the scope of the present disclosure is not limited in this respect.
Fig. 2 illustrates a flow diagram of a method 200 for analyzing a video stream, according to some embodiments of the present disclosure.
At 202, the computing device 120 may obtain the image data 114 and the audio data 104 of the video stream 110.
In particular, the computing device 120 may parse the video stream data to obtain the image data and the audio data. In some embodiments, the computing device 120 may transcode the video signals of the video stream 110 to generate video stream data in a predetermined format for such processing. In some embodiments, the video stream 110 is a real-time video stream, such as a knowledge-based live broadcast.
In some embodiments, the computing device 120 may extract video clips from the real-time video stream based on a predetermined time interval (e.g., every 3 minutes). For example, for 12 minutes of live video stream data, the computing device may perform the extraction of the video segments at 0-3 minutes, 3-6 minutes, 6-9 minutes, 9-12 minutes. Based on the video segments, the computing device may determine image data and audio data, and based thereon, determine topics associated with the video segments, respectively.
In some embodiments, the predetermined time interval may be determined based on a change in the image data of the video stream. In particular, the computing device may determine the predetermined time interval in response to detecting a change in the image data in the video stream (e.g., a change caused by the presenter of a knowledge-based live broadcast switching presentation (PPT) slides). In some embodiments, the predetermined time interval may be set to correspond to the time taken to teach, for example, one or two pages of the PPT. Other suitable techniques may also be applied to determine the image data and the audio data from the video stream 110, as the present disclosure is not limited in this respect.
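As a small sketch of the fixed-interval segmentation just described (pure Python; the default three-minute interval and the twelve-minute example are taken from the text, and the function name is only illustrative):

```python
def segment_boundaries(duration_s: float, interval_s: float = 180.0):
    """Split a stream of `duration_s` seconds into (start, end) windows of `interval_s` seconds."""
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + interval_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds

# For the 12-minute example in the text:
# [(0.0, 180.0), (180.0, 360.0), (360.0, 540.0), (540.0, 720.0)]
print(segment_boundaries(720.0))
```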
In this manner, the computing device may make real-time theme determinations for different time periods of a real-time video stream to facilitate subsequent utilization of the video stream.
At 204, the computing device 120 may determine, from the first text 106 corresponding to the audio data 104, a first subject text 108 that is capable of identifying a subject of the first text 106.
A specific process of determining the first subject text is described in detail below with reference to fig. 3. Fig. 3 illustrates a flow diagram of a method 300 for determining a first subject text, according to some embodiments of the present disclosure.
At 302, computing device 120 may perform speech recognition processing on audio data 104 to obtain first text 106.
In some embodiments, the computing device 120 may also process the audio data to remove silent or paused segments therefrom, and perform audio-to-text conversion on the processed audio data. It will be appreciated that the speech recognition process may be performed using a variety of conventional speech recognition techniques. It is understood that a plurality of natural sentences may be included in the first text obtained from the audio data.
At 304, the computing device 120 may label the first text 106 based on the natural semantic understanding model to determine a labeled first text.
In particular, the computing device 120 may apply a semantic understanding model, such as one based on BERT or a bidirectional long short-term memory (Bi-LSTM) network, to the first text obtained above, to obtain a feature representation (e.g., a feature vector) of the first text. The computing device 120 may then apply a sequence labeling model, such as a CRF, to the feature representation of the first text, for example employing the BIO labeling scheme to label each character in the first text, where "B" marks the beginning of the first subject text, "I" marks a character inside the first subject text, and "O" marks content that does not belong to the first subject text.
In one example, for audio data containing the natural sentence "今天我们讲因式分解" ("today we will talk about factorization"), the computing device may utilize the models described above to label each character of "今天我们讲" ("today we will talk about") as "O", label the first character of "因式分解" ("factorization") as "B", and label each of its remaining characters as "I".
Accordingly, at 306, the computing device 120 may determine the first subject text 108 based on the labeled first text. For example, in the above example employing the BIO labeling scheme, the computing device may determine that the first subject text is "factorization" based on the characters labeled "B" and "I".
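A self-contained sketch of decoding the BIO labels into the first subject text is given below (pure Python; the labels themselves would come from the BERT/Bi-LSTM and CRF style models named above, which are not reproduced here, and the example sentence is the one used in the text):

```python
def decode_bio(chars, tags):
    """Collect spans labeled B/I into subject-text strings."""
    spans, current = [], []
    for ch, tag in zip(chars, tags):
        if tag == "B":                 # beginning of a subject-text span
            if current:
                spans.append("".join(current))
            current = [ch]
        elif tag == "I" and current:   # inside the current span
            current.append(ch)
        else:                          # "O": outside any subject text
            if current:
                spans.append("".join(current))
            current = []
    if current:
        spans.append("".join(current))
    return spans

# The example from the text: "今天我们讲因式分解" ("today we will talk about factorization")
chars = list("今天我们讲因式分解")
tags = ["O", "O", "O", "O", "O", "B", "I", "I", "I"]
print(decode_bio(chars, tags))   # ['因式分解'] -> "factorization"
```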
In this way, the computing device can understand the information included in the audio data to extract information related to the subject of the video stream therein as subject text.
In some embodiments, for a semantic understanding model based on a pre-trained model such as BERT, the computing device may train the pre-trained model with a small number of labeled training samples (e.g., texts corresponding to 3000 audio clips) to fine-tune its parameters. For example, text converted from the audio data of a knowledge-based live video stream will typically include some phrasing that cues the subject text, such as "today we share …", "today we will talk about …", "next we will discuss …", and so on. Such features included in the training samples can therefore be labeled, and the semantic understanding model can be trained with the labeled training samples, so that the semantic understanding model can more accurately identify the subject text in the first text corresponding to the audio data of an input video stream.
Referring back to fig. 2, at 206, the computing device 120 may determine a second subject text that is capable of identifying a subject of the second text from the second text corresponding to the image data using the image feature representation of the image data.
Fig. 4 illustrates a flow diagram of a method 400 for determining a second subject text, in accordance with some embodiments of the present disclosure.
At 402, computing device 120 may perform an optical character recognition process on image data 114 to obtain second text 116.
In particular, the computing device may extract key frames from the video stream at a predetermined frequency to determine one or more frames of images as the image data 114 for acquisition of the second text 116. For example, when the video stream within the predetermined time interval includes two PPT pages, the computing device may extract two frames of images respectively corresponding to the two PPT pages for acquisition of the second text 116.
It will be appreciated that the optical character recognition process can be performed using a variety of conventional OCR techniques. It is to be understood that one or more natural sentences or phrases may be included in the second text obtained from the image data.
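As one hedged, illustrative way to realize key-frame extraction and optical character recognition — not the specific implementation of this disclosure — OpenCV can be used to sample frames and Tesseract (via pytesseract, with the chi_sim language data) can serve as a conventional OCR engine:

```python
import cv2          # assumed: OpenCV for frame extraction
import pytesseract  # assumed: Tesseract as a conventional OCR engine

def key_frame_texts(video_path: str, every_s: float = 10.0, lang: str = "chi_sim"):
    """Sample one frame every `every_s` seconds and run OCR on it."""
    cap = cv2.VideoCapture(video_path)
    texts, t = [], 0.0
    while True:
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000.0)   # seek to timestamp
        ok, frame = cap.read()
        if not ok:
            break
        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        texts.append(pytesseract.image_to_string(frame_rgb, lang=lang))
        t += every_s
    cap.release()
    return texts
```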
Referring back to fig. 4, at 404, computing device 120 may perform an object detection process on image data 114 to determine at least one location of at least one sub-text included in second text 116.
At 406, the computing device 120 may determine the second subject text 118 from the at least one sub-text based on the at least one location and utilizing the image feature representation.
The process of steps 404 and 406 will be described in detail with reference to fig. 5. Fig. 5 illustrates a schematic diagram 500 for determining a second subject text, according to some embodiments of the present disclosure.
In particular, based on the natural semantic understanding model, the computing device 120 may determine at least one sub-text feature representation corresponding to the at least one sub-text. In the specific example shown in fig. 5, the image data 514 includes one page of a PPT image, from which optical character recognition processing can obtain three natural sentences (i.e., three sub-texts 534): "conversion of kinetic energy and gravitational potential energy", "1. What kinds of energy do a rolling pendulum and a simple pendulum have when in motion?", and "2. How are kinetic energy and gravitational potential energy converted into each other?". For the sub-texts 534, the computing device may apply a natural semantic understanding model 536 to obtain their corresponding sub-text feature representations 538. The process of processing the sub-texts included in the second text corresponding to the image data with the natural semantic understanding model 536 to obtain feature representations is similar to the process, described above, of processing the first text corresponding to the audio data with a natural semantic understanding model to obtain a feature representation, and is not repeated here.
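A minimal sketch of computing a sub-text feature representation with a BERT-style encoder follows, using the Hugging Face transformers library as an assumed stand-in for the natural semantic understanding model 536; the [CLS] vector of the last hidden layer is taken as the sentence-level feature, and the example string is a back-translation of the FIG. 5 title sub-text.

```python
import torch
from transformers import AutoTokenizer, AutoModel   # assumed library choice

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")

def sub_text_feature(sub_text: str) -> torch.Tensor:
    """Return a (1, hidden_size) feature vector for one sub-text."""
    inputs = tokenizer(sub_text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0]   # [CLS] token embedding

# e.g. the title sub-text from FIG. 5 ("conversion of kinetic and gravitational potential energy")
feat = sub_text_feature("动能和重力势能的转化")
print(feat.shape)   # torch.Size([1, 768])
```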
With the object detection processing, the computing device may acquire three text boxes corresponding to the three sub-texts, and acquire the three positions (also sometimes referred to herein as position data) of the three boxes, respectively. A position may be represented, for example, using coordinates: for a given box, with the lower left corner of the image taken as the origin, the coordinates of a given point of the box (e.g., the lower left corner, the upper right corner, or the center point of the box) relative to the origin may be determined, and based on these coordinates the position data 524 of the box is identified.
The computing device 120 can determine, based on the at least one position, at least one region of interest occupied by the at least one sub-text in the image data. For example, taking "conversion of kinetic energy and gravitational potential energy" as an example, the computing device may determine the space of the image occupied by this sub-text, together with a surrounding region of a predetermined range (e.g., at least including its underline), as a region of interest (ROI).
The computing device 120 may determine at least one first feature representation for the region of interest from the image feature representations based on the at least one region of interest.
In particular, the computing device may apply an image feature extraction model 516, such as ResNet-50, to the image data 514 (e.g., in its entirety) to obtain an image feature representation 518, such as a feature map, corresponding to features in the image data 514. It will be appreciated that the feature map may characterize the features of each of a plurality of region units of the image data. The computing device may perform dimensionality reduction on the image feature representation 518 based on the location (e.g., coordinates) of the region of interest, e.g., by means of ROI pooling 526, to obtain a first feature representation 528 for the region of interest.
For example, also taking the sub-text "conversion of kinetic and gravitational potential energy" as an example, the computing device may extract a first feature representation 528 corresponding to a region of interest (ROI) from the image feature representations 518 corresponding to the entirety of the image data using the ROI corresponding to the sub-text. For other sub-texts included in the image data, the computing device may perform similar processing to obtain other corresponding first feature representation(s), respectively.
In some embodiments, the image feature representation 518 may also be processed by the computing device, such as maximum pooling or average pooling, to obtain a feature representation of suitable dimensions for subsequent processing.
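The feature-map and ROI processing described above could be sketched as follows, using torchvision as an assumed implementation: a ResNet-50 backbone without its classification head produces the image feature representation, ROI align (a close relative of ROI pooling) reduces the region of interest to a fixed-size first feature representation, and average pooling yields a compact global image feature. The box coordinates are illustrative only.

```python
import torch
import torch.nn as nn
import torchvision
from torchvision.ops import roi_align

# ResNet-50 backbone without its classification head -> feature map (B, 2048, H/32, W/32)
resnet = torchvision.models.resnet50(weights=None)   # weights=None: random init for the sketch
backbone = nn.Sequential(*list(resnet.children())[:-2]).eval()

image = torch.randn(1, 3, 720, 1280)          # a decoded key frame (placeholder tensor)
with torch.no_grad():
    feature_map = backbone(image)             # image feature representation 518

# ROI box in input-image pixel coordinates: (batch_index, x1, y1, x2, y2)
roi = torch.tensor([[0.0, 100.0, 40.0, 900.0, 120.0]])
# spatial_scale maps pixel coordinates onto the downsampled feature map (stride 32)
roi_feature = roi_align(feature_map, roi, output_size=(1, 1), spatial_scale=1.0 / 32)
first_feature = roi_feature.flatten(1)        # first feature representation 528, shape (1, 2048)

global_feature = feature_map.mean(dim=(2, 3)) # average-pooled image feature, shape (1, 2048)
print(first_feature.shape, global_feature.shape)
```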
Based on the at least one first feature representation and the image feature representation, the computing device 120 may determine a second subject text from the at least one sub-text.
For example, in the specific example shown in fig. 5, "conversion of kinetic energy and gravitational potential energy" is clearly a natural sentence containing a subject, and the features of its corresponding image region (e.g., the sub-text may have at least one of a bold font, an underline, a large font size, and a small leading indent) also differ from the features of other regions of the image (e.g., the other sub-texts "1. What kinds of energy do a rolling pendulum and a simple pendulum have when in motion?" and "2. How are kinetic energy and gravitational potential energy converted into each other?" may have at least one of a non-bold font, no underline, a small font size, and a larger leading indent). Accordingly, for the image data corresponding to the frame of image, the computing device may utilize such image features to determine the second subject text from the at least one sub-text.
In some embodiments, semantics of the respective sub-texts included in the second text may also be considered in combination to more accurately identify the second subject text. Accordingly, the computing device 120 may determine the second subject text using the at least one sub-text feature representation, the at least one first feature representation, and the image feature representation.
In particular, for a given sub-text such as "conversion of kinetic energy and gravitational potential energy", the computing device 120 may perform feature fusion 540 (e.g., concatenating the plurality of feature representations) on the sub-text feature representation 538 of the sub-text (e.g., in the form of a feature vector, which may characterize the semantics of the sub-text), the first feature representation 528 (e.g., in the form of a feature vector), and the image feature representation 518 (e.g., in the form of a feature vector), to obtain a fused feature representation (e.g., a feature vector) characterizing the sub-text. The fused feature vector of the sub-text may thus carry feature information about the textual semantics of the sub-text, the image features of the region of interest corresponding to the sub-text (e.g., a bold font, an underline, a large font size, a small leading indent, and the like), and the image features of the image data as a whole (e.g., features of the regions occupied by the other sub-texts in the image).
The computing device 120 may feed the fused feature representation into the classifier 550 to determine whether the sub-text is the second subject text. For example, for the sub-text "conversion of kinetic energy and gravitational potential energy", the computing device may classify it as belonging to the subject text, while for the sub-texts "1. What kinds of energy do a rolling pendulum and a simple pendulum have when in motion?" and "2. How are kinetic energy and gravitational potential energy converted into each other?", the computing device may classify them as not belonging to the subject text. Although a classifier for a classification task is used above as one specific example of determining whether a sub-text belongs to the subject text, it is understood that other techniques capable of determining, based on a feature representation, whether a sub-text belongs to the subject text may also be applied here.
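The fusion-and-classification step could look like the sketch below; the layer sizes and the two-class output are illustrative assumptions, since the disclosure only requires that the three feature representations be fused (e.g., concatenated) and fed to a classifier.

```python
import torch
import torch.nn as nn

class SubjectTextClassifier(nn.Module):
    """Concatenate sub-text, ROI and global image features; predict subject / not subject."""
    def __init__(self, text_dim=768, roi_dim=2048, img_dim=2048):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + roi_dim + img_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 2),          # two classes: subject text / not subject text
        )

    def forward(self, text_feat, roi_feat, img_feat):
        fused = torch.cat([text_feat, roi_feat, img_feat], dim=-1)   # feature fusion 540
        return self.head(fused)

clf = SubjectTextClassifier()
logits = clf(torch.randn(1, 768), torch.randn(1, 2048), torch.randn(1, 2048))
print(logits.softmax(-1))   # probability of belonging to the subject text
```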
In this way, the computing device can understand a variety of information included in the image data to extract information related to the subject of the video stream therein as subject text.
Referring back to fig. 2, at 208, the computing device 120 may determine a topic of the video stream based on the first topic text and the second topic text.
In particular, since the video stream 110 generally consists of images, audio, text, and other elements, in order to understand the video content more accurately and determine the topic, it is necessary to fuse these different kinds of information to obtain a better recognition result for the topic. The computing device may combine the first subject text and the second subject text to obtain a combined subject text, and extract at least one key phrase from the combined subject text as the topic. The determination of the key phrases may also employ a natural semantic understanding model. In some embodiments, at least one key phrase may instead be extracted from the first subject text and from the second subject text, respectively, as the topic. In other embodiments, the number of key phrases may be one; the disclosure is not limited in this respect.
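As one simple, assumed stand-in for the key-phrase extraction described here (the disclosure itself allows a natural semantic understanding model for this step), TF-IDF keyword extraction over the combined subject text can be illustrated with the jieba library for Chinese text:

```python
import jieba.analyse   # assumed library; any key-phrase extractor could be substituted

first_subject_text = "因式分解"              # from the audio branch ("factorization")
second_subject_text = "动能和重力势能的转化"   # from the image branch (FIG. 5 title)

combined = first_subject_text + " " + second_subject_text
topics = jieba.analyse.extract_tags(combined, topK=3)   # TF-IDF key phrases as the topic tags
print(topics)
```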
According to embodiments of the present disclosure, text features included in images, image features, and features included in audio, for example, may be considered together to more accurately understand a video stream to determine a topic indicative of the video stream (e.g., one or more video segments therein). Subsequent tagging of the video stream (e.g., multiple time periods) with the determined topic, retrieval of the video stream with the topic, locating a particular point in time in the video stream, etc. may thereby be facilitated.
Fig. 6 shows a schematic block diagram of an apparatus 600 for analyzing a video stream according to an embodiment of the present disclosure. As shown in fig. 6, the apparatus 600 includes a data acquisition module 602 configured to acquire image data and audio data of a video stream. The apparatus 600 further comprises a first subject text determination module 604 configured to determine, from a first text corresponding to the audio data, a first subject text capable of identifying a subject of the first text. The apparatus 600 includes a second subject text determination module 606 that determines a second subject text that identifies a subject of the second text from the second text corresponding to the image data using the image feature representation of the image data. The apparatus 600 further comprises a topic determination module 608 configured to determine a topic of the video stream based on the first topic text and the second topic text.
In some embodiments, the data acquisition module 602 includes: a video extraction sub-module configured to extract video segments from a real-time video stream based on a predetermined time interval; and an image and audio data determination sub-module configured to determine image data and audio data based on the video clip.
In some embodiments, the first topic text determination module 604 includes: a voice recognition sub-module configured to perform voice recognition processing on the audio data to obtain a first text; a semantic understanding sub-module configured to label the first text based on a natural semantic understanding model to determine a labeled first text; and a first subject text determination sub-module configured to determine a first subject text based on the annotated first text.
In some embodiments, the second subject text determination module 606 includes: an optical character recognition sub-module configured to perform optical character recognition processing on the image data to acquire a second text; a text target detection sub-module configured to perform target detection processing on the image data to determine at least one position of at least one sub-text included in the second text; and a second subject text determination sub-module configured to determine a second subject text from the at least one sub-text based on the at least one location and using the image feature representation.
In some embodiments, the second subject text determination sub-module comprises: a region-of-interest determining unit configured to determine at least one region of interest occupied by the at least one sub-text in the image data based on the at least one position; a first feature representation determination unit configured to determine at least one first feature representation for the region of interest from the image feature representation on the basis of the at least one region of interest; and a second subject text determination unit configured to determine a second subject text from the at least one sub-text based on the at least one first feature representation and the image feature representation.
In some embodiments, the second subject text determination unit includes: a sub-text feature representation sub-unit configured to determine at least one sub-text feature representation corresponding to the at least one sub-text based on a natural semantic understanding model; and a second subject text determination subunit configured to determine a second subject text using the at least one sub-text feature representation, the at least one first feature representation, and the image feature representation.
In some embodiments, the topic determination module 608 includes: a subject text combining sub-module configured to combine the first subject text and the second subject text to obtain a combined subject text; and a key phrase extraction module configured to extract at least one key phrase from the combined topic text as a topic.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a read-only memory (ROM) 702 or a computer program loaded from a storage unit 708 into a random access memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various specialized artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 performs the various methods and processes described above, such as the methods 200, 300, and 400. For example, in some embodiments, any of the methods 200, 300, and 400 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of a computer program may be loaded onto and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of any of the methods 200, 300, and 400 described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured by any other suitable means (e.g., by means of firmware) to perform any of the methods 200, 300, and 400.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chips (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and addresses the drawbacks of high management difficulty and weak service scalability in traditional physical hosts and virtual private server (VPS) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (17)

1. A method for analyzing a video stream, comprising:
acquiring image data and audio data of the video stream;
determining a first subject text capable of identifying a subject of the first text according to the first text corresponding to the audio data;
determining, from a second text corresponding to the image data, a second subject text capable of identifying a subject of the second text, using the image feature representation of the image data; and
determining the topic of the video stream based on the first topic text and the second topic text.
2. The method of claim 1, wherein acquiring the image data and the audio data comprises:
extracting video segments from the real-time video stream based on a predetermined time interval; and
based on the video segment, the image data and the audio data are determined.
3. The method of claim 1, wherein determining the first subject text comprises:
performing voice recognition processing on the audio data to obtain the first text;
labeling the first text based on a natural semantic understanding model to determine the labeled first text; and
determining the first subject text based on the annotated first text.
4. The method of claim 1, wherein determining a second subject text that can identify a subject of the second text comprises:
performing optical character recognition processing on the image data to acquire the second text;
performing object detection processing on the image data to determine at least one position of at least one sub-text included in the second text;
determining the second subject text from the at least one sub-text based on the at least one location and with the image feature representation.
5. The method of claim 4, wherein determining the second subject text from the at least one sub-text based on the at least one location and utilizing the image feature representation comprises:
determining, based on the at least one location, at least one region of interest occupied by the at least one sub-text in the image data;
determining at least one first feature representation for the region of interest from the image feature representations based on the at least one region of interest; and
determining the second subject text from the at least one sub-text based on the at least one first feature representation and the image feature representation.
6. The method of claim 5, wherein determining the second subject text from the at least one sub-text comprises:
determining at least one sub-text feature representation corresponding to the at least one sub-text based on a natural semantic understanding model; and
determining the second subject text using the at least one sub-text feature representation, the at least one first feature representation, and the image feature representation.
7. The method of claim 1, wherein determining the subject matter of the video stream comprises:
combining the first subject text and the second subject text to obtain a combined subject text; and
extracting at least one key phrase from the combined topic text as the topic.
8. An apparatus for analyzing a video stream, comprising:
a data acquisition module configured to acquire image data and audio data of the video stream;
a first subject text determination module configured to determine, from a first text corresponding to the audio data, a first subject text capable of identifying a subject of the first text;
a second subject text determination module configured to determine, from a second text corresponding to the image data, a second subject text capable of identifying a subject of the second text, using an image feature representation of the image data; and
a topic determination module configured to determine the topic of the video stream based on the first subject text and the second subject text.
9. The apparatus of claim 8, wherein the data acquisition module comprises:
a video extraction sub-module configured to extract a video segment from the real-time video stream based on a predetermined time interval; and
an image and audio data determination sub-module configured to determine the image data and the audio data based on the video segment.
10. The apparatus of claim 8, wherein the first subject text determination module comprises:
a voice recognition sub-module configured to perform voice recognition processing on the audio data to obtain the first text;
a semantic understanding sub-module configured to label the first text based on a natural semantic understanding model to determine a labeled first text; and
a first subject text determination sub-module configured to determine the first subject text based on the labeled first text.
11. The apparatus of claim 8, wherein the second subject text determination module comprises:
an optical character recognition sub-module configured to perform optical character recognition processing on the image data to acquire the second text;
a text object detection sub-module configured to perform object detection processing on the image data to determine at least one position of at least one sub-text included in the second text; and
a second subject text determination sub-module configured to determine the second subject text from the at least one sub-text based on the at least one position and using the image feature representation.
12. The apparatus of claim 11, wherein the second subject text determination sub-module comprises:
a region-of-interest determination unit configured to determine, based on the at least one position, at least one region of interest occupied by the at least one sub-text in the image data;
a first feature representation determination unit configured to determine, based on the at least one region of interest, at least one first feature representation for the at least one region of interest from the image feature representation; and
a second subject text determination unit configured to determine the second subject text from the at least one sub-text based on the at least one first feature representation and the image feature representation.
13. The apparatus of claim 12, wherein the second subject text determination unit comprises:
a sub-text feature representation sub-unit configured to determine at least one sub-text feature representation corresponding to the at least one sub-text based on a natural semantic understanding model; and
a second subject text determination subunit configured to determine the second subject text using the at least one sub-text feature representation, the at least one first feature representation, and the image feature representation.
14. The apparatus of claim 8, wherein the topic determination module comprises:
a subject text combining sub-module configured to combine the first subject text and the second subject text to obtain a combined subject text; and
a key phrase extraction sub-module configured to extract at least one key phrase from the combined subject text as the topic.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.
CN202110089228.2A 2021-01-22 2021-01-22 Method and apparatus for analyzing video stream Pending CN112822506A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110089228.2A CN112822506A (en) 2021-01-22 2021-01-22 Method and apparatus for analyzing video stream

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110089228.2A CN112822506A (en) 2021-01-22 2021-01-22 Method and apparatus for analyzing video stream

Publications (1)

Publication Number Publication Date
CN112822506A (en) 2021-05-18

Family

ID=75858903

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110089228.2A Pending CN112822506A (en) 2021-01-22 2021-01-22 Method and apparatus for analyzing video stream

Country Status (1)

Country Link
CN (1) CN112822506A (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070055695A1 (en) * 2005-08-24 2007-03-08 International Business Machines Corporation System and method for semantic video segmentation based on joint audiovisual and text analysis
US20080066136A1 (en) * 2006-08-24 2008-03-13 International Business Machines Corporation System and method for detecting topic shift boundaries in multimedia streams using joint audio, visual and text cues
WO2018040059A1 (en) * 2016-09-02 2018-03-08 Microsoft Technology Licensing, Llc Clip content categorization
CN108121715A (en) * 2016-11-28 2018-06-05 ***通信集团公司 A kind of word tag method and word tag device
CN109325148A (en) * 2018-08-03 2019-02-12 百度在线网络技术(北京)有限公司 The method and apparatus for generating information
CN109241330A (en) * 2018-08-20 2019-01-18 北京百度网讯科技有限公司 The method, apparatus, equipment and medium of key phrase in audio for identification
US20200125671A1 (en) * 2018-10-17 2020-04-23 International Business Machines Corporation Altering content based on machine-learned topics of interest
CN110688526A (en) * 2019-11-07 2020-01-14 山东舜网传媒股份有限公司 Short video recommendation method and system based on key frame identification and audio textualization
CN110929094A (en) * 2019-11-20 2020-03-27 北京香侬慧语科技有限责任公司 Video title processing method and device
CN111259215A (en) * 2020-02-14 2020-06-09 北京百度网讯科技有限公司 Multi-modal-based topic classification method, device, equipment and storage medium
CN111445902A (en) * 2020-03-27 2020-07-24 北京字节跳动网络技术有限公司 Data collection method and device, storage medium and electronic equipment
CN111526382A (en) * 2020-04-20 2020-08-11 广东小天才科技有限公司 Live video text generation method, device, equipment and storage medium
CN112149632A (en) * 2020-10-21 2020-12-29 腾讯科技(深圳)有限公司 Video identification method and device and electronic equipment

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113590804A (en) * 2021-06-23 2021-11-02 北京百度网讯科技有限公司 Video theme generation method and device and electronic equipment
CN113590804B (en) * 2021-06-23 2023-08-04 北京百度网讯科技有限公司 Video theme generation method and device and electronic equipment
EP4178205A1 (en) * 2021-12-23 2023-05-10 Beijing Baidu Netcom Science Technology Co., Ltd. Method and apparatus for classifying video using neural networks and spatio-temporal features

Similar Documents

Publication Title
CN112685565B (en) Text classification method based on multi-mode information fusion and related equipment thereof
CN108985358B (en) Emotion recognition method, device, equipment and storage medium
CN114578969B (en) Method, apparatus, device and medium for man-machine interaction
US10114809B2 (en) Method and apparatus for phonetically annotating text
CN113159010B (en) Video classification method, device, equipment and storage medium
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
CN110232340B (en) Method and device for establishing video classification model and video classification
CN109474847B (en) Search method, device and equipment based on video barrage content and storage medium
US11856277B2 (en) Method and apparatus for processing video, electronic device, medium and product
CN111191428B (en) Comment information processing method and device, computer equipment and medium
CN115982376B (en) Method and device for training model based on text, multimode data and knowledge
CN107861948B (en) Label extraction method, device, equipment and medium
US20230089308A1 (en) Speaker-Turn-Based Online Speaker Diarization with Constrained Spectral Clustering
US11532333B1 (en) Smart summarization, indexing, and post-processing for recorded document presentation
CN112822506A (en) Method and apparatus for analyzing video stream
CN113806588A (en) Method and device for searching video
CN115359383A (en) Cross-modal feature extraction, retrieval and model training method, device and medium
CN110991175A (en) Text generation method, system, device and storage medium under multiple modes
CN115099239B (en) Resource identification method, device, equipment and storage medium
CN111126084B (en) Data processing method, device, electronic equipment and storage medium
CN114817478A (en) Text-based question and answer method and device, computer equipment and storage medium
CN113096687B (en) Audio and video processing method and device, computer equipment and storage medium
CN112542163B (en) Intelligent voice interaction method, device and storage medium
CN110276001B (en) Checking page identification method and device, computing equipment and medium
CN112951274A (en) Voice similarity determination method and device, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210518)