CN113326387A - Intelligent conference information retrieval method - Google Patents
- Publication number
- CN113326387A (application CN202110603641.6A)
- Authority
- CN
- China
- Prior art keywords
- information
- conference
- text
- video
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/48—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/483—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/43—Querying
- G06F16/438—Presentation of query results
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Library & Information Science (AREA)
- Multimedia (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
Abstract
The invention discloses an intelligent conference information retrieval method in the technical field of conference recording, comprising the following steps: recording conference information in multimedia form in real time throughout the conference; extracting the audio stream from the conference video content and sending it to a speech recognition module, which converts the speech into text information; storing the text and marking it according to the conference progress time; and accepting text or voice input as a query, matching it against the previously stored conference information, and returning the corresponding audio or video information. By recording conference information in multimedia form and matching queries against it through multi-level processing, the invention displays the timeline position of each matched record and simultaneously plays the audio so that the user can hear the corresponding conference speech directly, making later analysis and understanding of the conference more convenient and greatly improving the conference-record retrieval experience.
Description
Technical Field
The invention relates to the technical field of conference recording, in particular to an intelligent conference information retrieval method.
Background
As technology advances, many products that automatically record conference content have been launched, from early voice recorders to automated speech-to-text equipment. These methods capture large amounts of content, often spanning several hours, so reviewing or searching a conference record costs considerable time and effort. Some advanced products label conference participants by biometric features such as voiceprints and fingerprints, or even by geographic information and administrative rank, and then locate recorded content quickly through these labels. However, they remain inconvenient: the conference records cannot be queried by content, the query modes are limited, and the recordings can only be replayed and reviewed manually rather than located quickly.
Disclosure of Invention
The invention aims to provide an intelligent conference information retrieval method to overcome the defects in the prior art.
In order to achieve the above purpose, the invention provides the following technical scheme: an intelligent conference information retrieval method comprises the following steps:
step one, recording conference information in multimedia form in real time throughout the conference, including archiving the complete video, audio, text and other forms of the conference;
step two, extracting the audio stream from the conference video content: using demultiplexing (demux), the audio stream is copied out of the media file or stream container and sent to a speech recognition module, which converts the speech into text information and stores it, while the original video file is left unchanged;
step three, marking the video, audio and text of the conference record according to conference time, using speech detection or silence detection as the basis for judging segment start and end, and further applying NLP (natural language processing) context techniques, including but not limited to SBD (sentence boundary detection) and finer-grained WS (word segmentation), to process the speech content sentence by sentence or word by word; the processed conference record content is then marked by sentence and by word and stored;
step four, the user searches the conference record by entering a text or voice query; if voice is received, it is converted to text by the speech-to-text module; the text is then matched against the previously stored conference information, and the corresponding audio or video information is returned together with the text converted from the speech;
and step five, when viewing the returned result, the user can quickly retrieve the surrounding context, i.e. view the conference information before and after the retrieved time period; the matched content is highlighted in the displayed text, audio or video, and the user can intuitively locate, select and modify the corresponding content.
Preferably, in step one, if the conference is a network video conference, the conference information is obtained directly through the network; if it is a non-network conference, it is recorded through multimedia devices such as audio and video recorders, then extracted and converted.
Preferably, the text information converted from speech in step two can also be used to display and record real-time conference subtitles while being stored.
Preferably, the time intervals marked in step three are delimited by sentences or by pauses in the audio containing the speech content.
Preferably, the marked video segments, audio segments and text segments in step three are stored in one-to-one correspondence in time-sequenced tables: the video segments are recorded in chronological order in a list VSRL (Video Segments Recording List), the audio segments in a list SSRL (Speech Segments Recording List), and the text segment information in a list TSRL (Text Segments Recording List).
Preferably, the matching process in the fourth step includes the following steps:
step a, first-level text matching: the text generated from the user's search is matched against the text stored in the TSRL; if a match is found, the audio information of the corresponding time period is returned, and if corresponding video information exists, the video of that time period is returned directly;
step b, second-level text matching: if the first level fails to match, the text is reduced to finer granularity through SBD and matched again; if a match is found, the corresponding audio or video information is returned;
and step c, third-level processing: if the second level fails to match, the information is decomposed into still finer granularity through WS and matched again; if a match is found, the corresponding audio or video information is returned; otherwise the query information cannot be matched.
In the technical scheme, the invention provides the following technical effects and advantages:
the invention records the meeting information in a multimedia mode, marks and stores the video, audio and text recorded by the meeting according to the meeting running time, and users search and match through the text information and perform multi-stage processing, matching and inquiring the retrieval information and the conference information, when the conference record is matched, displaying the information of the time axis where the corresponding record is located, the user can select the text information through the interactive equipment, the corresponding audio is highlighted, and simultaneously, the audio information which enables the user to intuitively hear the speaking of the current conference is played, the user can randomly select any paragraph in the text module, the corresponding audio or video will be positioned and played synchronously, otherwise, the user will quickly retrieve the audio or video content, corresponding text information can also be displayed immediately, so that the later analysis and understanding of the conference are more convenient, and the experience of conference record retrieval is greatly improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a flow chart of a query matching process of the present invention.
FIG. 3 is an exemplary diagram of an interaction interface when the present invention returns a result.
FIG. 4 is another exemplary diagram of the interaction interface for the case where only audio and text information are returned as a result according to the present invention.
FIG. 5 is an exemplary diagram of an interface for a user to select a query message status in the state of FIG. 4 according to the present invention.
Description of reference numerals:
A. video information display module; B. timeline video clip display module; C. text information display module; D. audio information display module; E. time position display module.
Detailed Description
In order to make the technical solutions of the present invention better understood, the present invention is described in further detail below with reference to the accompanying drawings.
The invention provides an intelligent conference information retrieval method, which comprises the following steps:
step one, recording conference information: the whole conference is recorded in multimedia form in real time and archived as complete video, audio, text and other forms; if the conference is a network video conference, the conference information is obtained directly through the network, and if it is a non-network conference, it is recorded through multimedia devices such as audio and video recorders, then extracted and converted;
step two, extracting the audio stream from the conference video content: using demultiplexing (demux), the audio stream is copied out of the media file or stream container and sent to a speech recognition module, which converts the speech into text information and stores it, while the original video file is left unchanged; the converted text can also be used to display and record real-time conference subtitles;
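The demultiplexing in step two copies the audio stream without re-encoding, so the original video file is untouched. As an illustrative sketch only (the patent names no specific tool; ffmpeg and the filenames here are assumptions), the demux command might be assembled like this:

```python
# Hypothetical sketch: assemble an ffmpeg command that demuxes the audio
# stream from a recorded conference video. "-acodec copy" copies the audio
# bit-for-bit (a pure demux, no transcoding) and "-vn" drops the video
# stream, so the original video file is left unchanged.
def build_demux_command(video_path: str, audio_path: str) -> list:
    return [
        "ffmpeg",
        "-i", video_path,    # input media container (e.g. MP4)
        "-vn",               # no video in the output
        "-acodec", "copy",   # copy the existing audio stream as-is
        audio_path,          # extracted audio, fed on to speech recognition
    ]

cmd = build_demux_command("meeting.mp4", "meeting.aac")
```

The assembled command would then be run with, e.g., subprocess.run(cmd), and the extracted audio file passed to the speech recognition module.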
step three, marking the video, audio and text of the conference record according to conference time, using speech detection or silence detection as the basis for judging segment start and end, with the marked time intervals delimited by sentences or by pauses in the audio containing the speech content; NLP (natural language processing) context techniques, including but not limited to SBD (sentence boundary detection) and finer-grained WS (word segmentation), are further applied to process the speech content sentence by sentence or word by word, and the processed conference record content is marked by sentence and by word and stored;
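The silence-detection basis for segment boundaries in step three can be illustrated with a minimal sketch. The threshold and minimum-pause length below are illustrative assumptions, not values from the patent; a real system would combine this with the NLP-based sentence and word boundaries described above:

```python
# Minimal sketch of energy-based silence detection: split a stream of
# per-frame energy values into speech segments wherever the energy stays
# below a threshold for at least min_gap consecutive frames. Returns
# (start, end) frame-index pairs, end exclusive of the closing pause.
def segment_by_silence(energies, threshold=0.1, min_gap=3):
    segments, start, gap = [], None, 0
    for i, e in enumerate(energies):
        if e >= threshold:
            if start is None:
                start = i              # speech begins
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:         # pause long enough: close the segment
                segments.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:              # speech ran to the end of the stream
        segments.append((start, len(energies)))
    return segments
```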
the marked video segments, audio segments and text segments are stored in one-to-one correspondence in time-sequenced tables: the video segments are recorded chronologically in a list VSRL (Video Segments Recording List), the audio segments in a list SSRL (Speech Segments Recording List), and the text segment information in a list TSRL (Text Segments Recording List); the structures of the VSRL, SSRL and TSRL are shown in Table 1, Table 2 and Table 3 respectively:
TABLE 1 VSRL example
Sequence No. | Time Offset | Duration | SegmentsURL |
---|---|---|---|
0 | 00:00:00.000 | 1000 | VS001.mp4 |
1 | 00:00:01.000 | 1000 | VS002.mp4 |
2 | 00:00:02.000 | 1500 | VS003.mp4 |
… | … | … | … |
Wherein:
Sequence No. represents the mark sequence number, the unique key of the mark relation table, corresponding one-to-one with the SSRL and TSRL;
Time Offset represents the offset from the beginning of the entire video to the current segment;
Duration represents the time length of the current segment in milliseconds (ms);
SegmentsURL represents the URL of the video file storing the current segment; a streaming media player can play the corresponding video directly from this URL; in actual use, the address should be further encrypted to improve data security.
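The patent only states that the segment address "should be further encrypted"; it does not specify a scheme. One common approach, shown here purely as an assumption, is to sign each SegmentsURL with an HMAC token so the address handed to the player cannot be forged or altered:

```python
import hashlib
import hmac

# Hypothetical sketch: protect a SegmentsURL with an expiring HMAC
# signature. The secret key, parameter names and scheme are illustrative
# assumptions, not the patent's method.
def sign_segment_url(url: str, expires: int, secret: bytes) -> str:
    msg = f"{url}|{expires}".encode()
    token = hmac.new(secret, msg, hashlib.sha256).hexdigest()
    return f"{url}?expires={expires}&token={token}"

def verify_segment_url(signed: str, secret: bytes) -> bool:
    url, query = signed.split("?", 1)
    params = dict(p.split("=") for p in query.split("&"))
    msg = f"{url}|{params['expires']}".encode()
    expected = hmac.new(secret, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, params["token"])
```

A server would additionally check that the expires timestamp has not passed before serving the segment.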
TABLE 2 SSRL examples
Sequence No. | Time Offset | Duration | SegmentsURL |
---|---|---|---|
0 | 00:00:00.000 | 1000 | SS001.wav |
1 | 00:00:01.000 | 1000 | SS002.wav |
2 | 00:00:02.000 | 1500 | SS003.wav |
… | … | … | … |
Wherein:
Sequence No. represents the mark sequence number, the same as in the VSRL;
Time Offset represents the offset from the beginning of the entire video to the current segment;
Duration represents the time length of the current segment in milliseconds (ms);
SegmentsURL represents the URL of the audio file storing the current segment; a streaming media player can play the corresponding audio directly from this URL; in actual use, the address should be further encrypted to improve data security.
Wherein:
Sequence No. (VSRL) = Sequence No. (SSRL) = Sequence No. (TSRL)
TABLE 3 TSRL example
Sequence No. | Original Language Code | Code Page | Characters |
---|---|---|---|
Wherein:
Sequence No. represents the mark sequence number, the same as in the VSRL;
Original Language Code represents the language of the original text, expressed in the ISO 639-1 standard, e.g. en for English, zh for Chinese;
Code Page represents the character set of the text encoding, e.g. 1209 for UTF-8 Unicode;
Characters represents the URL of the file storing the text;
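The one-to-one correspondence of VSRL, SSRL and TSRL by Sequence No. can be sketched as three parallel lists, where a text match in the TSRL resolves to the audio and video segments carrying the same sequence number. The entries below are illustrative examples shaped after Tables 1 and 2, not data from the patent:

```python
# Hypothetical sketch of the three parallel recording lists. Field names
# follow the tables above; the values are made-up examples.
def parse_offset(hms: str) -> int:
    """Convert an 'HH:MM:SS.mmm' Time Offset into milliseconds."""
    h, m, s = hms.split(":")
    return (int(h) * 3600 + int(m) * 60) * 1000 + round(float(s) * 1000)

VSRL = [{"seq": 0, "offset": "00:00:00.000", "duration": 1000, "url": "VS001.mp4"},
        {"seq": 1, "offset": "00:00:01.000", "duration": 1000, "url": "VS002.mp4"}]
SSRL = [{"seq": 0, "url": "SS001.wav"}, {"seq": 1, "url": "SS002.wav"}]
TSRL = [{"seq": 0, "text": "opening remarks"}, {"seq": 1, "text": "budget review"}]

def media_for_match(query: str):
    """Return (audio_url, video_url) for the first TSRL entry containing query."""
    for entry in TSRL:
        if query in entry["text"]:
            seq = entry["seq"]   # same Sequence No. keys into SSRL and VSRL
            return SSRL[seq]["url"], VSRL[seq]["url"]
    return None
```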
step four, the user searches the conference record by entering a text or voice query; if voice is received, it is converted to text by the speech-to-text module; the text is then matched against the previously stored conference information, and the corresponding audio or video information is returned together with the text converted from the speech;
the text matching process comprises the following steps:
step a, first-level text matching: the text generated from the user's search is matched against the text stored in the TSRL; if a match is found, the audio information of the corresponding time period is returned, and if corresponding video information exists, the video of that time period is returned directly;
step b, second-level text matching: if the first level fails to match, the text is reduced to finer granularity through SBD and matched again; if a match is found, the corresponding audio or video information is returned;
step c, third-level processing: if the second level fails to match, the information is decomposed into still finer granularity through WS and matched again; if a match is found, the corresponding audio or video information is returned; otherwise the query information cannot be matched;
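The three-level matching in steps a to c can be sketched as a cascade from whole-query matching down to sentence and then word granularity. The naive split calls below merely stand in for SBD and WS; a real system would use NLP models for sentence boundary detection and word segmentation:

```python
# Hedged sketch of the multi-level matching cascade. tsrl_texts plays the
# role of the text stored in the TSRL; the returned index is the Sequence
# No. that keys into the SSRL/VSRL, or None if no level matches.
def match_query(query: str, tsrl_texts: list):
    def find(fragments):
        for frag in fragments:
            for seq, text in enumerate(tsrl_texts):
                if frag and frag in text:
                    return seq
        return None

    seq = find([query])                  # level 1: whole query text
    if seq is None:
        seq = find(query.split(". "))    # level 2: sentence granularity (SBD stand-in)
    if seq is None:
        seq = find(query.split())        # level 3: word granularity (WS stand-in)
    return seq                           # None: the query information cannot be matched
```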
and step five, when viewing the returned result, the user can quickly retrieve the surrounding context, i.e. view the conference information before and after the retrieved time period; the matched content is highlighted in the displayed text, audio or video, and the user can intuitively locate, select and modify the corresponding content.
To sum up, the invention records the conference in multimedia form, including archiving the complete video, audio, text and other forms of the conference; sends the audio stream to a speech recognition module to convert speech into text; marks the video, audio and text of the conference record according to conference time; and stores them in one-to-one correspondence by time mark. The user queries by entering text or voice; if voice is received, it is converted to text by the speech-to-text module and matched against the conference information through multi-level processing. When a conference record is matched, the timeline position of the corresponding record is displayed, including the surrounding paragraphs. The user can select text through an interactive device such as a mouse or touch screen; the selected text and the corresponding audio are highlighted, the audio of the matched conference speech is played so the user hears it directly, and if corresponding video exists, the matching video clip is played. The user can select any paragraph in the text module and the corresponding audio or video is located and played synchronously; conversely, while browsing audio or video content, the corresponding text is displayed immediately. This makes later analysis and understanding of the conference more convenient and greatly improves the conference-record retrieval experience.
While certain exemplary embodiments of the present invention have been described above by way of illustration only, it will be apparent to those of ordinary skill in the art that the described embodiments may be modified in various different ways without departing from the spirit and scope of the invention. Accordingly, the drawings and description are illustrative in nature and should not be construed as limiting the scope of the invention.
Claims (6)
1. An intelligent conference information retrieval method is characterized by comprising the following steps:
step one, recording conference information in multimedia form in real time throughout the conference, including archiving the complete video, audio, text and other forms of the conference;
step two, extracting the audio stream from the conference video content: using demultiplexing (demux), the audio stream is copied out of the media file or stream container and sent to a speech recognition module, which converts the speech into text information and stores it, while the original video file is left unchanged;
step three, marking the video, audio and text of the conference record according to conference time, using speech detection or silence detection as the basis for judging segment start and end, and further applying NLP (natural language processing) context techniques, including but not limited to SBD (sentence boundary detection) and finer-grained WS (word segmentation), to process the speech content sentence by sentence or word by word; the processed conference record content is then marked by sentence and by word and stored;
step four, the user searches the conference record by entering a text or voice query; if voice is received, it is converted to text by the speech-to-text module; the text is then matched against the previously stored conference information, and the corresponding audio or video information is returned together with the text converted from the speech;
and step five, when viewing the returned result, the user can quickly retrieve the surrounding context, i.e. view the conference information before and after the retrieved time period; the matched content is highlighted in the displayed text, audio or video, and the user can intuitively locate, select and modify the corresponding content.
2. The intelligent conference information retrieval method according to claim 1, wherein: in step one, if the conference is a network video conference, the conference information is obtained directly through the network; if it is a non-network conference, it is recorded through multimedia devices such as audio and video recorders, then extracted and converted.
3. The intelligent conference information retrieval method according to claim 1, wherein: the text information converted from speech in step two can also be used to display and record real-time conference subtitles while being stored.
4. The intelligent conference information retrieval method according to claim 1, wherein: the time intervals marked in step three are delimited by sentences or by pauses in the audio containing the speech content.
5. The intelligent conference information retrieval method according to claim 1, wherein: the marked video segments, audio segments and text segments in step three are stored in one-to-one correspondence in time-sequenced tables, wherein the video segments are recorded in chronological order in a list VSRL (Video Segments Recording List), the audio segments in a list SSRL (Speech Segments Recording List), and the text segment information in a list TSRL (Text Segments Recording List).
6. The intelligent conference information retrieval method according to claim 5, wherein the matching process in step four comprises the following steps:
step a, first-level text matching: the text generated from the user's search is matched against the text stored in the TSRL; if a match is found, the audio information of the corresponding time period is returned, and if corresponding video information exists, the video of that time period is returned directly;
step b, second-level text matching: if the first level fails to match, the text is reduced to finer granularity through SBD and matched again; if a match is found, the corresponding audio or video information is returned;
and step c, third-level processing: if the second level fails to match, the information is decomposed into still finer granularity through WS and matched again; if a match is found, the corresponding audio or video information is returned; otherwise the query information cannot be matched.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110603641.6A CN113326387B (en) | 2021-05-31 | 2021-05-31 | Intelligent conference information retrieval method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110603641.6A CN113326387B (en) | 2021-05-31 | 2021-05-31 | Intelligent conference information retrieval method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113326387A true CN113326387A (en) | 2021-08-31 |
CN113326387B CN113326387B (en) | 2022-12-13 |
Family
ID=77422786
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110603641.6A Active CN113326387B (en) | 2021-05-31 | 2021-05-31 | Intelligent conference information retrieval method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113326387B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114385859A (en) * | 2021-12-29 | 2022-04-22 | 北京理工大学 | Multi-modal retrieval method for video content |
CN114661943A (en) * | 2022-05-21 | 2022-06-24 | 中科云策(深圳)科技成果转化信息技术有限公司 | Conference information storage management system |
CN115828907A (en) * | 2023-02-16 | 2023-03-21 | 南昌航天广信科技有限责任公司 | Intelligent conference management method, system, readable storage medium and computer equipment |
WO2023093092A1 (en) * | 2021-11-26 | 2023-06-01 | 华为技术有限公司 | Minuting method, and terminal device and minuting system |
CN116708055A (en) * | 2023-06-06 | 2023-09-05 | 深圳市艾姆诗电商股份有限公司 | Intelligent multimedia audiovisual image processing method, system and storage medium |
WO2024093442A1 (en) * | 2022-10-31 | 2024-05-10 | 北京字跳网络技术有限公司 | Method and apparatus for checking audiovisual content, and device and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105045828A (en) * | 2015-06-26 | 2015-11-11 | 徐信 | Retrieval system and method for accurate positioning of audio/video speech information |
CN108345679A (en) * | 2018-02-26 | 2018-07-31 | 科大讯飞股份有限公司 | A kind of audio and video search method, device, equipment and readable storage medium storing program for executing |
CN111814028A (en) * | 2020-09-14 | 2020-10-23 | 腾讯科技(深圳)有限公司 | Information searching method and device |
CN112765460A (en) * | 2021-01-08 | 2021-05-07 | 北京字跳网络技术有限公司 | Conference information query method, device, storage medium, terminal device and server |
CN112839195A (en) * | 2020-12-30 | 2021-05-25 | 深圳市皓丽智能科技有限公司 | Method and device for consulting meeting record, computer equipment and storage medium |
- 2021-05-31: application CN202110603641.6A, granted as CN113326387B (Active)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105045828A (en) * | 2015-06-26 | 2015-11-11 | 徐信 | Retrieval system and method for accurate positioning of audio/video speech information |
CN108345679A (en) * | 2018-02-26 | 2018-07-31 | 科大讯飞股份有限公司 | A kind of audio and video search method, device, equipment and readable storage medium storing program for executing |
CN111814028A (en) * | 2020-09-14 | 2020-10-23 | 腾讯科技(深圳)有限公司 | Information searching method and device |
CN112839195A (en) * | 2020-12-30 | 2021-05-25 | 深圳市皓丽智能科技有限公司 | Method and device for consulting meeting record, computer equipment and storage medium |
CN112765460A (en) * | 2021-01-08 | 2021-05-07 | 北京字跳网络技术有限公司 | Conference information query method, device, storage medium, terminal device and server |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023093092A1 (en) * | 2021-11-26 | 2023-06-01 | 华为技术有限公司 | Minuting method, and terminal device and minuting system |
CN114385859A (en) * | 2021-12-29 | 2022-04-22 | 北京理工大学 | Multi-modal retrieval method for video content |
CN114661943A (en) * | 2022-05-21 | 2022-06-24 | 中科云策(深圳)科技成果转化信息技术有限公司 | Conference information storage management system |
WO2024093442A1 (en) * | 2022-10-31 | 2024-05-10 | 北京字跳网络技术有限公司 | Method and apparatus for checking audiovisual content, and device and storage medium |
CN115828907A (en) * | 2023-02-16 | 2023-03-21 | 南昌航天广信科技有限责任公司 | Intelligent conference management method, system, readable storage medium and computer equipment |
CN116708055A (en) * | 2023-06-06 | 2023-09-05 | 深圳市艾姆诗电商股份有限公司 | Intelligent multimedia audiovisual image processing method, system and storage medium |
CN116708055B (en) * | 2023-06-06 | 2024-02-20 | 深圳市艾姆诗电商股份有限公司 | Intelligent multimedia audiovisual image processing method, system and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN113326387B (en) | 2022-12-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113326387B (en) | Intelligent conference information retrieval method | |
CN108305632B (en) | Method and system for forming voice abstract of conference | |
US9576581B2 (en) | Metatagging of captions | |
US10225625B2 (en) | Caption extraction and analysis | |
CN102075695B (en) | New generation intelligent cataloging system and method facing large amount of broadcast television programs | |
JP4466564B2 (en) | Document creation / viewing device, document creation / viewing robot, and document creation / viewing program | |
CN109246472A (en) | Video broadcasting method, device, terminal device and storage medium | |
CN107968959B (en) | Knowledge point segmentation method for teaching video | |
US20100299131A1 (en) | Transcript alignment | |
CN105245917A (en) | System and method for generating multimedia voice caption | |
JP2007519987A (en) | Integrated analysis system and method for internal and external audiovisual data | |
CN101202864A (en) | Player for movie contents | |
CN110781328A (en) | Video generation method, system, device and storage medium based on voice recognition | |
Bougrine et al. | Toward a Web-based speech corpus for Algerian dialectal Arabic varieties | |
JP2012181358A (en) | Text display time determination device, text display system, method, and program | |
JP2018033048A (en) | Metadata generation system | |
US20230281248A1 (en) | Structured Video Documents | |
CN102136001B (en) | Multi-media information fuzzy search method | |
US6813624B1 (en) | Method and apparatus for archival and retrieval of multiple data streams | |
KR20210138311A (en) | Apparatus for generating parallel corpus data between text language and sign language and method therefor | |
CN106550268B (en) | Video processing method and video processing device | |
JP3437617B2 (en) | Time-series data recording / reproducing device | |
CN109376145B (en) | Method and device for establishing movie and television dialogue database and storage medium | |
Shahraray et al. | Pictorial transcripts: Multimedia processing applied to digital library creation | |
KR20010037652A (en) | Audio indexing system and method, and audio retrieval system and method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||