CN116013306A - Conference record generation method, device, server and storage medium - Google Patents

Conference record generation method, device, server and storage medium

Info

Publication number
CN116013306A
Authority
CN
China
Prior art keywords
data, current, conference, record, client
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211713319.XA
Other languages
Chinese (zh)
Inventor
唐少杰
罗生
刘小东
康大龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Migu Cultural Technology Co Ltd
China Mobile Communications Group Co Ltd
Original Assignee
Migu Cultural Technology Co Ltd
China Mobile Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Migu Cultural Technology Co Ltd, China Mobile Communications Group Co Ltd filed Critical Migu Cultural Technology Co Ltd
Priority to CN202211713319.XA
Publication of CN116013306A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The application discloses a conference record generation method, a device, a server and a storage medium. The method is used by the server in a conference system and comprises the following steps: acquiring the current voice subtitle of the current speaking client; determining the target text cache data corresponding to the current speaking client; updating the first current conference data local to the server based on the current voice subtitle and the target text cache data to obtain updated first current conference data; sending the updated first current conference data to the clients, so that each client updates the current record data in its conference record document based on the first current conference data; and, if the current voice subtitle has a whole-sentence ending identifier, sending a preset identifier to the clients, so that each client saves the updated current record data as finalized record data in the conference record document and generates new current record data according to a preset rule. The method and the device can thus generate a conference record in real time while preserving complete semantics.

Description

Conference record generation method, device, server and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a method, an apparatus, a server, and a storage medium for generating a conference record.
Background
In the related art, video conferences generally adopt real-time speech recognition to convert the audio streams of conference participants into text in real time, forming both subtitles and a conference record; the subtitles are transmitted to the clients for display. A conference record, however, is a formal document: its format cannot be as fragmented as subtitles, and sentences must be broken according to the speaker's context.
However, in existing schemes, the sentence breaking applied to the conference record cannot truly reflect the speaker's content, so semantic integrity cannot be guaranteed and the conference record cannot be generated in real time.
Summary of the application
The main purpose of the application is to provide a method, a device, a server and a storage medium for generating conference records, which aim to solve the technical problem that the conference records cannot be generated in real time.
In order to achieve the above object, the present application provides a method for generating a conference record, which is used for a server in a conference system, and the method includes:
acquiring a current voice subtitle of a current speaking client of a conference; the current voice subtitle is obtained by converting short sentence voice data of a current speaking client into text;
determining target text cache data corresponding to the current speaking client;
updating first current conference data local to a server based on the current voice subtitle and target text cache data to obtain updated first current conference data;
the updated first current conference data is sent to the client so that the client can update the current record data in the conference record document based on the first current conference data;
if the current voice subtitle has a whole-sentence ending identifier, sending a preset identifier to all clients, so that each client saves the updated current record data as finalized record data in the conference record document and generates new current record data according to a preset rule.
In a possible embodiment of the present application, before the updating of the first current conference data local to the server based on the current voice subtitle and the target text cache data to obtain updated first current conference data, the method further includes:
extracting a result type identifier of the current voice subtitle;
if the result type identifier is a short sentence ending identifier and does not have a whole sentence ending identifier, updating the target text cache data based on the current voice subtitle.
In a possible embodiment of the present application, after extracting the result type identifier of the current voice subtitle, the method further includes:
and if the result type identifier is a short sentence ending identifier and a whole-sentence ending identifier is present, clearing the target text cache data.
In a possible embodiment of the present application, after the extracting the result type identifier of the current voice subtitle, the method further includes:
judging whether the target text cache data is blank data or not;
and if the target text cache data is blank data, judging whether the current voice subtitle has the whole-sentence ending identifier.
In a possible embodiment of the present application, the updating of the first current conference data local to the server based on the current voice subtitle and the target text cache data to obtain updated first current conference data includes:
if the current voice subtitle does not have the whole-sentence ending identifier, updating the first current conference data local to the server based on the current voice subtitle and the target text cache data to obtain current conference intermediate data;
if the current voice subtitle has the whole-sentence ending identifier, updating the first current conference data local to the server based on the current voice subtitle and the target text cache data to obtain finalized conference data, wherein the finalized conference data carries a preset identifier.
In a possible embodiment of the present application, after determining the target text cache data corresponding to the current speaking client, the method further includes:
judging whether other text cache data except the target text cache data in all the text cache data are blank data or not;
if not all of them are blank data, taking at least one client corresponding to the other text cache data that is not blank data as an interrupted speaking client;
adding a short sentence ending identifier and a preset interruption identifier to the latest voice subtitle of the interrupted speaking client, and updating the other text cache data that is not blank data to obtain interrupted text cache data;
extracting the timestamp of at least one piece of interrupted text cache data;
obtaining a sorting result of the at least one piece of interrupted text cache data according to the timestamps;
and sequentially sending the at least one piece of interrupted text cache data to the clients according to the sorting result, so that each client updates the current record data in the conference record document based on the interrupted text cache data, saves the updated current record data as finalized record data in the conference record document, and generates new current record data according to a preset rule.
In a possible embodiment of the present application, after the judging of whether the text cache data other than the target text cache data is blank data, the method further includes:
and if all of them are blank data, executing the step of updating the first current conference data local to the server based on the current voice subtitle and the target text cache data to obtain updated first current conference data.
In a second aspect, the present application further provides a conference record generating device, configured on a server of a conference system, where the device includes:
the text acquisition module is used for acquiring the current voice subtitle of the current speaking client of the conference; the current voice subtitle is obtained by converting voice data of the current speaking client into text;
the cache determining module is used for determining target text cache data corresponding to the current speaking client side;
the conference data updating module is used for updating the first current conference data of the local server based on the current voice subtitle and the target text cache data to obtain updated first current conference data;
the record sending module is used for sending the updated first current conference data to all clients so that all clients update conference record documents local to the clients based on the first current conference data;
and the record storage module is used for sending a preset identifier to all clients if the current voice subtitle has the whole-sentence ending identifier, so that each client saves the updated current record data as finalized record data in the conference record document and generates new current record data according to a preset rule.
In a third aspect, the present application further provides a server, including: a processor, a memory, and a conference record generation program stored in the memory, which when executed by the processor implements the steps of the conference record generation method described above.
In a fourth aspect, the present application further provides a computer-readable storage medium, on which a conference record generation program is stored, which when executed by a processor implements the conference record generation method described above.
The conference record generation method provided by the embodiments of the application is used by a server in a conference system and includes: acquiring the current voice subtitle of the current speaking client of the conference; determining the target text cache data corresponding to the current speaking client; updating the first current conference data local to the server based on the current voice subtitle and the target text cache data to obtain updated first current conference data; sending the updated first current conference data to the clients, so that each client updates the current record data in its conference record document based on the first current conference data; and, if the current voice subtitle has the whole-sentence ending identifier, sending a preset identifier to all clients, so that each client saves the updated current record data as finalized record data in the conference record document and generates new current record data according to a preset rule.
Therefore, in the embodiments of the present application, the server is configured with a cache area corresponding to each client. The cache area stores text cache data comprising the historical voice short sentence subtitles before the current moment, i.e. the short sentences produced while the user speaks continuously; in other words, an incomplete sentence is stored. When a new current voice subtitle is received, its text is sent to the clients together with the incomplete sentence to update the conference record, so that, just as subtitles are formed in real time during the conference, the conference record built on those subtitles is updated in real time. If the newly received text has the whole-sentence ending identifier, the clients are prompted to save the updated current record data as finalized record data in the conference record document and to generate new current record data according to a preset rule, i.e. to start recording a new sentence, which ensures the semantic integrity of the generated conference record.
Drawings
FIG. 1 is a schematic diagram of an embodiment of a conference system according to the present application;
fig. 2 is a schematic structural diagram of a server in a hardware running environment according to an embodiment of the present application;
Fig. 3 is a schematic flow chart of a first embodiment of a conference record generating method of the present application;
fig. 4 is a schematic flow chart of a second embodiment of a conference record generating method of the present application;
fig. 5 is a flowchart of a third embodiment of a conference record generating method according to the present application.
The realization, functional characteristics and advantages of the present application will be further described with reference to the embodiments and the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In the related art, with the development of artificial intelligence and especially speech recognition technology, not only has the accuracy of speech-to-text conversion greatly improved, but the generated text can also be processed, for example by sentence breaking and punctuation addition, according to information such as the speaker's semantics and context. However, waiting for complete semantics is difficult in real-time scenarios, especially video conferences. Users in scenarios such as video conferences or live broadcasting are very sensitive to delay, so subtitles cannot lag significantly; existing schemes therefore either generate the conference record in real time directly from the received subtitles or generate it by post-processing after the conference ends.
However, a conference record is a relatively formal written record of a meeting: it requires a neat format, must truly record the participants' speech, and must reflect the speakers' semantics. Existing schemes that satisfy the real-time requirement break sentences in ways that do not truly reflect what the speaker said. A person needs a certain time to finish a sentence: transcribing only after the sentence is fully spoken fails the real-time requirement, while forcibly splitting it into two or more sentences in the middle yields a conference record that does not completely and truly reflect the speaker's semantics.
In addition, in schemes that post-process the conference record, the record can only be generated after the conference ends. Such schemes cannot satisfy scenarios in which a participant leaves midway and wants to download the conference record, wants to view the record while the conference is still in progress, or did not understand a speaker's sentence and wants to review the earlier record.
Therefore, the present application provides a solution based on subtitles updated in real time: the conference record is likewise updated in real time; if newly received text has a whole-sentence ending identifier, the subtitle cache is cleared and the client is prompted to save the updated current record data as finalized record data in the conference record document and to generate new current record data according to a preset rule, thereby ensuring the semantic integrity of the conference record generated by the present application.
The inventive concepts of the present application are further described below in conjunction with some specific embodiments.
The conference system used to implement the technology of the present application is described first:
referring to fig. 1, fig. 1 is a schematic architecture diagram of a conference system according to an exemplary embodiment. As shown in fig. 1, the conference system may include a server 11, a network 12, a client 13, and a voice recognition server 15.
The server 11 may be a physical server comprising a separate host, or the server 11 may be a virtual server carried by a cluster of hosts. During operation, the server 11 may run a server-side program of the online conference application to implement relevant service functions of the application, such as creating a video conference, and send the speaking data of the current speaking client to other clients.
Network 12 may include various types of wired or wireless networks. In one embodiment, the network 12 may include the Public Switched Telephone Network (PSTN) and the internet. The client 13 may interact with the server 11 via the network 12, and the speech recognition server 15 may interact with the server 11 via the network 12.
The client 13 may comprise electronic devices such as smartphones, tablet devices, notebook computers, palmtop computers (PDAs, Personal Digital Assistants), etc., to which one or more embodiments of the present description are not limited. During operation, the client 13 may run a video conference client-side program to implement the relevant business functions of the application.
The voice recognition server 15 may obtain the voice collected by a client and convert it into text by voice recognition technology. It is worth mentioning that, with the development of speech recognition technology, not only has the accuracy of speech-to-text conversion greatly improved, but the generated text can also be processed, for example by sentence breaking and punctuation addition, according to information such as the speaker's semantics and context.
It will of course be appreciated that in some embodiments the speech recognition server may also be configured as a module in the server, or in the client.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a server of a hardware running environment according to an embodiment of the present application.
As shown in fig. 2, the server may include: a processor 1001, such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connected communication between these components. The user interface 1003 may include a display (Display) and an input unit such as a keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed Random Access Memory (RAM) or a stable Non-Volatile Memory (NVM), such as a disk memory. The memory 1005 may also optionally be a storage device separate from the processor 1001 described above.
Those skilled in the art will appreciate that the architecture shown in fig. 2 is not limiting and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
As shown in fig. 2, the memory 1005, as one type of storage medium, may include an operating system, a data storage module, a network communication module, a user interface module, and a conference record generation program.
In the server illustrated in fig. 2, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with a user. The server of the present application calls the conference record generation program stored in the memory 1005 through the processor 1001 and executes the conference record generation method provided in the embodiments of the present application.
Based on, but not limited to, the above hardware structure, the present application provides a first embodiment of a conference record generation method. Referring to fig. 3, fig. 3 shows a schematic flow chart of the first embodiment of the conference record generation method of the present application.
It should be noted that although a logical order is depicted in the flowchart, in some cases the steps depicted or described may be performed in a different order than presented herein.
In this embodiment, a method for generating a conference record includes:
step S101, acquiring a current voice subtitle of a current speaking client of a conference; the current voice subtitle is obtained by converting short sentence voice data of a current speaking client into text.
Specifically, in this embodiment, each speaker collects audio through a corresponding client, and transmits the audio as one audio stream to the speech recognition server in real time for text transcription, and the speech recognition server transcribes the received multiple audio streams into corresponding multiple subtitles in real time and transmits the multiple subtitles to the server of the conference system. Thus, the server can acquire the current voice subtitle of the current speaking client side of the conference in real time.
It should be noted that the voice recognition server determines in real time, from the audio content and semantics, whether the current caption is an intermediate result, i.e. an intermediate caption of a sentence that has not yet been finished, or a final result, i.e. a short sentence: the final subtitle after the sentence has been corrected, with punctuation marks added to the transcribed text according to the semantics. Thus, while a speaker is still producing a complete sentence, multiple intermediate results are transmitted in real time and displayed in full in the subtitles of all clients, which satisfies the real-time requirement of subtitles.
In one example, the current voice subtitle is sent by the voice recognition server to the server in the form of a subtitle message. The contents of the subtitle message include, but are not limited to: (1) the speaker's subtitle text; (2) the subtitle timestamp; (3) whether it is an intermediate result.
In some embodiments, whether the subtitle is an intermediate result is represented by a result type identifier resultType: resultType=0 indicates that the current voice subtitle is an intermediate result, and resultType=1 indicates a final result. That is, resultType=1 can be regarded as a short sentence ending identifier.
For example, if the current voice subtitle is "we first", then resultType=0; if the current voice subtitle is "today's weather is good,", then resultType=1.
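As an illustration, such a subtitle message might be modeled as follows. This is a minimal sketch: the field names (text, timestamp, result_type, speaker_id, conference_id) are assumptions for illustration, not the message format defined by this application.

```python
from dataclasses import dataclass

INTERMEDIATE = 0  # sentence still in progress
FINAL = 1         # corrected short sentence, punctuation added

@dataclass
class SubtitleMessage:
    """One caption message from the voice recognition server."""
    text: str           # the speaker's subtitle text
    timestamp: float    # when the subtitle was generated
    result_type: int    # INTERMEDIATE or FINAL
    speaker_id: str     # identifies the speaking client
    conference_id: str  # identifies the conference

def is_short_sentence_end(msg: SubtitleMessage) -> bool:
    # resultType=1 acts as the short sentence ending identifier
    return msg.result_type == FINAL
```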
It will be appreciated that a speaker's sentence may last a while: a few pauses may occur while the speaker talks, marks such as commas are added during transcription, and a complete sentence may ultimately comprise one or more final results, i.e. one or more short sentences. The presentation of the conference record therefore cannot simply follow the presentation of the subtitles, and the subtitles and the conference record are recorded separately in this embodiment. For the subtitles, to meet their real-time requirement, the current voice subtitle can be used directly for the subtitle field displayed by the client, which shows the original subtitle content transcribed by the voice recognition component in real time.
Step S102, determining target text cache data corresponding to the current speaking client side.
The server is provided with a plurality of cache areas in one-to-one correspondence with the clients. Each cache area stores text cache data, which comprises the historical voice short sentence subtitles before the current moment.
It can be appreciated that in the conference system provided in this embodiment, each participant has a dedicated subtitle cache, i.e. text cache data, used to record the previous historical voice short sentence subtitles. That is, when a speaker's sentence lasts a while, pauses occur, marks such as commas are added during transcription, and the complete sentence ultimately comprises one or more final results, i.e. one or more short sentences; these short sentences are stored in the cache area, yielding the text cache data.
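A minimal sketch of such per-client caches, assuming a simple in-memory dictionary keyed by client ID (this structure is an illustrative assumption, not mandated by the application):

```python
from collections import defaultdict

class PhraseCache:
    """Per-client text cache data: the finished short sentences of the
    sentence currently being spoken (intermediate results are not cached)."""

    def __init__(self) -> None:
        self._phrases: dict[str, list[str]] = defaultdict(list)

    def get(self, client_id: str) -> list[str]:
        return self._phrases[client_id]

    def append(self, client_id: str, phrase: str) -> None:
        self._phrases[client_id].append(phrase)

    def clear(self, client_id: str) -> None:
        self._phrases[client_id].clear()

    def is_blank(self, client_id: str) -> bool:
        return not self._phrases[client_id]
```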
Step S103, updating the first current conference data of the local server based on the current voice subtitle and the target text cache data, and obtaining updated first current conference data.
Specifically, the server updates the first current conference data local to the server according to the latest received current voice subtitle and target text cache data.
It can be understood that when the target text cache data is blank data, the current voice subtitle is directly used as the text of the current conference data.
When the target text cache data stores a plurality of short sentences, the current voice subtitle is spliced after the last short sentence of the target text cache data, thereby forming the first current conference data.
In one example, the current voice subtitle is "buy snacks." and the target text cache data stores two short sentences, "today's weather is good," and "we go to the supermarket first,", the timestamp of "today's weather is good," being earlier. Then "buy snacks." is spliced after "we go to the supermarket first,", and the first current conference data becomes "today's weather is good, we go to the supermarket first, buy snacks.".
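In sketch form, the splicing step might look as follows (a minimal illustration; the function name and list representation are assumptions):

```python
def build_current_conference_data(cached_phrases: list[str],
                                  current_subtitle: str) -> str:
    """Splice the newest subtitle after the speaker's cached short
    sentences to form the updated current conference data."""
    if not cached_phrases:          # blank cache: the subtitle stands alone
        return current_subtitle
    return "".join(cached_phrases) + current_subtitle

# Example from the text:
phrases = ["today's weather is good, ", "we go to the supermarket first, "]
print(build_current_conference_data(phrases, "buy snacks."))
# -> today's weather is good, we go to the supermarket first, buy snacks.
```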
Step S104, the updated first current conference data is sent to the client, so that the client updates the current record data in the conference record document based on the first current conference data.
After the first current conference data is obtained, the server sends the first current conference data to all clients, so that all clients update the current record data in the conference record document based on the first current conference data.
The client's conference record may be preset with a conference record template, which includes, but is not limited to: finalized record data and current record data. The finalized record data consists of completed sentences with complete semantics or interrupted sentences. The current record data is updated in real time and changes along with the subtitle.
In one example, the speaker says "today's weather is good, we go to the supermarket first, buy snacks." While this sentence is being spoken, the current record data is continuously updated, starting from the first "today", until it becomes "today's weather is good, we go to the supermarket first, buy snacks."; the subtitle at that moment is "buy snacks.". This sentence ends with a period, i.e. a whole-sentence ending identifier, at which point the current record data is converted into finalized record data and stored.
Step S105, if the current voice subtitle has the whole-sentence ending identifier, a preset identifier is sent to all clients, so that each client saves the updated current record data as finalized record data in the conference record document and generates new current record data according to a preset rule.
Specifically, the whole-sentence ending identifier includes, but is not limited to, punctuation marks such as a period, a question mark or an exclamation mark added to the subtitle by the voice recognition server based on voice recognition technology. It should be noted that the whole-sentence ending identifier may be a Chinese or an English mark, which is not limited here.
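A sketch of such a check, assuming the set of terminators below (Chinese and English periods, question marks and exclamation marks); the application does not limit the exact set:

```python
SENTENCE_END_MARKS = {"。", "？", "！", ".", "?", "!"}

def has_whole_sentence_end(subtitle: str) -> bool:
    """True if the subtitle carries a whole-sentence ending identifier."""
    tail = subtitle.rstrip()[-1:]   # last visible character, if any
    return tail in SENTENCE_END_MARKS
```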
That is, in this embodiment, if the newly received text has the whole-sentence ending identifier, the client is prompted to save the updated current record data as finalized record data in the conference record document, which ensures the semantic integrity of the conference record generated by this application.
In addition, to facilitate subsequent recording, in this embodiment the client is also prompted to generate new current record data according to a preset rule. The preset rule here may be configured in advance by the server or by the client.
In one example, if the preset rule is that adjacent semantically complete sentences are displayed in separate paragraphs, then after the client saves the current record data as finalized record data in the conference record document, it starts a new line and displays the newly updated current record data on that line.
Suppose the speaker's continuous speech is: "today's weather is good, we go to the supermarket first, buy snacks. After buying things, let's go camping in the suburbs together!". When the current voice subtitle is "buy snacks.", the client saves "today's weather is good, we go to the supermarket first, buy snacks." as finalized record data in the conference record document and then starts a new line; after "After buying things, let's go camping in the suburbs together!" it starts a new line again.
It will be appreciated that in some examples the preset rule may instead be that the current record data, rather than the finalized record data, is displayed prominently; alternatively, the preset rule may be that the finalized record data is shown in a black font and the current record data in a green font, and so on.
It should be noted that the current record data in this embodiment may be blank data.
Alternatively, as an option of this embodiment, the step of updating the first current conference data local to the server based on the current voice subtitle and the target text cache data to obtain updated first current conference data includes:
(1) If the current voice subtitle does not have the whole-sentence ending identifier, updating the first current conference data local to the server based on the current voice subtitle and the target text cache data to obtain current conference intermediate data;
(2) If the current voice subtitle has the whole-sentence ending identifier, updating the first current conference data local to the server based on the current voice subtitle and the target text cache data to obtain finalized conference data, wherein the finalized conference data carries a preset identifier.
The preset identifier may be a special flag carried in the message sent from the server to the client. That is, in this embodiment, current conference intermediate data and finalized conference data can be distinguished by different flag information. When the client receives current conference intermediate data, it simply continues to update the current record data in real time. When it receives finalized conference data, the client, immediately after updating the current record data, saves the updated current record data as finalized record data in the conference record document and generates new current record data according to a preset rule, thereby ensuring the semantic integrity of the conference record generated by the present application.
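A minimal client-side sketch of this dispatch, assuming the server marks finalized conference data with a boolean finalized flag (the flag name and document structure are illustrative assumptions):

```python
class ConferenceRecordDocument:
    """Client-side conference record: finalized sentences plus the live,
    still-changing current record data."""

    def __init__(self) -> None:
        self.finalized: list[str] = []
        self.current: str = ""

    def on_conference_data(self, text: str, finalized: bool) -> None:
        self.current = text         # always refresh the live line first
        if finalized:               # preset identifier received
            self.finalized.append(self.current)
            self.current = ""       # new current record data; the preset
                                    # rule here starts a fresh paragraph
```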
Therefore, in this embodiment, the server is configured with a cache area corresponding to each client, and the cache area stores text cache data comprising the historical voice short sentence subtitles before the current moment, i.e. the short sentences produced while the user speaks continuously; in other words, an incomplete sentence. When a new current voice subtitle is received, its text is sent to the client together with the incomplete sentence to update the conference record, so that, just as subtitles are formed in real time during the conference, the conference record built on those subtitles is updated in real time. If the newly received text has the whole-sentence ending identifier, the client is prompted to save the updated current record data as finalized record data in the conference record document and to generate new current record data according to a preset rule, i.e. to start recording a new sentence, which ensures the semantic integrity of the conference record generated by this embodiment.
In addition, in this embodiment, the conference record is stored in the client, and the client may also query the history conference record.
Based on the above embodiments, a second embodiment of the conference record generation method of the present application is provided. Referring to fig. 4, fig. 4 shows a flowchart of the second embodiment of the conference record generation method of the present application.
It should be noted that although a logical order is depicted in the flowchart, in some cases the steps depicted or described may be performed in a different order than presented herein.
In this embodiment, a method for generating a conference record includes:
step S201, acquiring a current voice subtitle of a current speaking client of a conference; the current voice subtitle is obtained by converting short sentence voice data of a current speaking client into text.
Step S202, determining target text cache data corresponding to the current speaking client side.
Step S203, extracting the result type identifier of the current voice subtitle;
Step S204, if the result type identifier is a short sentence ending identifier and there is no whole-sentence ending identifier, updating the target text cache data based on the current voice subtitle.
Step S205, if the result type identifier is a short sentence ending identifier and there is a whole-sentence ending identifier, clearing the target text cache data.
Step S206, updating the first current conference data of the local server based on the current voice subtitle and the target text cache data, and obtaining updated first current conference data.
Step S207, the updated first current conference data is sent to the client, so that the client updates the current record data in the conference record document based on the first current conference data.
Step S208, if the current voice subtitle has the whole-sentence ending identifier, a preset identifier is sent to all clients, so that each client saves the updated current record data as finalized record data in the conference record document and generates new current record data according to a preset rule.
Specifically, in this embodiment, the current voice subtitle is sent to the server by the voice recognition server in the form of a subtitle message. At this time, contents of the caption message include, but are not limited to: (1) subtitles of a speaker. (2) subtitle time stamp. (3) whether or not it is an intermediate result.
Whether the subtitle is an intermediate result may be represented by the result type identifier resultType: resultType=0 indicates that the current voice subtitle is an intermediate result, and resultType=1 indicates a final result. That is, resultType=1 can be regarded as a short sentence ending identifier. For example, if the current voice subtitle is "we first", then resultType=0; if the current voice subtitle is "today's weather is good,", then resultType=1.
At this time, if the current voice subtitle is an intermediate result, no buffering is performed.
When the current voice subtitle is a final result, i.e. a short sentence, but there is no sentence break, i.e. it is not the end of a semantically complete sentence, the user's previously cached final results, i.e. the target text cache data, are concatenated with this final result. That is, the current voice subtitle is appended to the cache area corresponding to the client, yielding updated target text cache data.
That is, in this embodiment the target text cache data stores only short sentences, which avoids caching the large amount of invalid intermediate data that would be produced by updating the cache in real time from the subtitles, especially their intermediate results, while subtitles are generated in real time. It can also be appreciated that storing only short sentences improves the accuracy of conference record generation.
When the current voice subtitle is a final result and there is a sentence break, i.e. a whole-sentence ending identifier, the target text cache data is cleared. In one example, the speaker's continuous speech is: "today's weather is good, we go to the supermarket first, buy snacks.". When the current voice subtitle is "today", resultType=0 and nothing is cached. When the current voice subtitle is "today's weather is good,", resultType=1, but there is no whole-sentence ending identifier such as a period, so the target text cache data is updated to "today's weather is good,". When the current voice subtitle is "buy snacks.", resultType=1 and the period constitutes a whole-sentence ending identifier, so the cache area is cleared and the target text cache data becomes a blank document.
It can be seen that, in this embodiment, the cache area is cleared each time a semantically complete sentence is produced, leaving the target text cache data as a blank document. This keeps the amount of data in the cache area small, prevents a large amount of historical data from affecting the generation of subsequent conference records, and further improves the accuracy of conference record generation.
That is, in this embodiment the historical voice short sentence subtitles contain only the short sentences of the sentence currently in progress; short sentences belonging to an already semantically complete sentence are not included.
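A sketch of this server-side cache update, combining the result type identifier with the whole-sentence check (reusing the assumed names from the earlier sketches; a minimal illustration, not the application's definitive logic):

```python
SENTENCE_END_MARKS = {"。", "？", "！", ".", "?", "!"}

def update_cache(cache: dict[str, list[str]],
                 client_id: str,
                 subtitle: str,
                 result_type: int) -> None:
    """Cache only final short sentences; clear on a whole-sentence end."""
    if result_type == 0:
        return                          # intermediate result: never cached
    if subtitle.rstrip()[-1:] in SENTENCE_END_MARKS:
        cache[client_id] = []           # sentence complete: empty the cache
    else:
        cache.setdefault(client_id, []).append(subtitle)
```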
As an embodiment, before step S208, the method further includes:
(1) Judging whether the target text cache data is blank data or not;
(2) If the target text cache data is blank data, judging whether the current voice subtitle has the whole-sentence ending identifier.
Specifically, if the current voice subtitle is a final result and the target text cache data is not blank data, step S204 is executed as normal. If the current voice subtitle is a final result and the target text cache data is blank data, it is directly judged whether the current voice subtitle has the whole-sentence ending identifier.
Based on the above embodiments, a third embodiment of the conference record generation method of the present application is provided. Referring to fig. 5, fig. 5 shows a schematic flow chart of the third embodiment of the conference record generation method of the present application.
It should be noted that although a logical order is depicted in the flowchart, in some cases the steps depicted or described may be performed in a different order than presented herein.
In this embodiment, the method includes:
step S301, acquiring a current voice subtitle of a current speaking client of a conference; the current voice subtitle is obtained by converting short sentence voice data of a current speaking client into text.
Step S302, determining target text cache data corresponding to the current speaking client side.
Step S303, judging whether other text cache data except the target text cache data in all the text cache data are blank data or not;
Step S304, if not all of them are blank data, taking at least one client corresponding to the other text cache data that is not blank data as an interrupted speaking client.
Step S305, adding a short sentence ending identifier and a preset interruption identifier to the latest voice subtitle of the interrupted speaking client, and updating the other text cache data that is not blank data to obtain interrupted text cache data.
Step S306, extracting the timestamps of the other text cache data that is not blank data;
Step S307, obtaining a sorting result of the other text cache data that is not blank data according to the timestamps;
Step S308, sequentially sending the other text cache data that is not blank data to the clients according to the sorting result, so that each client updates the current record data in the conference record document based on that data, saves the updated current record data as finalized record data in the conference record document, and generates new current record data according to a preset rule.
Step S309, if all of them are blank data, updating the first current conference data local to the server based on the current voice subtitle and the target text cache data to obtain updated first current conference data;
step S310, the updated first current conference data is sent to the client, so that the client updates the current record data in the conference record document based on the first current conference data.
Step S311, if the current voice subtitle has the whole-sentence ending identifier, a preset identifier is sent to all clients, so that each client saves the updated current record data as finalized record data in the conference record document and generates new current record data according to a preset rule.
Specifically, in this embodiment, the current voice subtitle is sent to the server by the voice recognition server in the form of a subtitle message. The contents of the subtitle message include, but are not limited to: (1) the speaker's subtitle text; (2) the subtitle timestamp; (3) whether it is an intermediate result; (4) the speaker ID; (5) the speaker's name; (6) the conference ID.
At this time, the server fetches the ID information of the other participants in the current conference, excluding the speaking client, based on the speaker ID or speaker name of the current voice subtitle, and then determines the text cache data other than the target text cache data. Because the corresponding cache area is emptied each time a semantically complete sentence is produced, if all the text cache data other than the target text cache data is blank data, no one else has been speaking recently, i.e. the current speaker has not interrupted anyone, and steps S309 to S311 are executed. The implementation of steps S309 to S311 follows steps S103 to S105 or steps S206 to S208 of the above embodiments and is not repeated here.
If the current speaker interrupts the previous speaker's speech, the voice recognition server will not have added a whole-sentence ending identifier to the previous speaker's voice subtitle, and correspondingly the cache area of the previous speaker will not have been emptied. In that case, not all of the text cache data other than the target text cache data is blank data, and a speech-interruption situation is determined.
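In sketch form, the detection reduces to a blank-data check over the other caches (same assumed structures as above):

```python
def speech_interrupted(cache: dict[str, list[str]], current_speaker: str) -> bool:
    """A speech interruption exists if any participant other than the
    current speaker still has non-blank text cache data."""
    return any(phrases
               for client_id, phrases in cache.items()
               if client_id != current_speaker)
```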
Since the previous speaker's speech is incomplete, the previous speaker's voice subtitle closest in time to the current moment is an intermediate result, not a final result, and in this case the server has not cached that latest voice subtitle in the previous speaker's cache area. Therefore, in this embodiment, after the speech-interruption situation is determined, a short sentence ending identifier and a preset interruption identifier are added to the previous speaker's latest voice subtitle, so that it is cached to update the non-blank text cache data and obtain the interrupted text cache data. It can be appreciated that the preset interruption identifier not only records in the conference record the fact that the speech was interrupted; during conference record generation it can also be regarded as, or treated as equivalent to, a whole-sentence ending identifier, thereby ending the previous speaker's sentence. After the update, the server must also clear the interrupted text cache data.
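A sketch of this interruption handling under the same illustrative assumptions (the "(interrupted)" marker is an assumed rendering of the preset interruption identifier):

```python
BREAK_MARK = "(interrupted)"    # assumed rendering of the interruption identifier

def handle_interruption(cache: dict[str, list[str]],
                        current_speaker: str,
                        latest_subtitles: dict[str, str]) -> dict[str, str]:
    """Close out every speaker, other than the current one, whose cache is
    non-blank; return their interrupted text cache data keyed by client."""
    interrupted: dict[str, str] = {}
    for client_id, phrases in cache.items():
        if client_id == current_speaker or not phrases:
            continue                    # blank cache: speaker was not cut off
        tail = latest_subtitles.get(client_id, "")  # latest intermediate subtitle
        interrupted[client_id] = "".join(phrases) + tail + BREAK_MARK
        cache[client_id] = []           # clear the cache after the update
    return interrupted
```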
After the interrupted text cache data is extracted, the current voice subtitle belongs to the current speaker rather than the previous speaker, so the conference record is generated only from the extracted interrupted text cache data of the previous speaker, without combining in the current voice subtitle. The concrete generation process is that the interrupted text cache data is sent to all clients, and each client updates the current record data of its conference record document and stores it as finalized record data. At this time, the part of the current record data corresponding to the previous speaker is completed and converted into finalized record data.
It should be noted that the current speaker continues speaking throughout this process, so the server may generate the conference record for the previous speaker while also performing operations such as text caching for the current speaker's client. Since the text caching for the current speaker's client depends on the current speaker's speaking speed, the conference record generation for the previous speaker and the text caching for the current speaker do not affect each other.
In one example, the previous speaker is saying "today's weather is good, we first" when the current speaker interrupts with "we can go play badminton!". The server has cached "today's weather is good," as the previous speaker's text cache data, while "we first", being an intermediate result, is not cached. On obtaining the current voice subtitle "we can go play badminton!", the server judges that the previous speaker's text cache data is not blank data, i.e. a speech-interruption situation has occurred. The server then obtains the latest voice subtitle "we first" from the subtitle server, the message buffer, or the client's subtitle buffer, adds a short sentence ending identifier and a preset interruption identifier to it, and updates the previous speaker's text cache data to "today's weather is good, we first (interrupted)". The server generates finalized conference data from this and sends it to the clients; each client updates "today's weather is good," in its current record data to "today's weather is good, we first (interrupted)", stores it as finalized record data, and then starts a new line.
Of course, since consecutive interruptions may occur in an actual conference, there may be 2 or more interrupted speaking clients, earlier and later in time. In that case, after the text cache data of all interrupted speaking clients has been updated, all the interrupted text cache data is obtained and then sorted by timestamp. It should be noted that the timestamp of a piece of interrupted text cache data may be the generation timestamp of the subtitle corresponding to its first short sentence, which reflects the speaking order more accurately. After sorting, the pieces of interrupted text cache data are sent to the clients in sequence, and the finalized record data is generated in the same order.
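A minimal sketch of that ordering, assuming each piece of interrupted text cache data carries the generation timestamp of its first short sentence:

```python
from typing import NamedTuple

class InterruptedRecord(NamedTuple):
    client_id: str
    text: str
    first_phrase_ts: float  # timestamp of the first cached short sentence

def order_interrupted(records: list[InterruptedRecord]) -> list[InterruptedRecord]:
    """Sort interrupted text cache data into speaking order before sending."""
    return sorted(records, key=lambda r: r.first_phrase_ts)
```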
In this embodiment, the conference record can thus faithfully reflect relatively complex contexts, such as one or more people interrupting each other mid-speech, ensuring the accuracy of the conference record.
Based on the same inventive concept, in a second aspect, the present application further provides a conference record generating device, configured on a server of a conference system, where the device includes:
The text acquisition module is used for acquiring the current voice subtitle of the current speaking client of the conference; the current voice subtitle is obtained by converting voice data of the current speaking client into text;
the cache determining module is used for determining target text cache data corresponding to the current speaking client side;
the conference data updating module is used for updating the first current conference data of the local server based on the current voice subtitle and the target text cache data to obtain updated first current conference data;
the record sending module is used for sending the updated first current conference data to all clients so that all clients update conference record documents local to the clients based on the first current conference data;
and the record storage module is used for sending a preset identifier to all clients if the current voice subtitle has the whole-sentence ending identifier, so that each client saves the updated current record data as finalized record data in the conference record document and generates new current record data according to a preset rule.
It should be noted that, in this embodiment, each implementation manner of the conference record generating device and the technical effects achieved by the implementation manner may refer to various implementation manners of the conference record generating method in the foregoing embodiment, which are not described herein again.
In addition, the embodiment of the application further provides a computer storage medium on which a conference record generation program is stored; when executed by a processor, the program implements the steps of the conference record generation method described above, so a detailed description is not repeated here, nor is the description of the corresponding beneficial effects. For technical details not disclosed in the embodiments of the computer-readable storage medium of the present application, refer to the description of the method embodiments. As an example, the program instructions may be deployed to be executed on one computing device, on multiple computing devices at one site, or distributed across multiple sites interconnected by a communication network.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by a computer program instructing the relevant hardware; the program may be stored on a computer-readable storage medium and, when executed, may comprise the steps of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
It should be further noted that the above-described apparatus embodiments are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the application, the connection relationships between modules indicate that they have communication connections between them, which may be specifically implemented as one or more communication buses or signal lines. Those of ordinary skill in the art will understand and implement this without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by means of software plus the necessary general-purpose hardware, or of course by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components and the like. Generally, functions performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function can vary: analog circuits, digital circuits, or dedicated circuits. However, a software implementation is in many cases the preferred embodiment for the present application. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a readable storage medium, such as a floppy disk, a USB disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk of a computer, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods of the embodiments of the present application.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the claims, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application, or direct or indirect application in other related technical fields are included in the scope of the claims of the present application.

Claims (10)

1. A conference record generation method, characterized by being used in a server in a conference system, the method comprising:
acquiring a current voice subtitle of a current speaking client of a conference; the current voice subtitle is obtained by converting voice data of the current speaking client into text;
determining target text cache data corresponding to the current speaking client side;
updating the first current conference data of the local server based on the current voice subtitle and the target text cache data to obtain updated first current conference data;
transmitting the updated first current conference data to the client so that the client updates the current record data in the conference record document based on the first current conference data;
and if the current voice subtitle has a whole-sentence ending identifier, sending a preset identifier to the client, so that the client saves the updated current record data as finalized record data in the conference record document and generates new current record data according to a preset rule.
2. The conference record generation method according to claim 1, wherein before said updating the first current conference data local to the server based on the current voice subtitle and the target text cache data to obtain updated first current conference data, the method further comprises:
extracting a result type identifier of the current voice subtitle;
and if the result type identifier is a short sentence ending identifier and does not have a whole sentence ending identifier, updating the target text cache data based on the current voice subtitle.
3. The conference record generation method according to claim 2, wherein after said extracting the result type identifier of the current voice subtitle, the method further comprises:
if the result type identifier is the short sentence ending identifier and the current voice subtitle has the whole sentence ending identifier, clearing the target text cache data.
4. The conference record generation method according to claim 3, wherein after said extracting the result type identifier of the current voice subtitle, the method further comprises:
judging whether the target text cache data is blank data; and
if the target text cache data is blank data, judging whether the current voice subtitle has the whole sentence ending identifier.
5. The conference record generation method according to claim 1, wherein said updating the first current conference data local to the server based on the current voice subtitle and the target text cache data to obtain updated first current conference data comprises:
if the current voice subtitle does not have the whole sentence ending identifier, updating the first current conference data local to the server based on the current voice subtitle and the target text cache data to obtain current conference intermediate data; and
if the current voice subtitle has the whole sentence ending identifier, updating the first current conference data local to the server based on the current voice subtitle and the target text cache data to obtain finalized conference data, the finalized conference data carrying the preset identifier.
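
Claims 2 to 5 together describe how a recognition result is classified and how the cache evolves. A minimal sketch follows, reusing the hypothetical text_caches and PRESET_IDENTIFIER names from the previous sketch; the string value of the result type identifier is likewise an assumption.

```python
SHORT_SENTENCE_END = "short_sentence_end"  # assumed value of the result type identifier

def on_recognition_result(client_id: str, subtitle: str,
                          result_type: str, has_eos: bool) -> None:
    cache = text_caches.get(client_id, "")
    if result_type == SHORT_SENTENCE_END and not has_eos:
        # Claim 2: a phrase ended but the sentence continues, so the
        # subtitle is appended to the target text cache data.
        text_caches[client_id] = cache + subtitle
    elif result_type == SHORT_SENTENCE_END and has_eos:
        # Claim 3: the whole sentence is complete, so the cache is cleared.
        text_caches[client_id] = ""

def merge_and_classify(client_id: str, subtitle: str, has_eos: bool) -> dict:
    # Claims 4-5: merge cache and subtitle; a whole sentence ending identifier
    # turns the result into finalized conference data carrying the preset
    # identifier, otherwise it remains current conference intermediate data.
    merged = text_caches.get(client_id, "") + subtitle
    if has_eos:
        return {"text": merged, "finalized": True, "preset": PRESET_IDENTIFIER}
    return {"text": merged, "finalized": False}
```
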
6. The conference record generation method according to any one of claims 1 to 5, wherein after said determining the target text cache data corresponding to the current speaking client, the method further comprises:
judging whether all text cache data other than the target text cache data are blank data;
if not all of them are blank data, taking the at least one client corresponding to the text cache data that is not blank data as an interrupted speaking client;
adding a short sentence ending identifier and a preset interruption identifier to the latest voice subtitle of each interrupted speaking client, and updating the corresponding non-blank text cache data to obtain interrupted text cache data;
extracting a timestamp of each piece of interrupted text cache data;
obtaining a sorting result of the at least one piece of interrupted text cache data according to the timestamps; and
sequentially sending the at least one piece of interrupted text cache data to the client according to the sorting result, so that the client updates the current record data in the conference record document based on the interrupted text cache data, saves the updated current record data as finalized record data in the conference record document, and generates new current record data according to the preset rule.
7. The conference record generation method according to claim 6, wherein after said judging whether all the text cache data other than the target text cache data are blank data, the method further comprises:
if all of them are blank data, executing the step of updating the first current conference data local to the server based on the current voice subtitle and the target text cache data to obtain updated first current conference data.
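
Claims 6 and 7 handle the case where a new speaker cuts in while other clients still have unflushed fragments. A sketch of that path follows, again with assumed structures; in particular, the (text, timestamp) cache layout exists only so the claimed timestamp ordering has something to operate on.

```python
# client_id -> (cached text, timestamp of the fragment); hypothetical layout.
timed_caches: dict[str, tuple[str, float]] = {}

def collect_interrupted(speaking_client: str) -> list[dict]:
    """Finalize the stale fragments of clients whose speech was cut off."""
    interrupted = []
    for client_id, (text, stamp) in list(timed_caches.items()):
        if client_id == speaking_client or text == "":
            continue
        # Mark the fragment as a finished short sentence that was
        # interrupted, instead of silently discarding it.
        interrupted.append({
            "client": client_id,
            "text": text,
            "identifiers": ["short_sentence_end", "interrupted"],
            "timestamp": stamp,
        })
        timed_caches[client_id] = ("", stamp)

    # Sort by timestamp so every client finalizes the interrupted
    # fragments in the order in which they were actually spoken.
    interrupted.sort(key=lambda item: item["timestamp"])
    return interrupted

# The server would then send each entry of collect_interrupted(...) to the
# clients in order, followed by the normal claim-1 handling of the new speaker.
```
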
8. A conference record generation apparatus, characterized by being configured in a server of a conference system, the apparatus comprising:
a text acquisition module, configured to acquire a current voice subtitle of a current speaking client of a conference, wherein the current voice subtitle is obtained by converting voice data of the current speaking client into text;
a cache determining module, configured to determine target text cache data corresponding to the current speaking client;
a conference data updating module, configured to update first current conference data local to the server based on the current voice subtitle and the target text cache data to obtain updated first current conference data;
a record sending module, configured to send the updated first current conference data to all clients so that each client updates the conference record document local to that client based on the first current conference data; and
a record saving module, configured to send a preset identifier to all the clients if the current voice subtitle has a whole sentence ending identifier, so that each client saves the updated current record data as finalized record data in the conference record document and generates new current record data according to a preset rule.
9. A server, comprising: a processor, a memory, and a conference record generation program stored in the memory, wherein the program, when executed by the processor, implements the steps of the conference record generation method according to any one of claims 1 to 7.
10. A computer-readable storage medium, wherein a conference record generation program is stored on the computer-readable storage medium, and the program, when executed by a processor, implements the conference record generation method according to any one of claims 1 to 7.
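
On the client side, claims 1 and 8 only require that the current record data be updated live and frozen into finalized record data when the preset identifier arrives. A hypothetical counterpart matching the message shapes assumed in the sketches above:

```python
class ConferenceRecordDocument:
    """Holds finalized rows plus one mutable 'current record data' row.

    Hypothetical client-side sketch; the patent specifies the client's
    behavior, not its implementation.
    """

    def __init__(self) -> None:
        self.finalized_records: list[str] = []
        self.current_record: str = ""

    def on_message(self, message: dict) -> None:
        if message.get("event") == "update":
            # Live update: overwrite the current record with the merged
            # text pushed by the server.
            self.current_record = message["data"]["text"]
        elif message.get("type") == "finalize":
            # Preset identifier received: save the current record as
            # finalized record data and start a new (empty) current record,
            # one plausible reading of the claimed "preset rule".
            self.finalized_records.append(self.current_record)
            self.current_record = ""
```

Keeping a single mutable current row plus an append-only list is one simple way to satisfy the "save, then generate new current record data" behavior; the patent leaves the preset rule itself open.
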
CN202211713319.XA 2022-12-29 2022-12-29 Conference record generation method, device, server and storage medium Pending CN116013306A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211713319.XA CN116013306A (en) 2022-12-29 2022-12-29 Conference record generation method, device, server and storage medium


Publications (1)

Publication Number Publication Date
CN116013306A true CN116013306A (en) 2023-04-25

Family

ID=86026536

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211713319.XA Pending CN116013306A (en) 2022-12-29 2022-12-29 Conference record generation method, device, server and storage medium

Country Status (1)

Country Link
CN (1) CN116013306A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination