WO2022037600A1 - Summary recording method and apparatus, computer device and storage medium - Google Patents

Summary recording method and apparatus, computer device and storage medium

Info

Publication number
WO2022037600A1
WO2022037600A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio data
text
target
abstract
target content
Prior art date
Application number
PCT/CN2021/113206
Other languages
English (en)
French (fr)
Inventor
辛格希曼舒 (Himanshu Singh)
Original Assignee
深圳市万普拉斯科技有限公司 (OnePlus Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市万普拉斯科技有限公司 (OnePlus Technology (Shenzhen) Co., Ltd.)
Publication of WO2022037600A1

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 — Machine learning
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 — Training
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/26 — Speech to text systems

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to a method, apparatus, computer equipment and storage medium for abstract recording.
  • at present, users generally record content by manually editing text on a mobile terminal, or have a voice recognition device recognize and store the content to be recorded; however, these recording methods suffer from low recording accuracy.
  • a summary recording method comprising:
  • the text information is processed by a preset number of trained machine learning models, and corresponding candidate text summaries are obtained respectively;
  • the receiving audio data corresponding to the target content on the display interface includes:
  • the recording instruction carries the recording duration, and before the voice recognition is performed on the audio data to obtain text information corresponding to the audio data, the method further includes:
  • the step of performing speech recognition on the audio data to obtain text information corresponding to the audio data is performed.
  • the displaying each of the candidate text summaries on the terminal in a preset format includes any one of the following forms:
  • a display label corresponding to each of the candidate text summaries is generated, and each of the candidate text summaries is folded and displayed in the display area of the terminal through the display label.
  • the obtaining the target text abstract determined from the candidate text abstracts, and associating the target text abstract with the target content includes:
  • the method further includes:
  • speech recognition is performed on the audio data to obtain text information corresponding to the audio data.
  • the method includes:
  • the target text abstract associated with the target content is input into the machine learning model, and the machine learning model is updated to obtain an updated machine learning model.
  • a summary recording device comprising:
  • a receiving module for receiving audio data corresponding to the target content on the display interface
  • a speech recognition module configured to perform speech recognition on the audio data to obtain text information corresponding to the audio data
  • a processing module configured to process the text information through a preset number of trained machine learning models to obtain corresponding candidate text summaries respectively;
  • a display module configured to display each of the candidate text summaries on the terminal in a preset format
  • an association module configured to obtain a target text abstract determined from the candidate text abstracts, and associate the target text abstract with the target content.
  • a computer device includes a memory and a processor, the memory stores a computer program, and the processor implements the following steps when executing the computer program:
  • the text information is processed by a preset number of trained machine learning models, and corresponding candidate text summaries are obtained respectively;
  • a target text abstract determined from the candidate text abstracts is acquired, and the target text abstract is associated with the target content.
  • the text information is processed by a preset number of trained machine learning models, and corresponding candidate text summaries are obtained respectively;
  • a target text abstract determined from the candidate text abstracts is acquired, and the target text abstract is associated with the target content.
  • the text information corresponding to the audio data is obtained by recognizing the audio data of the target content; the text information is processed by a preset number of trained machine learning models to obtain the text summary each model produces from it; by providing the user with multiple text summaries of the audio data, the user can select the most accurate summary among them, which improves recording accuracy.
  • FIG. 1 is an application environment diagram of a summary recording method in one embodiment
  • Fig. 2 is a schematic flowchart of a summary recording method in one embodiment
  • Fig. 3 (a) is a schematic diagram of the display of candidate text abstracts in one embodiment
  • Fig. 3 (b) is a schematic diagram of displayed after the target content is associated with the target text abstract in one embodiment
  • Fig. 4 is a schematic flowchart of the updating steps of the machine learning model in the summary record in one embodiment
  • FIG. 5 is a schematic flowchart of a summary recording method in another embodiment
  • FIG. 6 is a schematic diagram of an application scenario of the summary recording method in one embodiment
  • FIG. 7 is a schematic diagram of an application scenario of the summary recording method in another embodiment;
  • FIG. 8 is a structural block diagram of a summary recording apparatus in one embodiment;
  • FIG. 9 is a structural block diagram of a summary recording apparatus in another embodiment.
  • Figure 10 is a diagram of the internal structure of a computer device in one embodiment.
  • the abstract recording method provided in this application can be applied to the application environment shown in FIG. 1 .
  • the terminal 102 communicates with the server 104 through the network.
  • the server receives the audio data corresponding to the target content on the display interface of the terminal; performs speech recognition on the audio data to obtain the corresponding text information; processes the text information through a preset number of trained machine learning models to obtain corresponding candidate text summaries; displays each candidate text summary on the terminal in a preset format; and acquires the text summary determined from the candidates, associating it with the target content.
  • the terminal 102 can be, but is not limited to, various personal computers, notebook computers, smart phones, and tablet computers, and the server 104 can be implemented by an independent server or a server cluster composed of multiple servers.
  • a summary recording method is provided, and the method is applied to the terminal in FIG. 1 as an example for description, including the following steps:
  • Step 202 Receive audio data corresponding to the target content on the display interface.
  • the target content is the content displayed on the terminal display interface.
  • for example, during online teaching, the target content on the display interface of the terminal is course content data; in another example, when multiple people in an enterprise hold a video conference, the target content displayed on the display interface is conference data.
  • Audio data is the speech data of the speaker describing the target content.
  • the audio data is the speech data of the speaker explaining the course content data.
  • the audio data can be obtained from the server or recorded through the microphone of the terminal, and the way of obtaining the audio data is not limited here.
  • when the terminal detects that the microphone is in monitoring mode, it responds to the recording instruction for the target content triggered on the display interface and obtains the audio data of the target content on the display interface.
  • the recording instruction can be triggered by the user tapping a recording button on the display interface, or by touching or pressing the display area where the target content is located; touch can include single-touch and multi-touch, and pressing can include long-pressing and tapping.
  • Step 204 Perform speech recognition on the audio data to obtain text information corresponding to the audio data.
  • the terminal acquires the audio data of the target content and inputs it into a pre-trained speech recognition model; the model's speech classification algorithm classifies the audio data to determine the speech types it contains, the speech recognition algorithm associated with each speech type is matched from the model, and each algorithm recognizes its corresponding audio data to obtain the text information corresponding to the audio data.
  • for example, the acquired audio data includes audio in different languages such as Chinese, English, and German; the speech classification algorithm in the speech recognition model classifies the audio by language, and the Chinese, English, and German recognition algorithms then recognize their corresponding audio respectively to obtain the corresponding text information.
  • optionally, before speech recognition, a noise reduction algorithm is applied to the audio data corresponding to the target content: for example, the noise can first be cancelled with a sound of the same frequency and amplitude but opposite phase, after which a de-reverberation audio plug-in or a microphone array removes the reverberation. Noise reduction algorithms may include adaptive filters, spectral subtraction, Wiener filtering, and the like. Denoising the audio before recognition removes invalid audio data and improves the accuracy of the recognition results.
  • Step 206 Process the text information through a preset number of trained machine learning models to obtain corresponding candidate text summaries respectively.
  • the preset number is the number of machine learning models preset for processing text information.
  • the preset number can be 5, 6, 8, etc.
  • Each of the preset number of trained machine learning models has different initial weights, model training iterations, hyperparameters, and learning rates.
  • A text summary is a generalization of the text information in the form of words and/or phrases.
  • the terminal respectively inputs the text information into a preset number of trained machine learning models, processes the text information through the machine learning model, and outputs candidate text summaries matching the text information. For example, by processing the text information through a preset number of K trained machine learning models, K candidate text summaries are obtained.
  • Step 208 Display each candidate text abstract in a preset format.
  • the preset format refers to a preset display format.
  • the preset format can be a list: the candidate text summaries are shown on the terminal display interface in a display box that can be maximized and minimized, with the candidates expanded as a text list.
  • the candidates can also be displayed folded on the display interface: a display label is generated for each candidate text summary, and when a viewing instruction for a label is triggered on the display interface, the text summary corresponding to that label is shown in a display box on the terminal's display interface.
  • after obtaining the candidate text summaries output by the machine learning models, the terminal responds to a display instruction triggered by the user on the display interface; the instruction carries a preset format type, and the candidate text summaries are displayed on the terminal's display interface in that format.
  • FIG. 3(a) is a schematic diagram of candidate text summaries displayed in list form: the left display area of the display interface shows the target content, and the right display area shows the candidate text summaries for it.
  • Step 210 Obtain the target text abstract determined from the candidate text abstracts, and associate the target text abstract with the target content.
  • the terminal responds to the selection instruction input by the user, determines the target text summary from the candidate text summaries according to the instruction, establishes a mapping relationship between the target text summary and the target content, and associates the two through that mapping.
  • the terminal receives an abstract editing instruction triggered by the display interface; according to the abstract editing instruction, the corresponding candidate text abstracts in the candidate text abstracts are edited to obtain the target text abstract, and the target text abstract is associated with the target content.
  • in the above summary recording method, the terminal receives the audio data corresponding to the target content on the terminal display interface; performs speech recognition on the audio data to obtain the corresponding text information; processes the text information through a preset number of trained machine learning models to obtain corresponding candidate text summaries; displays each candidate in a preset format; and acquires the target text summary determined from the candidates, associating it with the target content.
  • by recognizing and processing the audio data, the target text summary corresponding to the target content is obtained from multiple candidate text summaries, which avoids the incomplete and inaccurate records caused by handwriting and improves the accuracy of summary records.
  • a step for updating a machine learning model in a summary record is provided, and the method is applied to the terminal in FIG. 1 as an example to illustrate, including the following steps:
  • Step 402 Obtain the text digest to be edited determined from the candidate text digests.
  • Step 404 Receive a summary editing instruction triggered by the display interface.
  • the editing instruction can be used to modify or delete a candidate text summary.
  • Editing instructions include deletion instructions, modification instructions, and the like.
  • the summary editing instruction can be triggered by the user clicking the editing button on the display interface.
  • Step 406 edit the text digest to be edited according to the digest editing instruction, obtain a target text digest, and associate the target text digest with the target content.
  • the summary editing instruction carries a text summary identifier; the terminal edits the candidate text summary corresponding to that identifier according to the instruction, takes the edited candidate as the target text summary, and associates it with the target content.
  • Step 408 input the target text abstract associated with the target content into the machine learning model, update the machine learning model, and obtain the updated machine learning model.
  • the machine learning model is a model composed of an encoder and a decoder based on the attention mechanism.
  • the target text abstract associated with the target content is encoded by the encoder, and the encoded target text abstract is used as input to train the machine learning model.
  • gradient descent can also be used to update the machine learning model.
  • in the above machine learning model updating steps, the terminal receives the summary editing instruction triggered on the display interface; edits the corresponding candidate text summary according to the instruction, determining the target text summary from the candidates and associating it with the target content; and inputs the associated target text summary into the machine learning model to update it and obtain the updated model.
  • the machine learning model is continuously optimized according to the target text abstract, so as to improve the accuracy of the text information processing result of the machine learning model.
  • a summary recording method is provided, and the method is applied to the terminal in FIG. 1 as an example for description, including the following steps:
  • Step 502 Receive a content confirmation instruction triggered on the display interface.
  • the content confirmation instruction is used to determine the target content on the display interface; the content confirmation instruction can be triggered and generated by a user's sliding operation or click operation on the display interface. For example, the user can click or swipe on the display interface with a finger or a stylus to determine the target area.
  • Step 504 Determine the target content from the display interface according to the content confirmation instruction.
  • the terminal responds to the content confirmation instruction, determines a target area on the display interface according to the content confirmation instruction, and acquires the corresponding target content from the target area.
  • Step 506 in response to the recording instruction for the target content, obtain audio data corresponding to the target content.
  • the recording instruction carries the recording duration.
  • the recording instruction also carries a speaker identification; the speaker identification is used to distinguish different speakers.
  • Speaker ID can be a string of numbers or letters.
  • the user can select the speaker to be recorded on the display interface.
  • the terminal display interface displays the target content and speakers.
  • the number of speakers can be 1, 2, 3, etc., such as speaker 1 and speaker n displayed on the display interface.
  • Step 508 Determine whether the recording duration is greater than the preset recording duration; when the recording duration is less than or equal to the preset recording duration, go to Step 510; otherwise, go to Step 518.
  • in one embodiment, when the recording duration is less than or equal to the preset recording duration, the number of sentences in the audio data is obtained; when the number of sentences is less than or equal to the number threshold, speech recognition is performed on the audio data to obtain the corresponding text information.
  • the number threshold is the maximum amount of audio data that the preset speech recognition model can recognize.
  • specifically, before speech recognition, both the recording duration of the audio data and the number of sentences it contains are checked; when the recording duration is less than or equal to the preset recording duration and the number of sentences is less than or equal to the number threshold, the preset speech recognition model in the terminal performs speech recognition on the audio data to obtain the text information, which ensures the accuracy and completeness of the recognized text.
  • Step 510 Perform speech recognition on the audio data to obtain text information corresponding to the audio data.
  • Step 512 Display each candidate text summary on the terminal in a preset format.
  • displaying each candidate text summary on the terminal in a preset format includes either of the following forms: expanding each candidate text summary in the display area of the terminal in the form of a display box; or generating a display label corresponding to each candidate text summary and, through the labels, displaying the candidates folded in the display area of the terminal.
  • Step 514 Obtain the target text abstract determined from the candidate text abstracts, and associate the target text abstract with the target content.
  • acquiring the text summary determined from the candidates and associating it with the target content includes: acquiring the to-be-edited text summary determined from the candidates; receiving the summary editing instruction triggered for it; and editing it according to the instruction to obtain the target text summary, which is then associated with the target content. Associating the target text summary with the target content makes the records more efficient and convenient for the user to review.
  • Step 516 Input the text abstract associated with the target content into the machine learning model, update the machine learning model, and obtain an updated preset machine learning model.
  • Step 518 displaying abnormal information.
  • the abnormal information is used to indicate that the audio data is abnormal, that is, the preset speech recognition model cannot perform speech recognition on the audio data.
  • the following is an application scenario of the summary recording method, as shown in FIG. 7. The terminal receives the content confirmation instruction triggered on the display interface, determines the target content from the display interface accordingly, and responds to the recording instruction generated when the user taps the recording button; the recording instruction carries a recording duration from T-N to T+N seconds, yielding the audio data for that interval.
  • the audio data is sent to the server, where the preset speech recognition model performs speech recognition to obtain the corresponding text information; the text information is input into the K trained machine learning models, which process it to produce K candidate text summaries; the K candidates are sent back to the terminal and displayed on its display interface.
  • the terminal receives the summary editing instruction input by the user, edits the chosen candidate accordingly to obtain the target text summary, and associates it with the target content; the associated target content and target text summary are used as training samples to retrain the machine learning models and obtain updated models, where the association may be a mapping relationship established between the target content and the target text summary.
  • the audio data of the target content is obtained through the terminal and processed by the machine learning models to obtain the target text summary corresponding to the target content; this requires no manual note-taking by the user, reduces the time the user spends recording, and improves both recording efficiency and recording accuracy.
  • in the above summary recording method, the content confirmation instruction triggered on the display interface is received, and the target content is determined from the display interface accordingly; in response to the recording instruction for the target content, which carries the recording duration, the corresponding audio data is obtained; whether the recording duration exceeds the preset recording duration is judged, and if it does, abnormality information is displayed; otherwise, speech recognition is performed on the audio data to obtain the corresponding text information; each candidate text summary is displayed on the terminal in the preset format, the target text summary determined from the candidates is acquired and associated with the target content; and the text summary associated with the target content is input into the preset machine learning model, which is updated to obtain the updated preset machine learning model.
  • it should be understood that although the steps in the flowcharts of FIGS. 2 and 4-5 are displayed in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, their execution is not strictly limited to this order, and they may be performed in other orders. Moreover, at least some of the steps in FIGS. 2 and 4-5 may comprise multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times; their order of execution is likewise not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
  • a summary recording apparatus including: a receiving module 802, a speech recognition module 804, a processing module 806, a display module 808 and an association module 810, wherein:
  • a receiving module 802 configured to receive audio data corresponding to the target content on the display interface
  • a speech recognition module 804 configured to perform speech recognition on the audio data to obtain text information corresponding to the audio data;
  • the processing module 806 is configured to process the text information through a preset number of trained machine learning models to obtain corresponding candidate text summaries respectively;
  • a display module 808, configured to display each candidate text summary on the terminal in a preset format
  • the association module 810 is configured to obtain the target text abstract determined from the candidate text abstracts, and associate the target text abstract with the target content.
  • in the above summary recording apparatus, the receiving module 802 in the terminal receives the audio data corresponding to the target content on the display interface; the speech recognition module 804 performs speech recognition on it to obtain the corresponding text information; the processing module 806 processes the text information through a preset number of trained machine learning models to obtain corresponding candidate text summaries; the display module 808 displays each candidate on the terminal in a preset format; and the association module 810 acquires the target text summary determined from the candidates and associates it with the target content.
  • by recognizing and processing the audio data, multiple text summaries are provided to the user, who obtains the target text summary corresponding to the target content from the candidates; this avoids the incomplete and inaccurate records caused by handwriting and improves the accuracy of summary records.
  • a summary recording apparatus which, in addition to the receiving module 802, the speech recognition module 804, the processing module 806, the display module 808 and the association module 810, further includes a response module 812, a judgment module 814 and an update module 816, wherein:
  • the receiving module 802 is further configured to receive a content confirmation instruction triggered on the display interface; and determine the target content from the display interface according to the content confirmation instruction.
  • the receiving module 802 is further configured to receive a summary editing instruction triggered by the display interface.
  • the response module 812 is configured to respond to the recording instruction of the target content to obtain audio data corresponding to the target content.
  • the display module 808 is further configured to expand each candidate text summary in the display area of the terminal in the form of a display box, and to generate a display label corresponding to each candidate and display the candidates folded in the display area through the labels.
  • the judgment module 814 is configured to judge whether the recording duration is greater than the preset recording duration, and when it is less than or equal to the preset duration, to execute the step of performing speech recognition on the audio data to obtain the corresponding text information.
  • the judgment module 814 is further configured to obtain the number of sentences in the audio data when the recording duration is less than or equal to the preset recording duration; when the number of sentences is less than or equal to the number threshold, perform speech recognition on the audio data to obtain Text information corresponding to audio data.
  • the association module 810 is further configured to edit the corresponding candidate text abstracts in the candidate text abstracts according to the abstract editing instruction to obtain a target text abstract, and associate the target text abstract with the target content.
  • the updating module 816 is configured to input the target text abstract associated with the target content into the machine learning model, update the machine learning model, and obtain the updated machine learning model.
  • the above summary recording apparatus receives the content confirmation instruction triggered on the display interface and determines the target content from the display interface accordingly; responds to the recording instruction for the target content, which carries the recording duration, to obtain the corresponding audio data; judges whether the recording duration exceeds the preset recording duration, displaying abnormality information if it does and otherwise performing speech recognition on the audio data to obtain the corresponding text information; displays the candidate text summaries on the terminal in a preset format, acquires the target text summary determined from the candidates, and associates it with the target content; and inputs the text summary associated with the target content into the preset machine learning model, which is updated to obtain the updated preset machine learning model.
  • Each module in the above-mentioned abstract recording apparatus can be implemented in whole or in part by software, hardware and combinations thereof.
  • the above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • a computer device is provided, and the computer device may be a terminal, and its internal structure diagram may be as shown in FIG. 10 .
  • the computer equipment includes a processor, memory, a communication interface, a display screen, and an input device connected by a system bus.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium, an internal memory.
  • the nonvolatile storage medium stores an operating system and a computer program.
  • the internal memory provides an environment for the execution of the operating system and computer programs in the non-volatile storage medium.
  • the communication interface of the computer device is used for wired or wireless communication with an external terminal, and the wireless communication can be realized by WIFI, operator network, NFC (Near Field Communication) or other technologies.
  • the computer program when executed by the processor, implements a summary recording method.
  • the display screen of the computer device may be a liquid crystal display or an electronic-ink display, and the input device may be a touch layer covering the display screen, a button, trackball, or touchpad on the housing of the computer device, or an external keyboard, touchpad, or mouse.
  • FIG. 10 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer equipment to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • a computer device including a memory and a processor, a computer program is stored in the memory, and the processor implements the following steps when executing the computer program:
  • the text information is processed by a preset number of trained machine learning models, and the corresponding candidate text summaries are obtained respectively;
  • the processor further implements the following steps when executing the computer program:
  • the processor further implements the following steps when executing the computer program:
  • the step of performing speech recognition on the audio data to obtain text information corresponding to the audio data is performed.
  • the processor further implements the following steps when executing the computer program:
  • a display label corresponding to each candidate text abstract is generated, and each candidate text abstract is folded and displayed in the display area of the terminal through the display label.
  • the processor further implements the following steps when executing the computer program:
  • Edit the text summary to be edited according to the summary editing instruction to obtain the target text summary, and associate the target text summary with the target content.
  • the processor further implements the following steps when executing the computer program:
  • speech recognition is performed on the audio data to obtain text information corresponding to the audio data.
  • the processor further implements the following steps when executing the computer program:
  • the target text abstract associated with the target content is input into the machine learning model, and the machine learning model is updated to obtain the updated machine learning model.
  • a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the following steps are implemented:
  • the text information is processed by a preset number of trained machine learning models, and the corresponding candidate text summaries are obtained respectively;
  • the computer program further implements the following steps when executed by the processor:
  • the computer program further implements the following steps when executed by the processor:
  • the step of performing speech recognition on the audio data to obtain text information corresponding to the audio data is performed.
  • the computer program further implements the following steps when executed by the processor:
  • a display label corresponding to each candidate text abstract is generated, and each candidate text abstract is folded and displayed in the display area of the terminal through the display label.
  • the computer program further implements the following steps when executed by the processor:
  • Edit the text summary to be edited according to the summary editing instruction to obtain the target text summary, and associate the target text summary with the target content.
  • the computer program further implements the following steps when executed by the processor:
  • speech recognition is performed on the audio data to obtain text information corresponding to the audio data.
  • the computer program further implements the following steps when executed by the processor:
  • the target text abstract associated with the target content is input into the machine learning model, and the machine learning model is updated to obtain the updated machine learning model.
  • Non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory, or optical memory, and the like.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • the RAM may be in various forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • User Interface Of Digital Computer (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to a summary recording method and apparatus, a computer device, and a storage medium. The method includes: receiving audio data corresponding to target content on a display interface; performing speech recognition on the audio data to obtain corresponding text information; processing the text information through a preset number of trained machine learning models to obtain corresponding candidate text summaries; displaying each candidate text summary on a terminal in a preset format; and acquiring a target text summary determined from the candidates and associating it with the target content. The method can improve the accuracy of summary records.

Description

SUMMARY RECORDING METHOD AND APPARATUS, COMPUTER DEVICE AND STORAGE MEDIUM
This application claims priority to Chinese patent application No. 202010830779.5, filed with the China National Intellectual Property Administration on August 18, 2020 and entitled "Summary recording method and apparatus, computer device and storage medium", the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
The present application relates to the technical field of artificial intelligence, and in particular to a summary recording method and apparatus, a computer device, and a storage medium.
BACKGROUND
With the development of electronic technology, the functions of mobile terminals have become increasingly complete, and users' demands on them keep growing. When attending training courses, meetings of social activities, and similar events, users need to record the learning or meeting content.
At present, users generally record content by manually editing text on a mobile terminal, or have a voice recognition device recognize and store the content to be recorded; however, these recording methods suffer from low recording accuracy.
SUMMARY
Based on this, it is necessary to provide, in view of the above technical problem, a summary recording method and apparatus, a computer device, and a storage medium capable of improving recording accuracy.
A summary recording method, the method comprising:
receiving audio data corresponding to target content on a display interface;
performing speech recognition on the audio data to obtain text information corresponding to the audio data;
processing the text information through a preset number of trained machine learning models to obtain corresponding candidate text summaries respectively;
displaying each of the candidate text summaries in a preset format; and
acquiring a text summary determined from the candidate text summaries, and associating the text summary with the target content.
In one embodiment, receiving the audio data corresponding to the target content on the display interface comprises:
receiving a content confirmation instruction triggered on the display interface;
determining the target content from the display interface according to the content confirmation instruction; and
responding to a recording instruction for the target content to obtain the audio data corresponding to the target content.
In one embodiment, the recording instruction carries a recording duration, and before performing speech recognition on the audio data to obtain the corresponding text information, the method further comprises:
judging whether the recording duration is greater than a preset recording duration; and
when the recording duration is less than or equal to the preset recording duration, executing the step of performing speech recognition on the audio data to obtain the corresponding text information.
In one embodiment, displaying each of the candidate text summaries on the terminal in the preset format comprises either of the following forms:
expanding each of the candidate text summaries in a display area of the terminal in the form of a display box; or
generating a display label corresponding to each of the candidate text summaries, and displaying the candidate text summaries folded in the display area of the terminal through the display labels.
In one embodiment, acquiring the target text summary determined from the candidate text summaries and associating it with the target content comprises:
acquiring a to-be-edited text summary determined from the candidate text summaries;
receiving a summary editing instruction triggered for the to-be-edited text summary; and
editing the to-be-edited text summary according to the summary editing instruction to obtain the target text summary, and associating the target text summary with the target content.
In one embodiment, the method further comprises:
when the recording duration is less than or equal to the preset recording duration, acquiring the number of sentences in the audio data; and
when the number of sentences is less than or equal to a number threshold, performing speech recognition on the audio data to obtain the corresponding text information.
In one embodiment, the method comprises:
inputting the target text summary associated with the target content into the machine learning model, and updating the machine learning model to obtain an updated machine learning model.
A summary recording apparatus, the apparatus comprising:
a receiving module for receiving audio data corresponding to target content on a display interface;
a speech recognition module for performing speech recognition on the audio data to obtain the corresponding text information;
a processing module for processing the text information through a preset number of trained machine learning models to obtain corresponding candidate text summaries respectively;
a display module for displaying each of the candidate text summaries on a terminal in a preset format; and
an association module for acquiring a target text summary determined from the candidate text summaries and associating it with the target content.
A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the following steps: receiving the audio data corresponding to the target content on the display interface; performing speech recognition on the audio data to obtain the corresponding text information; processing the text information through a preset number of trained machine learning models to obtain corresponding candidate text summaries; displaying each candidate text summary on the terminal in a preset format; and acquiring the target text summary determined from the candidates and associating it with the target content.
A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the same steps.
In the above summary recording method and apparatus, computer device, and storage medium, the text information corresponding to the audio data is obtained by recognizing the audio data of the target content; the text information is processed by the preset number of trained machine learning models to obtain the text summary each model produces from it; and by being offered multiple text summaries of the audio data, the user can select the most accurate one among them, which improves recording accuracy.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a diagram of the application environment of the summary recording method in one embodiment;
FIG. 2 is a schematic flowchart of the summary recording method in one embodiment;
FIG. 3(a) is a schematic diagram of the display of candidate text summaries in one embodiment, and FIG. 3(b) is a schematic diagram of the display after the target content is associated with the target text summary in one embodiment;
FIG. 4 is a schematic flowchart of the machine learning model updating steps in summary recording in one embodiment;
FIG. 5 is a schematic flowchart of the summary recording method in another embodiment;
FIG. 6 is a schematic diagram of an application scenario of the summary recording method in one embodiment;
FIG. 7 is a schematic diagram of an application scenario of the summary recording method in another embodiment;
FIG. 8 is a structural block diagram of the summary recording apparatus in one embodiment;
FIG. 9 is a structural block diagram of the summary recording apparatus in another embodiment;
FIG. 10 is a diagram of the internal structure of a computer device in one embodiment.
DETAILED DESCRIPTION
To make the objectives, technical solutions, and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present application, not to limit it.
The summary recording method provided in this application can be applied in the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 through a network. The server receives the audio data corresponding to the target content on the terminal's display interface; performs speech recognition on the audio data to obtain the corresponding text information; processes the text information through a preset number of trained machine learning models to obtain corresponding candidate text summaries; displays each candidate on the terminal in a preset format; and acquires the text summary determined from the candidates, associating it with the target content. The terminal 102 can be, but is not limited to, various personal computers, notebook computers, smartphones, and tablet computers, and the server 104 can be implemented as an independent server or as a server cluster composed of multiple servers.
In one embodiment, as shown in FIG. 2, a summary recording method is provided. The method is described taking its application to the terminal in FIG. 1 as an example, and includes the following steps:
Step 202: Receive the audio data corresponding to the target content on the display interface.
The target content is the content displayed on the terminal's display interface. For example, during online teaching the target content on the display interface is the course content data; as another example, when several people in an enterprise hold a video conference, the target content displayed is the conference data. The audio data is the speech of a speaker describing the target content; in online teaching, for instance, it is the speech of the speaker explaining the course content. The audio data can be obtained from the server or recorded through the terminal's microphone; the way of obtaining it is not limited here.
Specifically, when the terminal detects that the microphone is in monitoring mode, it responds to the recording instruction for the target content triggered on the display interface and obtains the audio data of the target content on the display interface. The recording instruction can be triggered by the user tapping a recording button on the display interface, or by touching or pressing the display area where the target content is located; touch can include single-touch and multi-touch, and pressing can include long-pressing and tapping.
Step 204: Perform speech recognition on the audio data to obtain the corresponding text information.
Specifically, the terminal acquires the audio data of the target content and inputs it into a pre-trained speech recognition model; the model's speech classification algorithm classifies the audio data to determine the speech types it contains, the speech recognition algorithm associated with each speech type is matched from the model, and each algorithm recognizes its corresponding audio data to obtain the text information. For example, if the acquired audio contains Chinese, English, and German speech, the classification algorithm separates the audio by language, and the Chinese, English, and German recognition algorithms then recognize their respective audio to obtain the corresponding text information.
Optionally, before speech recognition, a noise reduction algorithm is applied to the audio data corresponding to the target content: for example, the noise can first be cancelled with a sound of the same frequency and amplitude but opposite phase, after which a de-reverberation audio plug-in or a microphone array removes the reverberation. Noise reduction algorithms may include adaptive filters, spectral subtraction, Wiener filtering, and the like. Denoising the audio before recognition removes invalid audio data and improves the accuracy of the recognition results.
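Spectral subtraction, one of the noise reduction algorithms named above, can be illustrated with the following sketch; the parameters (noise estimated from the first half second, 512-sample frames) are illustrative assumptions rather than values from this application:

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(x: np.ndarray, fs: int,
                         noise_secs: float = 0.5, nperseg: int = 512) -> np.ndarray:
    """Estimate the noise magnitude spectrum from the leading (assumed
    speech-free) frames and subtract it from every frame."""
    _, _, Z = stft(x, fs=fs, nperseg=nperseg)
    hop = nperseg // 2                                   # default 50% overlap
    noise_frames = max(1, int(noise_secs * fs / hop))
    noise_mag = np.abs(Z[:, :noise_frames]).mean(axis=1, keepdims=True)
    mag = np.maximum(np.abs(Z) - noise_mag, 0.0)         # floor at zero
    Z_clean = mag * np.exp(1j * np.angle(Z))             # keep the noisy phase
    _, y = istft(Z_clean, fs=fs, nperseg=nperseg)
    return y
```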
Step 206: Process the text information through a preset number of trained machine learning models to obtain corresponding candidate text summaries.
The preset number is the number of machine learning models configured in advance to process the text information, for example 5, 6, or 8. Each of the preset number of trained models differs in initial weights, number of training iterations, hyperparameters, and learning rate. A text summary is a generalization of the text information in the form of words and/or phrases.
Specifically, the terminal inputs the text information into each of the preset number of trained machine learning models, which process it and output candidate text summaries matching the text; processing the text through K trained models, for instance, yields K candidate summaries.
Step 208: Display each candidate text summary in a preset format.
The preset format is a display format set in advance. It can be a list: the candidate text summaries are shown on the terminal display interface in a display box that can be maximized and minimized, with the candidates expanded as a text list. The candidates can also be displayed folded: a display label is generated for each candidate, and when a viewing instruction for a label is triggered on the display interface, the summary corresponding to that label is shown in a display box on the terminal's display interface.
Specifically, after obtaining the candidate text summaries output by the machine learning models, the terminal responds to a display instruction triggered by the user on the display interface; the instruction carries a preset format type, and the candidates are displayed on the terminal in that format. FIG. 3(a) is a schematic diagram of candidates displayed as a list: the left display area of the display interface shows the target content, and the right display area shows its candidate text summaries.
Step 210: Acquire the target text summary determined from the candidate text summaries, and associate it with the target content.
Specifically, the terminal responds to a selection instruction input by the user, determines the target text summary from the candidates according to the instruction, establishes a mapping relationship between the target text summary and the target content, and associates the two through that mapping. Optionally, the terminal receives a summary editing instruction triggered on the display interface, edits the corresponding candidate according to the instruction to obtain the target text summary, and associates it with the target content.
In the above summary recording method, the terminal receives the audio data corresponding to the target content on its display interface; performs speech recognition to obtain the corresponding text information; processes the text through a preset number of trained machine learning models to obtain candidate text summaries; displays each candidate in a preset format on the terminal; and acquires the target text summary determined from the candidates, associating it with the target content. By recognizing and processing the audio data, the target text summary is obtained from multiple candidates, which avoids the incomplete and inaccurate records caused by handwriting and improves the accuracy of summary records.
In one embodiment, as shown in FIG. 4, machine learning model updating steps in summary recording are provided. Taking the application of the method to the terminal in FIG. 1 as an example, they include the following steps:
Step 402: Acquire the to-be-edited text summary determined from the candidate text summaries.
Step 404: Receive the summary editing instruction triggered on the display interface.
The editing instruction can be used to modify or delete a candidate text summary, and includes deletion instructions, modification instructions, and the like. It can be triggered by the user tapping an edit button on the display interface.
Step 406: Edit the to-be-edited text summary according to the summary editing instruction to obtain the target text summary, and associate it with the target content.
Specifically, the summary editing instruction carries a text summary identifier; the terminal edits the candidate corresponding to that identifier according to the instruction, takes the edited candidate as the target text summary, and associates it with the target content.
Step 408: Input the target text summary associated with the target content into the machine learning model, and update the model to obtain the updated machine learning model.
The machine learning model consists of an encoder and a decoder based on the attention mechanism. The encoder encodes the target text summary associated with the target content, and the encoded summary is used as input to train the model. Optionally, gradient descent can also be used to update the model.
In the above model updating steps, the terminal receives the summary editing instruction triggered on the display interface; edits the corresponding candidate according to the instruction, determining the target text summary from the candidates and associating it with the target content; and inputs the associated target text summary into the machine learning model to update it and obtain the updated model. Continuously optimizing the model according to the target text summaries improves the accuracy of its text processing results.
In another embodiment, as shown in FIG. 5, a summary recording method is provided. Taking the application of the method to the terminal in FIG. 1 as an example, it includes the following steps:
Step 502: Receive the content confirmation instruction triggered on the display interface.
The content confirmation instruction is used to determine the target content on the display interface; it can be triggered by the user's sliding or clicking operation, for example clicking or swiping on the display interface with a finger or stylus to determine the target area.
Step 504: Determine the target content from the display interface according to the content confirmation instruction.
Specifically, the terminal responds to the content confirmation instruction, determines the target area on the display interface accordingly, and acquires the corresponding target content from that area.
Step 506: In response to the recording instruction for the target content, obtain the corresponding audio data.
In one embodiment, the recording instruction carries the recording duration. Optionally, it also carries a speaker identifier used to distinguish different speakers; the identifier can be a string of digits or letters. In a video-conference scenario with multiple participants, the user can select the speaker to be recorded on the display interface; as shown in FIG. 6, the terminal displays the target content and the speakers, whose number can be 1, 2, 3, and so on, such as speaker 1 and speaker n shown on the interface.
Step 508: Judge whether the recording duration is greater than the preset recording duration; when it is less than or equal to the preset duration, go to Step 510; otherwise, go to Step 518.
In one embodiment, when the recording duration is less than or equal to the preset recording duration, the number of sentences in the audio data is acquired; when the number of sentences is less than or equal to a number threshold, speech recognition is performed on the audio data to obtain the corresponding text information.
The number threshold is the maximum amount of audio data that the preset speech recognition model can recognize.
Specifically, before speech recognition, both the recording duration and the number of sentences in the audio data are checked; when the duration is less than or equal to the preset duration and the sentence count is less than or equal to the threshold, the preset speech recognition model in the terminal recognizes the audio and obtains the text information, which ensures the accuracy and completeness of the recognized text.
Step 510: Perform speech recognition on the audio data to obtain the corresponding text information.
Step 512: Display each candidate text summary on the terminal in a preset format.
In one embodiment, this includes either of the following forms: expanding each candidate text summary in the terminal's display area in the form of a display box; or generating a display label for each candidate and, through the labels, displaying the candidates folded in the display area.
Step 514: Acquire the target text summary determined from the candidates, and associate it with the target content.
In one embodiment, this includes: acquiring the to-be-edited text summary determined from the candidates; receiving the summary editing instruction triggered for it; and editing it according to the instruction to obtain the target text summary, which is associated with the target content. Associating the target text summary with the target content makes the records more efficient and convenient for the user to review.
Step 516: Input the text summary associated with the target content into the machine learning model, and update it to obtain the updated preset machine learning model.
Step 518: Display abnormality information.
The abnormality information indicates that the audio data is abnormal, i.e., that the preset speech recognition model cannot perform speech recognition on it.
The following is an application scenario of the summary recording method, as shown in FIG. 7.
The terminal receives the content confirmation instruction triggered on the display interface, determines the target content accordingly, and responds to the recording instruction generated when the user taps the recording button; the instruction carries a recording duration from T-N to T+N seconds, yielding the audio data for that interval. The audio data is sent to the server, where the preset speech recognition model recognizes it to obtain the corresponding text information; the text is input into the K trained machine learning models, which process it to produce K candidate text summaries; the K candidates are sent to the terminal and displayed on its display interface.
The terminal receives the user's summary editing instruction, edits the candidate accordingly to obtain the target text summary, and associates it with the target content; the associated target content and target text summary are used as training samples to retrain the machine learning models and obtain updated models, where the association may be a mapping relationship established between the target content and the target text summary. Acquiring the audio data of the target content through the terminal and processing it with the machine learning models yields the target text summary without manual note-taking, which reduces the time the user spends recording and improves recording efficiency and accuracy.
In the above summary recording method, the content confirmation instruction triggered on the display interface is received and the target content is determined accordingly; the recording instruction for the target content, which carries the recording duration, is responded to and the corresponding audio data obtained; whether the recording duration exceeds the preset duration is judged, abnormality information being displayed if it does, and speech recognition being performed otherwise to obtain the corresponding text information; the candidate text summaries are displayed on the terminal in the preset format, the target text summary determined from them is acquired and associated with the target content; and the associated text summary is input into the preset machine learning model, which is updated to obtain the updated model. Providing the user with multiple text summaries of the audio data lets the user select the most accurate one, and continuously optimizing the model with the target text summary improves the accuracy of both the model's processing results and the summary records.
It should be understood that although the steps in the flowcharts of FIGS. 2 and 4-5 are displayed in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, their execution is not strictly limited to this order, and they may be performed in other orders. Moreover, at least some of the steps in FIGS. 2 and 4-5 may comprise multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times; their order of execution is likewise not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 8, a summary recording apparatus is provided, including a receiving module 802, a speech recognition module 804, a processing module 806, a display module 808, and an association module 810, wherein:
the receiving module 802 is configured to receive the audio data corresponding to the target content on the display interface;
the speech recognition module 804 is configured to perform speech recognition on the audio data to obtain the corresponding text information;
the processing module 806 is configured to process the text information through a preset number of trained machine learning models to obtain corresponding candidate text summaries;
the display module 808 is configured to display each candidate text summary on the terminal in a preset format; and
the association module 810 is configured to acquire the target text summary determined from the candidates and associate it with the target content.
In the above apparatus, the receiving module 802 receives the audio data corresponding to the target content; the speech recognition module 804 recognizes it to obtain the corresponding text information; the processing module 806 processes the text through the preset number of trained models to obtain the candidate text summaries; the display module 808 displays each candidate in a preset format; and the association module 810 acquires the target text summary determined from the candidates and associates it with the target content. By recognizing and processing the audio data, multiple summaries are provided to the user, who obtains the target summary corresponding to the target content from the candidates; this avoids the incomplete and inaccurate records caused by handwriting and improves the accuracy of summary records.
In another embodiment, as shown in FIG. 9, a summary recording apparatus is provided which, in addition to the receiving module 802, the speech recognition module 804, the processing module 806, the display module 808, and the association module 810, further includes a response module 812, a judgment module 814, and an update module 816, wherein:
in one embodiment, the receiving module 802 is further configured to receive the content confirmation instruction triggered on the display interface and to determine the target content from the display interface accordingly;
in one embodiment, the receiving module 802 is further configured to receive the summary editing instruction triggered on the display interface;
the response module 812 is configured to respond to the recording instruction for the target content and obtain the corresponding audio data;
in one embodiment, the display module 808 is further configured to expand each candidate text summary in the terminal's display area in the form of a display box, and to generate a display label for each candidate and display the candidates folded in the display area through the labels;
the judgment module 814 is configured to judge whether the recording duration is greater than the preset recording duration, and when it is less than or equal to the preset duration, to execute the step of performing speech recognition on the audio data to obtain the corresponding text information;
in one embodiment, the judgment module 814 is further configured to acquire the number of sentences in the audio data when the recording duration is less than or equal to the preset duration, and to perform speech recognition on the audio data to obtain the corresponding text information when the sentence count is less than or equal to the number threshold;
in one embodiment, the association module 810 is further configured to edit the corresponding candidate according to the summary editing instruction to obtain the target text summary and associate it with the target content; and
the update module 816 is configured to input the target text summary associated with the target content into the machine learning model and update it to obtain the updated model.
In one embodiment, the above apparatus receives the content confirmation instruction triggered on the display interface and determines the target content accordingly; responds to the recording instruction for the target content, which carries the recording duration, to obtain the corresponding audio data; judges whether the recording duration exceeds the preset duration, displaying abnormality information if it does and otherwise recognizing the audio to obtain the corresponding text information; displays the candidates in a preset format, acquires the target text summary determined from them, and associates it with the target content; and inputs the associated text summary into the preset machine learning model, which is updated to obtain the updated model. Providing multiple text summaries lets the user select the most accurate one, and continuously optimizing the model with the target summary improves the accuracy of both the model's results and the summary records.
For the specific limitations of the summary recording apparatus, reference may be made to the limitations of the summary recording method above, which are not repeated here. Each module in the apparatus can be implemented in whole or in part by software, hardware, or a combination thereof; the modules can be embedded in or independent of the processor of a computer device in hardware form, or stored in the memory of the device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided; it may be a terminal, and its internal structure may be as shown in FIG. 10. The device includes a processor, a memory, a communication interface, a display screen, and an input apparatus connected through a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running them. The communication interface communicates with external terminals in a wired or wireless manner, the wireless manner being realizable through Wi-Fi, a carrier network, NFC (near-field communication), or other technologies. When executed by the processor, the computer program implements a summary recording method. The display screen may be a liquid crystal display or an electronic-ink display, and the input apparatus may be a touch layer covering the display screen, a button, trackball, or touchpad on the housing, or an external keyboard, touchpad, or mouse.
Those skilled in the art will understand that the structure shown in FIG. 10 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a specific device may include more or fewer components than shown, combine certain components, or arrange them differently.
In one embodiment, a computer device is provided, including a memory and a processor, the memory storing a computer program. When executing the program, the processor implements the following steps: receiving the audio data corresponding to the target content on the display interface; performing speech recognition on the audio data to obtain the corresponding text information; processing the text information through a preset number of trained machine learning models to obtain corresponding candidate text summaries; displaying each candidate text summary on the terminal in a preset format; and acquiring the target text summary determined from the candidates and associating it with the target content.
In one embodiment, the processor further implements the following steps when executing the program: receiving the content confirmation instruction triggered on the display interface; determining the target content from the display interface according to the instruction; and responding to the recording instruction for the target content to obtain the corresponding audio data.
In one embodiment, the processor further implements: judging whether the recording duration is greater than the preset recording duration, and executing the step of performing speech recognition on the audio data to obtain the corresponding text information when the duration is less than or equal to the preset duration.
In one embodiment, the processor further implements displaying each candidate text summary on the terminal in either of the following forms: expanding the candidates in the terminal's display area in the form of display boxes; or generating a display label for each candidate and displaying the candidates folded in the display area through the labels.
In one embodiment, the processor further implements: acquiring the to-be-edited text summary determined from the candidates; receiving the summary editing instruction triggered for it; and editing it according to the instruction to obtain the target text summary, which is associated with the target content.
In one embodiment, the processor further implements: acquiring the number of sentences in the audio data when the recording duration is less than or equal to the preset duration; and performing speech recognition on the audio data to obtain the corresponding text information when the number of sentences is less than or equal to the number threshold.
In one embodiment, the processor further implements: inputting the target text summary associated with the target content into the machine learning model, and updating the model to obtain the updated machine learning model.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the program implements the same steps: receiving the audio data corresponding to the target content on the display interface; performing speech recognition to obtain the corresponding text information; processing the text through a preset number of trained machine learning models to obtain corresponding candidate text summaries; displaying each candidate on the terminal in a preset format; and acquiring the target text summary determined from the candidates and associating it with the target content. When executed by the processor, the program likewise implements the further steps of the embodiments above: receiving the content confirmation instruction and determining the target content from the display interface; responding to the recording instruction to obtain the audio data; executing speech recognition only when the recording duration does not exceed the preset recording duration; displaying the candidates either expanded in display boxes or folded behind display labels; editing the to-be-edited text summary according to the summary editing instruction to obtain the target text summary and associating it with the target content; gating recognition on the number of sentences when the duration check passes; and inputting the associated target text summary into the machine learning model to obtain the updated model.
Those of ordinary skill in the art will understand that all or part of the processes in the above method embodiments can be completed by instructing the relevant hardware through a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, or optical memory; volatile memory can include random access memory (RAM) or an external cache. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations are described, but as long as the combinations are not contradictory they should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their descriptions are specific and detailed, but they should not be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within its scope of protection. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (14)

  1. A summary recording method, comprising:
    receiving audio data corresponding to target content on a display interface;
    performing speech recognition on the audio data to obtain text information corresponding to the audio data;
    processing the text information through a preset number of trained machine learning models to obtain corresponding candidate text summaries respectively;
    displaying each of the candidate text summaries on a terminal in a preset format; and
    acquiring a target text summary determined from the candidate text summaries, and associating the target text summary with the target content.
  2. The method according to claim 1, wherein receiving the audio data corresponding to the target content on the display interface comprises:
    receiving a content confirmation instruction triggered on the display interface;
    determining the target content from the display interface according to the content confirmation instruction; and
    responding to a recording instruction for the target content to obtain the audio data corresponding to the target content.
  3. The method according to claim 2, wherein the recording instruction carries a recording duration, and before performing speech recognition on the audio data to obtain the corresponding text information, the method further comprises:
    judging whether the recording duration is greater than a preset recording duration; and
    when the recording duration is less than or equal to the preset recording duration, executing the step of performing speech recognition on the audio data to obtain the corresponding text information.
  4. The method according to claim 1, wherein displaying each of the candidate text summaries on the terminal in the preset format comprises either of the following forms:
    expanding each of the candidate text summaries in a display area of the terminal in the form of a display box; or
    generating a display label corresponding to each of the candidate text summaries, and displaying the candidate text summaries folded in the display area of the terminal through the display labels.
  5. The method according to claim 1, wherein acquiring the target text summary determined from the candidate text summaries and associating it with the target content comprises:
    acquiring a to-be-edited text summary determined from the candidate text summaries;
    receiving a summary editing instruction triggered for the to-be-edited text summary; and
    editing the to-be-edited text summary according to the summary editing instruction to obtain the target text summary, and associating the target text summary with the target content.
  6. The method according to claim 2, further comprising:
    when the recording duration is less than or equal to the preset recording duration, acquiring the number of sentences in the audio data; and
    when the number of sentences is less than or equal to a number threshold, performing speech recognition on the audio data to obtain the corresponding text information.
  7. The method according to claim 1, further comprising:
    inputting the target text summary associated with the target content into the machine learning models, and updating the machine learning models to obtain updated machine learning models.
  8. The method according to claim 1, further comprising: displaying abnormality information when the audio data cannot be recognized.
  9. The method according to claim 1, further comprising: before performing speech recognition on the audio data to obtain the corresponding text information, performing noise reduction on the audio data corresponding to the target content using a noise reduction algorithm.
  10. The method according to claim 1, wherein performing speech recognition on the audio data to obtain the corresponding text information comprises:
    inputting the audio data into a pre-trained speech recognition model;
    classifying the audio data through a speech classification algorithm of the speech recognition model to determine the speech types in the audio data;
    matching, from the speech recognition model, a speech recognition algorithm associated with each speech type; and
    recognizing the corresponding audio data through the speech recognition algorithm to obtain the text information corresponding to the audio data.
  11. The method according to claim 1, further comprising: before receiving the audio data corresponding to the target content on the display interface, obtaining the audio data from a server or through a microphone.
  12. A text summary generation apparatus, comprising:
    a receiving module configured to receive audio data corresponding to target content on a display interface;
    a speech recognition module configured to perform speech recognition on the audio data to obtain text information corresponding to the audio data;
    a processing module configured to process the text information through a preset number of trained machine learning models to obtain corresponding candidate text summaries respectively;
    a display module configured to display each of the candidate text summaries on a terminal in a preset format; and
    an association module configured to acquire a target text summary determined from the candidate text summaries and associate the target text summary with the target content.
  13. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 11 when executing the computer program.
  14. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 11.
PCT/CN2021/113206 2020-08-18 2021-08-18 Summary recording method and apparatus, computer device and storage medium WO2022037600A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010830779.5 2020-08-18
CN202010830779.5A CN114155860A (zh) Summary recording method and apparatus, computer device and storage medium

Publications (1)

Publication Number Publication Date
WO2022037600A1 true WO2022037600A1 (zh) 2022-02-24

Family

ID=80322579

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/113206 WO2022037600A1 (zh) 2020-08-18 2021-08-18 摘要记录方法、装置、计算机设备和存储介质

Country Status (2)

Country Link
CN (1) CN114155860A (zh)
WO (1) WO2022037600A1 (zh)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114818747A (zh) * 2022-04-21 2022-07-29 语联网(武汉)信息技术有限公司 Computer-aided translation method and system for speech sequences, and visualization terminal


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090177469A1 (en) * 2005-02-22 2009-07-09 Voice Perfect Systems Pty Ltd System for recording and analysing meetings
CN107168954A (zh) * 2017-05-18 2017-09-15 北京奇艺世纪科技有限公司 Text keyword generation method and apparatus, electronic device, and readable storage medium
CN108810446A (zh) * 2018-06-07 2018-11-13 北京智能管家科技有限公司 Label generation method, apparatus, device, and medium for video conferences
CN108847241A (zh) * 2018-06-07 2018-11-20 平安科技(深圳)有限公司 Method for recognizing conference speech as text, electronic device, and storage medium
CN109635103A (zh) * 2018-12-17 2019-04-16 北京百度网讯科技有限公司 Summary generation method and apparatus
CN110675864A (zh) * 2019-09-12 2020-01-10 上海依图信息技术有限公司 Speech recognition method and apparatus

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114690997A (zh) * 2022-04-15 2022-07-01 北京百度网讯科技有限公司 Text display method and apparatus, device, medium, and product
CN114690997B (zh) * 2022-04-15 2023-07-25 北京百度网讯科技有限公司 Text display method and apparatus, device, medium, and product
CN115334367A (zh) * 2022-07-11 2022-11-11 北京达佳互联信息技术有限公司 Method and apparatus for generating video summary information, server, and storage medium
CN115334367B (zh) * 2022-07-11 2023-10-17 北京达佳互联信息技术有限公司 Method and apparatus for generating video summary information, server, and storage medium
CN117786098A (zh) * 2024-02-26 2024-03-29 深圳波洛斯科技有限公司 Method and apparatus for extracting telephone recording summaries based on a multimodal large language model
CN117786098B (zh) * 2024-02-26 2024-05-07 深圳波洛斯科技有限公司 Method and apparatus for extracting telephone recording summaries based on a multimodal large language model

Also Published As

Publication number Publication date
CN114155860A (zh) 2022-03-08

Similar Documents

Publication Publication Date Title
WO2022037600A1 (zh) Summary recording method and apparatus, computer device and storage medium
US11573993B2 (en) Generating a meeting review document that includes links to the one or more documents reviewed
US11270060B2 (en) Generating suggested document edits from recorded media using artificial intelligence
US11080466B2 (en) Updating existing content suggestion to include suggestions from recorded media using artificial intelligence
CN110444198B (zh) Retrieval method and apparatus, computer device, and storage medium
US11263384B2 (en) Generating document edit requests for electronic documents managed by a third-party document management service using artificial intelligence
US10114809B2 (en) Method and apparatus for phonetically annotating text
CN107481720B (zh) Explicit voiceprint recognition method and apparatus
US11720741B2 (en) Artificial intelligence assisted review of electronic documents
US20220254348A1 (en) Automatically generating a meeting summary for an information handling system
US10270736B2 (en) Account adding method, terminal, server, and computer storage medium
US11392754B2 (en) Artificial intelligence assisted review of physical documents
US11950020B2 (en) Methods and apparatus for displaying, compressing and/or indexing information relating to a meeting
US20170011114A1 (en) Common data repository for improving transactional efficiencies of user interactions with a computing device
US20190348063A1 (en) Real-time conversation analysis system
US10282417B2 (en) Conversational list management
CN117520498A (zh) Virtual-digital-human-based interaction processing method, system, terminal, device, and medium
CN107066864A (zh) Application icon display method and device therefor
US20230138820A1 (en) Real-time name mispronunciation detection
WO2021159756A1 (zh) Multimodality-based response obligation detection method, system, and apparatus
US20230153061A1 (en) Hierarchical Context Specific Actions from Ambient Speech
US11714970B2 (en) Systems and methods for detecting deception in computer-mediated communications
CN110929122B (zh) Data processing method and apparatus, and apparatus for data processing
US20230342557A1 (en) Method and system for training a virtual agent using optimal utterances
US11789944B2 (en) User-specific computer interaction recall

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21857692

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21857692

Country of ref document: EP

Kind code of ref document: A1