CN115116467A - Audio marking method and device and electronic equipment - Google Patents

Audio marking method and device and electronic equipment

Info

Publication number
CN115116467A
CN115116467A (application CN202210730526.XA)
Authority
CN
China
Prior art keywords
audio
target
speaker
conference
digital
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210730526.XA
Other languages
Chinese (zh)
Inventor
苏天镜
李想
王斌
郑康
张鼎
赵田
耿泽
潘灶烽
王舒然
于书悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd filed Critical Beijing Zitiao Network Technology Co Ltd
Priority to CN202210730526.XA
Publication of CN115116467A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/0308 Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The disclosure relates to an audio marking method and device and electronic equipment, and in particular to the technical field of speech recognition. The method comprises the following steps: acquiring a target audio; clustering and splitting the target audio to obtain at least one speaker audio; acquiring a first digital mark according to a target speaker audio, wherein the target speaker audio is one of the at least one speaker audio; determining a registered second digital mark that matches the first digital mark; and marking the target speaker audio according to target user information associated with the second digital mark. The embodiments of the disclosure address the problem that current speaker identification in conferences cannot provide a viewer of the conference audio with the specific speaker information corresponding to each audio segment.

Description

Audio marking method and device and electronic equipment
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to an audio tagging method and apparatus, and an electronic device.
Background
Because each speaker's voice has unique characteristics, the voices of different speakers can be effectively identified and distinguished through those characteristics. At present, in a conference room scenario, the audio in the conference room can be recorded, clustered, and split into audio segments corresponding to different speakers, and the segments can then be labeled with generic distinguishing identifiers (for example, Speaker 1, Speaker 2, Speaker 3, and so on).
Disclosure of Invention
To solve the above technical problem, or at least partially solve it, the present disclosure provides an audio marking method, device, and electronic equipment that can match a first digital mark, extracted from a target speaker audio (one speaker audio) in a target audio, against registered digital marks, and, when the match with a registered second digital mark succeeds, mark the target speaker audio with the target user information associated with the second digital mark.
In order to achieve the above purpose, the technical solutions provided by the embodiments of the present disclosure are as follows:
in a first aspect, an audio marking method is provided, including:
acquiring a target audio;
clustering and splitting the target audio to obtain at least one speaker audio;
acquiring a first digital mark according to a target speaker audio, wherein the target speaker audio is one of the at least one speaker audio;
determining a registered second digital mark that matches the first digital mark;
and marking the target speaker audio according to target user information associated with the second digital mark.
As an optional implementation manner of this embodiment of the present disclosure, the determining the registered second digitized mark matching with the first digitized mark includes:
determining at least one participant information in a meeting schedule corresponding to the target audio;
acquiring a registered digital mark corresponding to the at least one participant information;
and determining a registered second digital mark matched with the first digital mark from the registered digital marks corresponding to the at least one participant information.
As an optional implementation manner of the embodiment of the present disclosure, the target audio is conference audio, and before determining the registered second digital mark matching the first digital mark, the method further comprises:
determining at least one participant device corresponding to the target audio;
acquiring a registered digital mark corresponding to the at least one participant device;
and determining a registered second digital mark matched with the first digital mark from the registered digital marks corresponding to the at least one participant device.
As an optional implementation manner of the embodiment of the present disclosure, the target audio is conference audio, and before determining the registered second digital mark matching the first digital mark, the method further includes:
acquiring a first audio corresponding to the target conference device from the first conference audio;
under the condition that the speaker of the first audio is unique, acquiring the second digital mark according to the first audio;
acquiring the target user information corresponding to the target participating device;
and correspondingly storing the target participating device, the second digital mark and the target user information so as to register the second digital mark.
As an optional implementation manner of the embodiment of the present disclosure, the method further includes:
under the condition that the speaker of the audio corresponding to the target participating device is not unique, when the target participating device enters a second conference, acquiring the audio of the second conference;
acquiring second audio corresponding to the target conference equipment from the second conference audio;
under the condition that the speaker of the second audio is unique, acquiring the second digital mark according to the second audio;
acquiring the target user information corresponding to the target participating device;
and correspondingly storing the target participating device, the second digital mark and the target user information so as to register the second digital mark.
As an optional implementation manner of the embodiment of the present disclosure, the method further includes:
under the condition that the first digital mark is not matched with all registered digital marks, marking the target speaker audio by adopting a target speaker identifier;
the target speaker identification is used for distinguishing a speaker corresponding to the audio of the target speaker from other speakers, and the other speakers are speakers corresponding to the audio of the at least one speaker except the audio of the target speaker.
As an optional implementation manner of the embodiment of the present disclosure, after the tagging of the audio of the target speaker according to the target user information associated with the second digital tag, the method further includes:
generating target character information based on the target speaker audio;
and displaying the target user information and the target character information in an associated manner.
In a second aspect, there is provided an audio tagging device comprising:
the acquisition module is used for acquiring a target audio;
the clustering module is used for clustering and splitting the target audio to obtain at least one speaker audio;
the system comprises a mark extraction module, a first digital mark extraction module and a second digital mark extraction module, wherein the mark extraction module is used for acquiring a first digital mark according to a target speaker audio, and the target speaker audio is one speaker audio in at least one speaker audio;
a matching module for determining a registered second digitized token that matches the first digitized token;
a tagging module for tagging the audio of the target speaker according to target user information associated with the second digitized tag.
As an optional implementation manner of the embodiment of the present disclosure, the target audio is a conference audio, and the matching module is specifically configured to:
determining at least one participant information in a meeting schedule corresponding to the target audio;
acquiring a registered digital mark corresponding to the at least one participant information;
and determining a registered second digital mark matched with the first digital mark from the registered digital marks corresponding to the at least one participant information.
As an optional implementation manner of the embodiment of the present disclosure, the matching module is further configured to:
prior to determining a registered second digitized token that matches the first digitized token, determining at least one participant device to which the target audio corresponds;
acquiring a registered digital mark corresponding to the at least one participant device;
and determining a registered second digital mark matched with the first digital mark from the registered digital marks corresponding to the at least one participant device.
As an optional implementation manner of the embodiment of the present disclosure, the target audio is conference audio, and the apparatus further includes:
the registration module is used for acquiring first audio corresponding to the target conference equipment from the first conference audio before the registered second digital mark matched with the first digital mark is determined;
under the condition that the speaker of the first audio is unique, acquiring the second digital mark according to the first audio;
acquiring the target user information corresponding to the target participating device;
and correspondingly storing the target participating device, the second digital mark and the target user information so as to register the second digital mark.
As an optional implementation manner of the embodiment of the present disclosure, the registration module 905 is further configured to:
under the condition that the speaker of the audio corresponding to the target participating device is not unique, when the target participating device enters a second conference, acquiring the audio of the second conference;
acquiring second audio corresponding to the target conference equipment from the second conference audio;
under the condition that the speaker of the second audio is unique, acquiring the second digital mark according to the second audio;
acquiring the target user information corresponding to the target meeting participating device;
and correspondingly storing the target participating device, the second digital mark and the target user information so as to register the second digital mark.
As an optional implementation manner of the embodiment of the present disclosure, the marking module is further configured to:
under the condition that the first digital mark is not matched with all registered digital marks, marking the target speaker audio by adopting a target speaker identifier;
the target speaker identification is used for distinguishing a speaker corresponding to the audio of the target speaker from other speakers, and the other speakers are speakers corresponding to the audio of the at least one speaker except the audio of the target speaker.
As an optional implementation manner of the embodiment of the present disclosure, the apparatus further includes: the display module is used for generating target character information based on the target speaker audio after the marking module marks the target speaker audio according to the target user information associated with the second digital mark; and displaying the target user information and the target character information in an associated manner.
In a third aspect, an electronic device is provided, including: a processor, a memory and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the audio tagging method according to the first aspect or any one of its alternative embodiments.
In a fourth aspect, a computer-readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, implements the audio tagging method as set forth in the first aspect or any one of its alternative embodiments.
In a fifth aspect, a computer program product is provided which, when run on a computer, causes the computer to implement the audio tagging method of the first aspect or any one of its alternative embodiments.
The embodiment of the disclosure provides an audio marking method, an audio marking device and electronic equipment, wherein the method comprises the following steps: acquiring a target audio; clustering and splitting the target audio to obtain at least one speaker audio; acquiring a first digital mark according to a target speaker audio, wherein the target speaker audio is one of the at least one speaker audio; and acquiring target user information associated with a second digital mark under the condition that the first digital mark is matched with the registered second digital mark. By the scheme, the first digital mark extracted from the target speaker audio (one speaker audio) in the target audio is matched with the registered digital mark, and when the first digital mark is successfully matched with the second digital mark, the target speaker audio is marked by the target user information associated with the second digital mark.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present disclosure, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a schematic view of an implementation scenario of an audio tagging method according to an embodiment of the present disclosure;
fig. 2 is a schematic view of an implementation scenario of another audio tagging method provided in an embodiment of the present disclosure;
fig. 3 is a schematic flowchart of an audio tagging method according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram of a page for displaying a meeting record according to an embodiment of the present disclosure;
fig. 5 is a schematic diagram of another page for displaying a meeting record according to an embodiment of the present disclosure;
fig. 6 is a schematic flowchart of another audio tagging method provided by an embodiment of the present disclosure;
fig. 7 is a schematic flowchart illustrating a process of registering a digital mark according to an embodiment of the present disclosure;
fig. 8 is a schematic flowchart illustrating a process of matching a digital mark according to an embodiment of the present disclosure;
FIG. 9 is a block diagram of an audio tagging device according to an embodiment of the present disclosure;
fig. 10 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
The inventor finds that, in a conference room scenario, the audio in the conference room can be recorded, clustered, and split into audio segments corresponding to different speakers, and that the segments can then be labeled with generic distinguishing identifiers (such as Speaker 1, Speaker 2, Speaker 3, and so on). However, such generic identifiers do not tell a viewer of the conference audio which specific person each audio segment corresponds to.
In order to solve the above problem, embodiments of the present disclosure provide an audio tagging method, an audio tagging apparatus, and an electronic device, in which a viewer of an audio can directly view user information tagged for different speakers in the audio, so as to obtain specific speaker information.
It is understood that before or during the application of the technical solutions disclosed in the embodiments of the present disclosure, the user should be informed of the type, usage scope, usage scenario, etc. of the personal information (e.g., user information, information such as digitized marks of audio, etc.) involved in the present disclosure in a proper manner according to relevant laws and regulations and obtain the authorization of the user.
For example, for the registration and matching of digital marks involved in the embodiments of the disclosure, in practical applications the user's authorization may be requested before registration and matching are performed, to allow the speaker recognition function to be turned on and the digital mark corresponding to the user's audio to be obtained.
For another example, in the embodiments of the present disclosure, before the step of obtaining the user information, the user's authorization to obtain the user information may be requested, and the user information is obtained only after the user grants it.
For another example, when a user's active request to perform a certain operation is received, a prompt message may be sent to the user to explicitly indicate that the requested operation will require acquiring and using the user's personal information. The user can then autonomously choose, according to the prompt information, whether to provide personal information to the software or hardware, such as an electronic device, application program, server, or storage medium, that performs the operations of the disclosed technical solution.
As an optional but non-limiting implementation manner, in response to receiving an active request from the user, the manner of sending the prompt information to the user may be, for example, a pop-up window, and the prompt information may be presented in a text manner in the pop-up window. In addition, a selection control for providing personal information to the electronic device by the user's selection of "agreeing" or "disagreeing" can be carried in the popup.
It is understood that the above notification and user authorization process is only illustrative and not limiting, and other ways of satisfying relevant laws and regulations may be applied to the implementation of the present disclosure.
The audio marking method, the device and the electronic equipment provided by the embodiment of the disclosure can match a first digital mark extracted from a target speaker audio (a speaker audio) in a target audio with a registered digital mark, and mark the target speaker audio by adopting target user information associated with a second digital mark when the first digital mark is successfully matched with the second digital mark.
It should be noted that the target audio related in the embodiment of the present disclosure may be audio in any scene, for example, may be conference audio, and may also be audio recorded in a certain indoor scene, audio recorded in a certain telephone call process, and the like. The following will exemplarily describe the audio marking method provided by the embodiment of the present disclosure by taking a conference scene and a target audio as conference audio as examples.
As shown in fig. 1, which is a schematic view of an implementation scenario of the audio tagging method provided in an embodiment of the present disclosure, the scenario involves a server 101 and 3 terminal devices: a terminal device 102, a terminal device 103, and a terminal device 104. A user A uses the terminal device 102, a user B uses the terminal device 103, and a user C and a user D share the terminal device 104 to conduct an online conference. The terminal device 102 may record the conference audio and send it to the server 101; the server may obtain the detailed information of the user A, the user B, the user C, and the user D through the method provided in this disclosure, mark the conference audio content corresponding to each user with that information, generate a conference record, and feed the conference record back to the terminal device 102.
As shown in fig. 2, which is a schematic view of another implementation scenario of the audio tagging method provided in an embodiment of the present disclosure, the scenario involves a server 201 and a conference room device 202. A user D, a user E, and a user F all conduct a conference through the conference room device 202, which may record the conference audio and send it to the server 201. The server may obtain the detailed information of the user D, the user E, and the user F by using the method provided by the embodiment of the present disclosure, label the conference audio content corresponding to each user with that information, generate a conference record, and feed the conference record back to the conference room device 202.
The audio marking method provided in the embodiments of the present disclosure may be implemented by an electronic device. In a specific application, the electronic device may be a server or a terminal device. When the electronic device is a server, the execution subject of the method may be a server program running in the server that corresponds to an information interaction terminal with a voice interaction function. When the electronic device is a terminal device, the execution subject can be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, portable wearable devices, and information interaction terminals with voice interaction functions. In specific applications, the information interaction terminal can be an intelligent interactive device with a voice interaction function, such as an intelligent robot or an intelligent household appliance; alternatively, it may be a client with a voice interaction function, for example a video client or an educational learning client. In addition, it will be appreciated that the client may be a web-page client or an app (application) client, as is reasonable.
As shown in fig. 3, a schematic flowchart of an audio tagging method provided in an embodiment of the present disclosure is shown, where the method includes:
301. and acquiring the target audio.
The target audio may be a conference audio recorded for any conference, or may also be an audio recorded in other scenes, for example, an audio recorded in a certain indoor scene, an audio recorded in a certain telephone call process, and the like.
For example, the target audio may be conference audio recorded in a conference scene as shown in fig. 1 or fig. 2.
302. And clustering and splitting the target audio to obtain at least one speaker audio.
The target audio comprises audio segments of at least one speaker. The audio may include human voice segments and noise segments, where the noise segments include but are not limited to silent segments, ambient noise, and white noise; for example, blank audio and electronic noise generated during device recording.
In some embodiments, human voice audio may be extracted from the target audio through Voice Activity Detection (VAD), and the extracted voice audio may then be clustered and split to obtain the at least one speaker audio.
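The VAD step above can be sketched with a simple energy-based detector. This is a minimal illustration only; the disclosure does not specify a VAD algorithm, and the frame length and energy threshold below are assumptions.

```python
import math

def energy_vad(samples, frame_len=160, threshold=0.02):
    """Return (start, end) sample-index pairs of voiced segments, found by
    thresholding per-frame RMS energy and merging adjacent voiced frames."""
    segments = []
    current = None
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        if rms > threshold:            # voiced frame: extend the current run
            if current is None:
                current = [start, start + frame_len]
            else:
                current[1] = start + frame_len
        elif current is not None:      # silence ends the current voiced run
            segments.append(tuple(current))
            current = None
    if current is not None:
        segments.append(tuple(current))
    return segments

# Synthetic signal: silence, a loud "voiced" burst, then silence again.
signal = [0.0] * 320 + [0.5] * 320 + [0.0] * 320
print(energy_vad(signal))  # [(320, 640)]
```

Only the voiced segment would be passed on to feature extraction and clustering; the silent regions are discarded as noise segments.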
In some embodiments, the target audio may be clustered and split into at least one speaker audio as follows. The target audio (or the extracted human voice audio) is segmented into a plurality of human voice audio segments, and audio feature information is extracted from each segment. The segments are then clustered according to the audio feature information to obtain at least one audio cluster. Because the similarity between pieces of audio feature information is the clustering criterion, each cluster obtained can be regarded as the audio of one speaker; the at least one speaker audio therefore corresponds to at least one speaker, and different clusters correspond to different speakers.
The process of clustering and splitting the target audio can be realized by adopting various clustering algorithms. The clustering algorithm may include: a distance-based clustering algorithm or a density-based clustering algorithm, etc., which are not specifically limited in this disclosure. For example, Spectral Clustering, K-Means Clustering, mean shift Clustering, Expectation-Maximization (EM) Clustering using Gaussian Mixture Model (GMM), agglomerative-hierarchical Clustering, Graph Community Detection (Graph Community Detection), and the like may be employed.
The audio feature information may be short-time spectral features such as Mel-Frequency Cepstral Coefficients (MFCCs), Perceptual Linear Prediction (PLP) features, or Filter Banks (FBanks), or features extracted by a Time-Delay Neural Network (TDNN), such as an identity vector (i-vector).
In some embodiments, any way of calculating the similarity between pieces of audio feature information can be applied to audio clustering. For example, the feature sequences of two pieces of audio feature information may be compared, and the similarity between the sequences used as the similarity between the features; alternatively, the audio feature information may be vectorized, the distance between the vectors calculated, and the reciprocal of that distance used as the similarity. Of course, the disclosure is not limited to these approaches.
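The vectorized variant described above can be made concrete as follows. The reciprocal-of-distance similarity mirrors the text; cosine similarity is shown as a common alternative, and both the metric choices and example vectors are illustrative assumptions.

```python
import math

def euclidean_similarity(a, b):
    """Reciprocal of the Euclidean distance between two feature vectors
    (returns infinity for identical vectors)."""
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return float("inf") if dist == 0 else 1.0 / dist

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

f1 = [1.0, 0.0, 2.0]
f2 = [1.0, 0.0, 2.0]
f3 = [0.0, 3.0, 0.0]
print(cosine_similarity(f1, f2))  # 1.0: same direction, likely same speaker
print(cosine_similarity(f1, f3))  # 0.0: orthogonal, likely different speakers
```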
In some embodiments, a preset number of pieces of audio feature information are taken from the audio feature information of the human voice audio sub-segments, and the similarity between every two of them is calculated. If the similarity between two pieces of audio feature information is greater than the maximum similarity threshold, the two are grouped into one cluster; this continues until, for each of the preset number of pieces, the similarity between that piece and every other piece in its cluster is greater than the maximum similarity threshold.
Illustratively, set a minimum similarity threshold of 0.50, a maximum similarity threshold of 0.85, and a preset number of 10. After the audio feature information of the human voice sub-segments is extracted, clustering is triggered: 10 pieces of feature information are selected from the extracted audio feature information, the similarity between every two of them is calculated, and if a similarity is greater than 0.85 the two pieces are clustered into one class, until the similarity between each of the selected 10 pieces and every piece in its cluster class is greater than 0.85.
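The threshold-based grouping above can be sketched as a greedy merge. This is a non-authoritative illustration: the 0.85 threshold matches the example, while the cosine similarity function and the merge order are assumptions not fixed by the text.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def greedy_cluster(features, max_sim_threshold=0.85):
    """Group feature vectors: each vector joins the first existing cluster
    whose members are ALL more similar to it than the threshold; otherwise
    it starts a new cluster (one cluster per presumed speaker)."""
    clusters = []
    for feat in features:
        for cluster in clusters:
            if all(cosine(feat, member) > max_sim_threshold for member in cluster):
                cluster.append(feat)
                break
        else:
            clusters.append([feat])
    return clusters

# Two near-identical "speaker A" vectors and one orthogonal "speaker B" vector.
feats = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(len(greedy_cluster(feats)))  # 2 clusters -> 2 presumed speakers
```

Each resulting cluster collects the feature vectors, and hence the audio sub-segments, attributed to one speaker.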
303. A first digitized mark is obtained from the target speaker audio.
The target speaker audio is one of the at least one speaker audio.
In the embodiments of the present disclosure, a digitized mark (such as the first digitized mark, the second digitized mark, etc.) is a sound wave spectrum carrying speech information, used for distinguishing and identifying the user's voice. The digitized mark may be a voiceprint feature, or may be other feature information, besides the voiceprint, that identifies and distinguishes the user's voice.
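As a toy illustration only (an assumption, not the disclosure's extraction method), a fixed-length mark can be formed by pooling per-frame spectral features of a user's audio into one vector:

```python
def digitized_mark(frame_features):
    """Toy digitized mark: average the per-frame feature vectors of a
    user's audio into a single fixed-length vector that can serve as a
    voiceprint-style identifier for matching."""
    n = len(frame_features)
    return [sum(col) / n for col in zip(*frame_features)]
```

Real systems would use a trained speaker-embedding model instead of a plain average.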
304. The first digitized mark is matched with the registered digitized marks.
305. In the case that the first digitized mark matches a registered second digitized mark, the target user information associated with the second digitized mark is acquired.
After the first digitized mark is obtained from the target speaker audio, a registered second digitized mark that matches it can be determined.
In some embodiments, determining the registered second digitized mark that matches the first digitized mark may include first matching the first digitized mark against the registered digitized marks, and then, if the first digitized mark matches a registered second digitized mark, acquiring the target user information associated with the second digitized mark.
The registered digitized marks may refer to all digitized marks stored in the database, or to a part of the digitized marks stored in the database.
In some embodiments, when acquiring the target user information associated with the second digitized mark, the user may first be requested to authorize the acquisition, and the target user information is acquired only after the authorization is granted.
In some embodiments, after the target audio is obtained, at least one piece of participant information in the conference schedule corresponding to the target audio may further be determined; the registered digitized marks corresponding to the at least one piece of participant information are acquired; and then a registered second digitized mark that matches the first digitized mark is determined from those registered digitized marks, and the target user information associated with the second digitized mark is acquired. That is, the first digitized mark is matched against each of the registered digitized marks, and in the case that it matches a second digitized mark among them, the target user information associated with the second digitized mark is acquired.
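A minimal sketch of this schedule-scoped matching follows; the registry shape (a user-id-to-mark mapping), the `matcher` callable, and the threshold value are all assumptions for illustration:

```python
def match_within_schedule(first_mark, schedule_participants, registry,
                          matcher, threshold=0.85):
    """Match the first digitized mark only against the registered marks
    of participants listed in the conference schedule; return the id of
    the user whose mark (the second mark) matched, or None."""
    for user_id in schedule_participants:
        registered = registry.get(user_id)
        if registered is not None and matcher(first_mark, registered) > threshold:
            return user_id  # user information associated with the matched mark
    return None
```

Restricting the loop to schedule participants is what shrinks the matching range described in the next paragraph.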
Because a conference schedule is usually created before some conferences start, to notify the participants, and the conference schedule usually includes participant information, once at least one piece of participant information in the conference schedule corresponding to the target audio is determined in the above manner, the registered digitized marks corresponding to that participant information can be obtained from the database. This narrows the range over which the first digitized mark is subsequently matched against registered digitized marks, reducing the data volume during matching, improving matching efficiency, and lowering device overhead.
In some embodiments, after the target audio is obtained, at least one participant device corresponding to the target audio may further be determined, and the registered digitized marks corresponding to the at least one participant device are acquired. Then, from those registered digitized marks, a registered second digitized mark that matches the first digitized mark is determined, and the target user information associated with the second digitized mark is acquired. That is, the first digitized mark is matched against each of the registered digitized marks, and in the case that it matches a second digitized mark among them, the target user information associated with the second digitized mark is acquired.
Typically there are one or more participant devices in an online conference. As shown in fig. 1, there are 3 participant devices: terminal device 102, terminal device 103, and terminal device 104; as shown in fig. 2, there are 2 participant devices, including conference room device 202. In the case that participant device information and digitized marks are stored correspondingly in the database, by determining the at least one participant device corresponding to the target audio, the registered digitized marks corresponding to the at least one participant device can first be obtained from the database. This narrows the range over which the first digitized mark is subsequently matched against registered digitized marks, reducing the data volume during matching, improving matching efficiency, and lowering device overhead.
In some embodiments, a first audio corresponding to the target participant device may first be obtained from historically stored first conference audio, and it is then determined whether the speaker of the first audio is unique. In the case that the speaker of the first audio is unique, a second digitized mark is obtained from the first audio, that is, the second digitized mark is extracted from the first audio. Target user information corresponding to the target participant device is then obtained. Finally, the target participant device, the second digitized mark, and the target user information are stored correspondingly, so as to register the second digitized mark. After registration is complete, the first digitized mark can be matched against the registered second digitized mark.
Correspondingly, in the case that the speaker of the audio corresponding to the target participant device is not unique, when the target participant device joins a second conference, second conference audio is obtained; a second audio corresponding to the target participant device is obtained from the second conference audio; in the case that the speaker of the second audio is unique, the second digitized mark is obtained from the second audio; target user information corresponding to the target participant device is obtained; and the target participant device, the second digitized mark, and the target user information are stored correspondingly, so as to register the second digitized mark.
The second conference is a conference after the first conference and before the conference corresponding to the target audio. That is, in the case that registration through the first conference does not succeed, digitized mark extraction and registration are attempted again on the second conference audio of a second conference that the target participant device subsequently joins.
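The registration-from-history step above can be sketched as follows; every callable and parameter name here is a stand-in assumed for illustration, not the disclosure's API:

```python
def register_from_history(get_audio_for_device, device, extract_mark,
                          speaker_unique, store, get_user_info):
    """Take the device's audio from a historically stored conference;
    only when its speaker is unique, extract the second digitized mark
    and store (device, mark, user info) together to register it.
    Returns False when registration must wait for a later conference."""
    audio = get_audio_for_device(device)
    if not speaker_unique(audio):
        return False  # retry with the audio of a subsequent (second) conference
    store(device, extract_mark(audio), get_user_info(device))
    return True
```

A caller would simply retry this with each later conference's audio until it returns True.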
Determining whether the speaker of the first audio is unique may be performed by splitting the first audio into a plurality of audio segments and then clustering the segments with a clustering algorithm. If the plurality of audio segments are clustered into one cluster, the speaker of the first audio is determined to be unique; if they are clustered into a plurality of clusters, the speaker of the first audio is determined not to be unique.
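That split-then-cluster uniqueness check reduces to counting clusters, as in this sketch (`split_fn` and `cluster_fn` are stand-ins for the disclosure's splitting and clustering routines):

```python
def speaker_is_unique(first_audio, split_fn, cluster_fn):
    """Split the first audio into segments, cluster the segments, and
    treat the speaker as unique exactly when one cluster results."""
    segments = split_fn(first_audio)
    return len(cluster_fn(segments)) == 1
```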
306. The target speaker audio is marked according to the target user information.
The target user information may include: the name, avatar, phone number, etc. of the target user.
In some embodiments, marking the target speaker audio according to the target user information may include: recognizing the target speaker audio to obtain corresponding target text content, and marking the target speaker audio and/or the target text content according to the target user information.
In the embodiments of the present disclosure, the target speaker audio and/or the target text content is marked according to the target user information to generate a conference record for the target audio. Since the target speaker audio is one of the at least one speaker audio, performing the processing steps described for the target speaker audio on each of the at least one speaker audio marks each speaker audio and/or its corresponding text content with the user information of the corresponding user. The resulting conference record can thus display the user information of the corresponding speaker alongside each speaker audio and/or its corresponding text content.
In some embodiments, after the target speaker audio is marked according to the target user information associated with the second digitized mark, target text information may also be generated based on the target speaker audio, and the target user information and the target text information (i.e., the text content corresponding to the target speaker audio) are displayed in association.
Exemplarily, fig. 4 is a schematic page for displaying a conference record provided by an embodiment of the present disclosure. The page displays a conference video picture and the text record (i.e., text content) corresponding to the conference, and for the text content corresponding to each speaker, the avatar and name of the corresponding user are marked. As can be seen from fig. 4, the text record shows the text content, avatars, and names corresponding to two speakers: the speaker named "Xiao A", the text content of Xiao A's speech in the conference, and Xiao A's avatar; and the speaker named "Xiao B", the text content of Xiao B's speech in the conference, and Xiao B's avatar.
In the above embodiment, the target speaker audio can be marked with the target user information, and the target user information and the target text information generated based on the target speaker audio can be displayed in association. Thus, when a user queries the conference record corresponding to the target audio in a conference scene, the user can tell which user each piece of recorded text belongs to, making the marking in the conference record clearer and the human-computer interaction better.
307. In the case that the first digitized mark does not match any registered digitized mark, the target speaker audio is marked with a target speaker identifier.
The target speaker identifier is used to distinguish the speaker corresponding to the target speaker audio from other speakers, where the other speakers are the speakers corresponding to the speaker audios, among the at least one speaker audio, other than the target speaker audio.
Since some users may not have registered a digitized mark, when the first digitized mark does not match any registered digitized mark, the target speaker audio may be marked with a target speaker identifier, for example, as "speaker n", where n is an integer greater than or equal to 1.
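This fallback labeling can be sketched as below; `lookup_user` stands in for the digitized-mark matching step, and the default room name is an assumption for illustration:

```python
def label_speakers(speaker_audios, lookup_user, room_name="conference room A"):
    """Matched speakers get their user info; unmatched ones get
    'speaker n-room name' with n counting up from 1, as in fig. 5."""
    labels, n = [], 0
    for audio in speaker_audios:
        user = lookup_user(audio)
        if user is not None:
            labels.append(user)
        else:
            n += 1
            labels.append(f"speaker {n}-{room_name}")
    return labels
```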
Fig. 5 is another schematic page for displaying a conference record provided by this disclosure. Assume two speakers exist in a conference and the user information of one of them is determined; when the text content corresponding to that speaker is displayed, the avatar and name of the corresponding user are marked, such as the speaker named "Xiao B" in fig. 5, the text content of Xiao B's speech in the conference, and Xiao B's avatar. If the user information of the other speaker is not obtained, the text content corresponding to that speaker may be marked with the target speaker identifier 51 "speaker 1-conference room A", as shown in fig. 5.
An embodiment of the present disclosure provides an audio marking method, including the following steps: acquiring a target audio; clustering and splitting the target audio to obtain at least one speaker audio; acquiring a first digitized mark from a target speaker audio, where the target speaker audio is one of the at least one speaker audio; and after a registered second digitized mark that matches the first digitized mark is determined, acquiring target user information associated with the second digitized mark and marking the target speaker audio with the target user information. In this scheme, the first digitized mark extracted from the target speaker audio (one speaker audio) in the target audio is matched against the registered digitized marks, and when the first digitized mark successfully matches a second digitized mark, the target speaker audio is marked with the target user information associated with the second digitized mark.
It should be noted that the solution shown in fig. 3 involves registration of digitized marks and matching of digitized marks. In practical applications, before either is performed, user authorization may be obtained in the actual scenario to allow the speaker recognition function to be turned on and the digitized mark corresponding to the user's audio to be acquired.
Fig. 6 is a schematic flow chart of another audio marking method provided by an embodiment of the present disclosure. The audio marking method includes a stage of activating a speaker recognition function (also called a digitized mark recognition function), a digitized mark registration stage, and a digitized mark matching stage. The method involves an interaction process among an enterprise tenant management device, enterprise employee devices, and a server.
601. The enterprise tenant management device turns on the speaker recognition function.
The conference application can provide an option for turning the speaker recognition function on/off; before the user operates it, the function is off by default.
The enterprise tenant administrator triggers the option to activate the speaker recognition function.
602. The enterprise tenant management device selects enterprise employees and sends an invitation message for turning on the speaker recognition function to the enterprise employee devices of the selected employees.
The invitation message includes introduction information for the speaker recognition function, reminding the user that the function will obtain a digitized mark corresponding to the user's audio, and the user can choose whether to allow the function.
Specifically, the enterprise tenant management device selects the enterprise employees through the conference application and sends the invitation message to the enterprise employee devices of the selected employees.
603. The enterprise employee device receives the invitation message sent by the enterprise tenant management device.
The enterprise employee device receives the invitation message through the conference application, and the enterprise employee triggers the option to activate the speaker recognition function. It can be understood that even if the enterprise tenant administrator has triggered the option to activate the speaker recognition function, the function is activated only with the personal consent of the enterprise employee.
After turning the speaker recognition function on, the enterprise employee can turn it off at any time and clear the registered digitized mark.
604. An operation by which the user turns on the digitized mark recognition function is received through the enterprise employee device.
605. The enterprise employee device sends the server a feedback message that the digitized mark recognition function has been successfully turned on.
606. The server may obtain first conference audio in which the user who turned on the digitized mark recognition function historically participated.
After the enterprise employee device sends the feedback message that the digitized mark recognition function has been successfully turned on, the server can acquire the first conference audio of historical conferences in which that user participated.
607. It is determined whether the speaker of the first conference audio is unique.
608. In the case that the speaker of the first conference audio is unique, a second digitized mark is obtained from the first conference audio.
609. The second digitized mark and the user information of the user are stored correspondingly to complete registration of the digitized mark.
610. In the case that the speaker of the first conference audio is not unique, the second digitized mark is extracted from the conference audio of a subsequent conference, and registration is completed.
Fig. 7 is a schematic flow chart of digitized mark registration provided by an embodiment of the present disclosure. Historical first conference audio is acquired; if the first conference audio is single-speaker audio (i.e., the speaker of the first conference audio is unique), a digitized mark is extracted from it to obtain the second digitized mark. If s digitized marks are currently stored, the second digitized mark is cross-checked against the s digitized marks and the check result is judged: if the check result is consistent, registration is completed based on the second digitized mark, and after successful registration a message is sent to the front end to prompt the user that the digitized mark has been registered. If the check result is inconsistent, the second digitized mark may be stored temporarily. Here s is an integer greater than or equal to 1.
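One possible reading of the fig. 7 cross-check (an assumption; the disclosure does not define "consistent" precisely) is that the new second mark must agree with every mark already stored for the user before registration completes:

```python
def cross_check_and_register(second_mark, stored_marks, similarity, threshold=0.85):
    """Cross-check sketch: compare the new second mark with the s marks
    already stored; register it only when it is consistent (similar
    enough) with all of them, otherwise hold it temporarily.
    With no stored marks, all() is vacuously true and registration
    proceeds directly, matching the 'if s marks are stored' condition."""
    if all(similarity(second_mark, m) > threshold for m in stored_marks):
        stored_marks.append(second_mark)
        return "registered"
    return "temporarily stored"
```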
611. The server acquires a target audio.
The server can receive the target audio sent by any participant device of the conference corresponding to the target audio.
612. The server carries out clustering splitting on the target audio to obtain at least one speaker audio.
In some embodiments, in the case that the speaker recognition function has been turned on by a participant in the schedule information of the conference corresponding to the target audio, the following 613 to 619 may be performed.
In some embodiments, in the case that the speaker recognition function has been turned on by the user corresponding to a participant device of the conference corresponding to the target audio, the following 613 to 619 may be performed.
613. The server obtains a first digitized mark from the audio of the target speaker.
Wherein the targeted speaker audio is one of the at least one speaker audio.
614. The server matches the first digitized token with the registered digitized tokens.
The first digitized mark may be matched with registered digitized marks stored in a digitized mark library.
The matching range may include the following cases:
case 1: the conference comprises a plurality of enterprise tenants, and the matching range can comprise: the digital marks of the participants in the meeting schedule of the enterprise tenant where the meeting organizer is located, and the digital marks of the users corresponding to the current conference-entering equipment.
Case 2: for a business portfolio meeting, for a reservation meeting, the matching scopes may include: the digital marks of the participants in the conference schedule of the reserved conference and the digital marks of the corresponding users of the current conference-joining equipment can be used.
Case 3: for a business portfolio meeting, for an instant meeting, the matching scopes may include: the current conferencing device corresponds to the user's digitized label.
615. In the case that the first digitized mark matches a registered second digitized mark, the target user information associated with the second digitized mark is acquired.
If the matching succeeds, the name and avatar of the corresponding user are displayed; unmatched speakers are presented in the form of speaker n plus the conference room name.
616. The target speaker audio is marked according to the target user information (speaker name, avatar, etc.) to generate a conference record.
For the descriptions of 611 to 616, reference may be made to the above description in fig. 3, and details will not be further described here.
617. And sending the meeting record to enterprise employee equipment.
618. The enterprise employee device displays a meeting record in which the speaker name, avatar, etc. may be displayed.
Further, a specific identifier may be displayed to indicate that the speaker's avatar and name were derived from digitized mark recognition; for example, as shown in fig. 4, the specific identifier 41 is displayed in the lower right corner of the avatar to indicate that the speaker's avatar and name come from digitized mark recognition.
619. Information such as the speaker name and avatar in the conference record is confirmed/edited.
Furthermore, an administrator of the enterprise tenant can grant some users the permission to edit user information.
When a user who has the permission to edit user information views the conference record, prompt information can be output to prompt the user that the displayed speaker name, avatar, and other information can be modified. The user can make the modification by clicking the speaker name, avatar, etc.
When a user who does not have the permission to edit user information views the conference record, that user can, for a recognition result, choose to delete the user name, avatar, and other information obtained through digitized mark recognition.
Furthermore, after the user chooses to delete the speaker name, avatar, and other information identified through the digitized mark, the speaker information corresponding to that user is changed to "speaker n-conference room name".
Fig. 8 is a schematic flow chart of digitized mark matching provided by an embodiment of the present disclosure. First, conference room audio (i.e., the target audio) is obtained. Speaker clustering (also referred to as SSD technology) is then performed on the audio of the conference to divide the conference room audio into one or more speaker audios. Next, it is determined whether the current enterprise tenant has turned on digitized mark recognition. If not, speaker n plus the conference room name can be used to mark the different speakers in the conference room audio. If it is turned on, the digitized marks corresponding to the conference participants are obtained from the digitized mark library, the digitized marks of the one or more speaker audios obtained by dividing the conference room audio are matched against the digitized marks corresponding to the conference participants, and it is judged whether each matches. For a speaker audio that matches successfully, the user information of the corresponding participant can be used as the mark when the conference record corresponding to that speaker audio is displayed; for a speaker audio that does not match, speaker n plus the conference room name can be used, for example, the mark "speaker 1-conference room name".
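The fig. 8 pipeline can be sketched end to end as follows; `diarize` and `match` are stand-ins for the speaker clustering and digitized-mark matching steps, and all names are illustration-only assumptions:

```python
def tag_conference_audio(room_audio, diarize, recognition_on,
                         participant_marks, match, room_name):
    """Diarize the conference room audio into per-speaker audios, then
    either match each against the participants' registered digitized
    marks (when recognition is on) or fall back to
    'speaker n-room name' labels."""
    records, n = [], 0
    for speaker_audio in diarize(room_audio):
        user = match(speaker_audio, participant_marks) if recognition_on else None
        if user is None:  # no registered mark matched, or recognition is off
            n += 1
            user = f"speaker {n}-{room_name}"
        records.append((user, speaker_audio))
    return records
```

Each `(label, speaker_audio)` pair then feeds the conference-record display described above.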
As shown in fig. 9, an embodiment of the present disclosure provides a block diagram of an audio tagging device, where the device includes:
an obtaining module 901, configured to obtain a target audio;
a clustering module 902, configured to perform clustering and splitting on the target audio to obtain at least one speaker audio;
a tag extraction module 903, configured to obtain a first digital tag according to a target speaker audio, where the target speaker audio is a speaker audio in the at least one speaker audio;
a matching module 904 for determining a registered second digitized token that matches the first digitized token;
a tagging module 905 configured to tag the audio of the target speaker according to the target user information associated with the second digitized tag.
As an optional implementation manner of the embodiment of the present disclosure, the target audio is a conference audio, and the matching module 904 is specifically configured to:
determining at least one participant information in a meeting schedule corresponding to the target audio;
acquiring a registered digital mark corresponding to the at least one participant information;
and determining a registered second digital mark matched with the first digital mark from the registered digital marks corresponding to the at least one participant information.
As an optional implementation manner of the embodiment of the present disclosure, the matching module 904 is further configured to:
prior to determining a registered second digitized token that matches the first digitized token, determining at least one participant device to which the target audio corresponds;
acquiring a registered digital mark corresponding to the at least one participant device;
and determining a registered second digital mark matched with the first digital mark from the registered digital marks corresponding to the at least one participant device.
As an optional implementation manner of the embodiment of the present disclosure, the target audio is a conference audio, and the apparatus further includes:
a registering module 906, configured to obtain a first audio corresponding to the target meeting device from the first meeting audio before the matching module 904 determines that the first digital signature matches the registered second digital signature;
under the condition that the speaker of the first audio is unique, acquiring the second digital mark according to the first audio;
acquiring the target user information corresponding to the target participating device;
and correspondingly storing the target participating device, the second digital mark and the target user information so as to register the second digital mark.
As an optional implementation manner of the embodiment of the present disclosure, the registration module 906 is further configured to:
under the condition that the speaker of the audio corresponding to the target participating device is not unique, when the target participating device enters a second conference, acquiring the audio of the second conference;
acquiring second audio corresponding to the target conference equipment from the second conference audio;
under the condition that the speaker of the first audio is unique, acquiring the second digital mark according to the second audio;
acquiring the target user information corresponding to the target participating device;
and correspondingly storing the target participating device, the second digital mark and the target user information so as to register the second digital mark.
As an optional implementation manner of the embodiment of the present disclosure, the marking module 905 is further configured to:
under the condition that the first digital mark is not matched with all registered digital marks, marking the target speaker audio by adopting a target speaker identifier;
the target speaker identification is used for distinguishing a speaker corresponding to the audio of the target speaker from other speakers, and the other speakers are speakers corresponding to the audio of the at least one speaker except the audio of the target speaker.
As an optional implementation manner of the embodiment of the present disclosure, the apparatus further includes: a display module 907, configured to generate target text information based on the target speaker audio after the tagging module 905 tags the target speaker audio according to the target user information associated with the second digital tag; and displaying the target user information and the target character information in an associated manner.
Fig. 10 is a hardware structure diagram of an electronic device provided by an embodiment of the present disclosure. The electronic device includes: a processor 1001, a memory 1002, and a computer program stored on the memory 1002 and executable on the processor 1001. When executed by the processor 1001, the computer program implements the respective processes of the audio marking method in the above method embodiments and can achieve the same technical effects; to avoid repetition, details are not repeated here.
The embodiment of the present disclosure provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the audio tagging method in the foregoing method embodiments, and can achieve the same technical effect, and in order to avoid repetition, the computer program is not described herein again.
The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
The embodiments of the present disclosure provide a computer program product storing a computer program. When executed by a processor, the computer program implements each process of the audio marking method in the foregoing method embodiments and can achieve the same technical effects; to avoid repetition, details are not repeated here.
As will be appreciated by one of skill in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media having computer-usable program code embodied in the medium.
In the present disclosure, the processor may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
In the present disclosure, the memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
In the present disclosure, computer-readable media include both permanent and non-permanent, removable and non-removable storage media. A storage medium may implement information storage by any method or technology, and the information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It is noted that, in this document, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely a description of exemplary embodiments of the present disclosure, provided to enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. An audio tagging method, comprising:
acquiring a target audio;
clustering and splitting the target audio to obtain at least one speaker audio;
acquiring a first digital mark according to a target speaker audio, wherein the target speaker audio is one of the at least one speaker audio;
determining a registered second digital mark that matches the first digital mark; and
tagging the target speaker audio according to target user information associated with the second digital mark.
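The five steps of claim 1 can be sketched as follows. This is only one illustrative reading of the claim, not the patented implementation: the cosine-similarity matcher, the 0.8 threshold, and all identifiers are assumptions, and a real system would derive the digital marks from a speaker-embedding (voiceprint) model rather than from toy vectors.

```python
import math

def cosine(a, b):
    # Similarity between two digital marks (here, plain embedding vectors).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def match_registered(first_mark, registry, threshold=0.8):
    # Find the registered second digital mark closest to the first digital
    # mark; return the associated user info, or None if nothing matches.
    best_user, best_score = None, threshold
    for second_mark, user_info in registry:
        score = cosine(first_mark, second_mark)
        if score >= best_score:
            best_user, best_score = user_info, score
    return best_user

def tag_speakers(speaker_marks, registry):
    # Tag each clustered speaker audio (represented by its mark) with the
    # matched user's information.
    return {sid: match_registered(mark, registry)
            for sid, mark in speaker_marks.items()}
```

Unmatched speakers come back as `None`, which is where the fallback identifier of claim 6 would apply.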
2. The method of claim 1, wherein the target audio is conference audio, and wherein determining the registered second digital mark that matches the first digital mark comprises:
determining at least one piece of participant information in a meeting schedule corresponding to the target audio;
acquiring registered digital marks corresponding to the at least one piece of participant information; and
determining, from the registered digital marks corresponding to the at least one piece of participant information, a registered second digital mark that matches the first digital mark.
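The candidate narrowing in claim 2 (restricting the registry to this meeting's participants before matching) can be sketched as below; the schedule and registry data shapes are assumptions for illustration only:

```python
def candidate_marks(meeting_schedule, full_registry):
    # Keep only registered digital marks belonging to participants listed in
    # the meeting schedule, shrinking the search space before matching the
    # first digital mark against them.
    participants = set(meeting_schedule["participants"])
    return [(mark, user) for mark, user in full_registry if user in participants]
```

Claim 3 applies the same idea with participant devices instead of schedule entries as the filter key.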
3. The method of claim 1, wherein the target audio is conference audio, and wherein determining the registered second digital mark that matches the first digital mark comprises:
determining at least one participant device corresponding to the target audio;
acquiring registered digital marks corresponding to the at least one participant device; and
determining, from the registered digital marks corresponding to the at least one participant device, a registered second digital mark that matches the first digital mark.
4. The method of claim 1, wherein the target audio is conference audio, and wherein, before determining the registered second digital mark that matches the first digital mark, the method further comprises:
acquiring, from first conference audio, first audio corresponding to a target participant device;
acquiring, in a case that the first audio has a unique speaker, the second digital mark according to the first audio;
acquiring the target user information corresponding to the target participant device; and
storing the target participant device, the second digital mark, and the target user information in correspondence, so as to register the second digital mark.
5. The method of claim 4, further comprising:
acquiring, in a case that the audio corresponding to the target participant device does not have a unique speaker, second conference audio when the target participant device joins a second conference;
acquiring, from the second conference audio, second audio corresponding to the target participant device;
acquiring, in a case that the second audio has a unique speaker, the second digital mark according to the second audio;
acquiring the target user information corresponding to the target participant device; and
storing the target participant device, the second digital mark, and the target user information in correspondence, so as to register the second digital mark.
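The registration flow of claims 4 and 5 (register a voiceprint for a device's user only when that device's audio has exactly one speaker, otherwise retry in a later conference) can be sketched as below. The segment dictionaries, `user_lookup`, and `extract_mark` callback are hypothetical names, not from the patent:

```python
def try_register(device_id, segments, user_lookup, registry, extract_mark):
    # Register a second digital mark for the device's user only when the
    # device's audio contains exactly one speaker (claim 4); if more than
    # one speaker is heard, defer to a later conference (claim 5).
    speakers = {seg["speaker"] for seg in segments}
    if len(speakers) != 1:
        return False  # speaker not unique: try again next conference
    registry[device_id] = (extract_mark(segments), user_lookup[device_id])
    return True
```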
6. The method of claim 1, further comprising:
marking, in a case that the first digital mark matches none of the registered digital marks, the target speaker audio with a target speaker identifier;
wherein the target speaker identifier distinguishes the speaker corresponding to the target speaker audio from other speakers, the other speakers being the speakers corresponding to the at least one speaker audio other than the target speaker audio.
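The fallback of claim 6 amounts to assigning each unmatched speaker a distinguishing placeholder while matched speakers keep their registered user information; a minimal sketch, with the `Speaker N` label format being an assumption:

```python
def assign_labels(speaker_ids, matched_users):
    # Matched speakers get their registered user info; unmatched speakers
    # get a distinguishing placeholder identifier (claim 6).
    labels, n = {}, 0
    for sid in speaker_ids:
        user = matched_users.get(sid)
        if user is None:
            n += 1
            user = f"Speaker {n}"
        labels[sid] = user
    return labels
```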
7. The method of any one of claims 1 to 6, wherein, after tagging the target speaker audio according to the target user information associated with the second digital mark, the method further comprises:
generating target text information based on the target speaker audio; and
displaying the target user information and the target text information in association with each other.
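The associated display of claim 7 pairs each speaker's transcribed text with the matched user information. A sketch, where `transcribe` stands in for any speech-to-text engine (the patent does not specify one):

```python
def associated_display(tagged_audio, transcribe):
    # Generate text for each tagged speaker audio and display it alongside
    # the associated user information, e.g. "Alice: Hello everyone".
    return [f"{user}: {transcribe(audio)}" for user, audio in tagged_audio]
```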
8. An audio tagging device, comprising:
an acquisition module, configured to acquire target audio;
a clustering module, configured to cluster and split the target audio to obtain at least one speaker audio;
a mark extraction module, configured to acquire a first digital mark according to target speaker audio, the target speaker audio being one of the at least one speaker audio;
a matching module, configured to determine a registered second digital mark that matches the first digital mark; and
a tagging module, configured to tag the target speaker audio according to target user information associated with the second digital mark.
9. An electronic device, comprising: a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the audio tagging method of any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the audio tagging method of any one of claims 1 to 7.
CN202210730526.XA 2022-06-24 2022-06-24 Audio marking method and device and electronic equipment Pending CN115116467A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210730526.XA CN115116467A (en) 2022-06-24 2022-06-24 Audio marking method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210730526.XA CN115116467A (en) 2022-06-24 2022-06-24 Audio marking method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN115116467A true CN115116467A (en) 2022-09-27

Family

ID=83329607

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210730526.XA Pending CN115116467A (en) 2022-06-24 2022-06-24 Audio marking method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115116467A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118098243A (en) * 2024-04-26 2024-05-28 深译信息科技(珠海)有限公司 Audio conversion method and device and related equipment


Similar Documents

Publication Publication Date Title
CN111488433B (en) Artificial intelligence interactive system suitable for bank and capable of improving field experience
CN107430858B (en) Communicating metadata identifying a current speaker
US11062698B2 (en) Image-based approaches to identifying the source of audio data
US10915570B2 (en) Personalized meeting summaries
Anguera et al. Speaker diarization: A review of recent research
US6219407B1 (en) Apparatus and method for improved digit recognition and caller identification in telephone mail messaging
US20180342251A1 (en) Automatic speaker identification in calls using multiple speaker-identification parameters
US20200126560A1 (en) Smart speaker and operation method thereof
CN110634472B (en) Speech recognition method, server and computer readable storage medium
CN107492153B (en) Attendance system, method, attendance server and attendance terminal
WO2019048063A1 (en) Voice-controlled management of user profiles
US11783829B2 (en) Detecting and assigning action items to conversation participants in real-time and detecting completion thereof
Pesarin et al. Conversation analysis at work: detection of conflict in competitive discussions through semi-automatic turn-organization analysis
CN115116467A (en) Audio marking method and device and electronic equipment
CN112235470A (en) Incoming call client follow-up method, device and equipment based on voice recognition
CN108710682B (en) Object recommendation method, device and equipment
WO2020024415A1 (en) Voiceprint recognition processing method and apparatus, electronic device and storage medium
CN113282725A (en) Dialogue interaction method and device, electronic equipment and storage medium
CN116049411B (en) Information matching method, device, equipment and readable storage medium
KR20200140235A (en) Method and device for building a target speaker's speech model
CN108777804B (en) Media playing method and device
US11705134B2 (en) Graph-based approach for voice authentication
CN113360630B (en) Interactive information prompting method
CN114125368B (en) Conference audio participant association method and device and electronic equipment
CN111785280A (en) Identity authentication method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination