CN110648667A - Multi-person scene human voice matching method - Google Patents

Multi-person scene human voice matching method

Info

Publication number
CN110648667A
CN110648667A (application CN201910918342.4A)
Authority
CN
China
Prior art keywords
voice
frame
speaker
sequence
predicted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910918342.4A
Other languages
Chinese (zh)
Other versions
CN110648667B (en)
Inventor
唐立军
杨家全
周年荣
张林山
李浩涛
杨洋
冯勇
严玉廷
李孟阳
罗恩博
梁俊宇
袁兴宇
李响
何婕
栾思平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Electric Power Research Institute of Yunnan Power System Ltd
Original Assignee
Electric Power Research Institute of Yunnan Power System Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Electric Power Research Institute of Yunnan Power System Ltd filed Critical Electric Power Research Institute of Yunnan Power System Ltd
Priority to CN201910918342.4A priority Critical patent/CN110648667B/en
Publication of CN110648667A publication Critical patent/CN110648667A/en
Application granted granted Critical
Publication of CN110648667B publication Critical patent/CN110648667B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/0308 Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the present application provides a method for matching human voices in a multi-person scene, which comprises the following steps: dividing the audio to be matched into a plurality of sound segments; performing speech recognition on the sound segments to obtain the speech segments among them; acquiring the video clip corresponding to each speech segment; performing face detection on the video clip to obtain all predicted speakers of the speech segment; obtaining hit information for each predicted speaker in adjacent gray frames according to the pixel difference values of adjacent gray frames in the video clip; and counting the number of hits of each predicted speaker in the video clip according to the hit information, the predicted speaker with the largest number of hits being the target speaker of the speech segment. The method automatically binds each voice to its target speaker, greatly reduces the workload of subsequently matching voices to speakers by hand, and promotes the practical application of audio-visual perception technology.

Description

Multi-person scene human voice matching method
Technical Field
The application relates to the technical field of voice matching, in particular to a method for matching voices in a multi-person scene.
Background
With the continuous development of natural language processing technology, the speech recognition function that converts sound into text keeps improving. In some multi-person conversation scenes, however, such as multi-person conference minutes and interview summaries, the identity of the speaker must be recognized in addition to converting sound into text, so that each voice can be matched to its speaker and the conference minutes or interview summary can be recorded completely.
In the related art, different speakers can be distinguished by voiceprint recognition. However, voiceprint recognition requires collecting a section of each speaker's voice in advance to extract that speaker's voice features as the basis for recognition, which is not feasible in some multi-person conversation scenes. For audio and video that need voice matching, the voice content is therefore mostly matched to specific speakers by manual listening, distinguishing and sorting; the workload of manual matching is large, mishearing and misremembering occur easily, and the matching effect is poor.
Disclosure of Invention
The present application provides a method for matching human voices in a multi-person scene, aiming to solve the problem of automatically matching voices to speakers in such scenes.
The application provides a method for matching voices in a multi-person scene, which comprises the following steps:
dividing audio to be matched into a plurality of sound segments;
performing voice recognition on the sound segments to obtain voice segments in the sound segments;
acquiring a video clip corresponding to the voice clip;
carrying out face detection on the video segment to obtain all predicted speakers of the voice segment;
obtaining hit information of each predicted speaker in the adjacent gray frames according to pixel difference values of the adjacent gray frames in the video clips;
and counting the number of hits of each predicted speaker in the video segment according to the hit information, wherein the predicted speaker with the largest number of hits is the target speaker of the voice segment.
Optionally, performing speech recognition on the sound segment to obtain a speech segment in the sound segment, including:
performing voice recognition on the sound fragment to obtain characters or words in the sound fragment, and forming all the characters or words of the sound fragment into a voice sequence;
counting the number of words in the voice sequence, and judging whether the number of words is larger than or equal to the product of the speech rate adjusting factor, the reference speech rate and the segment duration;
and if the number of the words is larger than or equal to the product of the speech rate adjusting factor, the reference speech rate and the segment duration, judging the sound segment to be a voice segment.
Optionally, performing face detection on the video segment to obtain all predicted speakers of the voice segment, including:
extracting a plurality of frames to be processed from the video clip to form a frame sequence to be processed;
performing face detection on the frame sequence to be processed to obtain all faces in the frame to be processed, and constructing a face sequence according to all the faces of the video segments;
marking a face area corresponding to the face sequence by using a rectangular frame, and acquiring a marking coordinate of the rectangular frame and a central coordinate of the rectangular frame;
clustering all face sequences of the video segments according to the central coordinates to obtain face categories of the video segments, wherein each category of the face categories corresponds to a predicted speaker;
and establishing a corresponding relation between the face sequence and the face category to obtain all predicted speakers of the voice segment.
Optionally, extracting a plurality of frames to be processed from the video segment to form a sequence of frames to be processed includes: taking a first frame of the video segment as a starting frame, extracting one frame from the starting frame at intervals of a preset number of frames, taking the starting frame and the extracted frame as frames to be processed, and constructing a frame sequence to be processed according to all the frames to be processed.
Optionally, the mark coordinates of the rectangular frame include an X coordinate of a pixel point at the upper left corner of the rectangular frame, a Y coordinate of a pixel point at the upper left corner of the rectangular frame, a width coordinate of the rectangular frame, and a height coordinate of the rectangular frame, the center coordinates of the rectangular frame include a center X coordinate and a center Y coordinate, the center X coordinate is one half of the width coordinate of the rectangular frame, and the center Y coordinate is one half of the height coordinate of the rectangular frame.
Optionally, obtaining hit information of each predicted speaker in the adjacent gray-scale frame according to a pixel difference value of the adjacent gray-scale frame in the video segment, includes:
converting the frame sequence to be processed into a gray frame sequence;
calculating the absolute value of the gray difference value of each pixel point of the adjacent gray frames in the gray frame sequence;
judging whether the absolute value is larger than a preset threshold value or not;
if the absolute value is larger than a preset threshold value, marking the pixel point as 1;
if the absolute value is less than or equal to a preset threshold value, marking the pixel point as 0;
obtaining a frame difference image sequence corresponding to the frame sequence to be processed according to the marking value of the pixel point;
counting the number of pixels with the mark value of 1 in each rectangular frame of each frame of the frame difference image sequence;
and if the number of the pixels is within a preset range, judging that the predicted speaker corresponding to the rectangular frame is hit once.
Optionally, counting the number of hits of each predicted speaker in the video segment, where the predicted speaker with the largest number of hits is the target speaker of the voice segment, includes:
counting the number of hits of each predicted speaker in all frames of the frame difference image sequence;
and comparing the number of hits of each predicted speaker, wherein the predicted speaker with the largest number of hits is the target speaker of the voice segment.
The method for matching human voices in a multi-person scene has the following advantages:
the multi-person scene voice matching method provided by the embodiment of the application is based on voice recognition, face detection and clustering algorithm, the voice is automatically bound to the affiliated target speaker, voice characteristic information such as voiceprints of each person does not need to be collected in advance, the workload of subsequent manual matching of the voice and the target speaker is greatly reduced, and the practicability of the visual and auditory perception technology is promoted. The method is particularly suitable for meeting scenes and can be used as a basis for automatically extracting the speech abstract and the viewpoint of each participant in the follow-up process.
Drawings
In order to explain the technical solution of the present application more clearly, the drawings needed in the embodiments are briefly described below; those skilled in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a method for matching voices in a multi-person scene according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of a method for screening a speech segment according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a predicted speaker acquisition method according to an embodiment of the present disclosure;
fig. 4 is a schematic flowchart of a predicted speaker hit method according to an embodiment of the present disclosure;
fig. 5 is a flowchart illustrating a target speaker obtaining method according to an embodiment of the present application.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic flowchart of the method for matching human voices in a multi-person scene provided in an embodiment of the present application. As shown in fig. 1, the method includes the following steps:
step S110: the audio to be matched is divided into a plurality of sound segments.
The audio to be matched can be an audio clip extracted from a section of video clip, and can also be an audio clip corresponding to the video clip.
The audio to be matched is segmented according to the constraint condition that each voice segment only belongs to a unique person, so that a unique speaker of each voice segment can be obtained in subsequent processing.
In this embodiment, the audio to be matched is divided at the troughs of the sound waveform. For example, the audio corresponding to a 10-minute conference video may be used as the audio to be matched; it is divided into a plurality of sound segments Voice_m, 1 ≤ m ≤ M (M denotes the total number of sound segments), at trough regions where the sound wave amplitude is smaller than a preset amplitude threshold and the duration is longer than a preset time threshold.
This division ensures that at most one person speaks in any sound segment Voice_m. A sound segment in which someone speaks is called a speech segment Speech_i, 0 ≤ i ≤ m. The speaker corresponding to each sound segment Voice_m is called the target speaker of Voice_m.
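As a minimal illustration of this division step, the following Python sketch splits a mono waveform at troughs whose amplitude stays below a threshold for a minimum duration. The sampling rate, amplitude threshold and minimum trough duration are illustrative assumptions; the embodiment does not prescribe specific values.

```python
# Minimal sketch of the trough-based division of step S110, assuming a mono
# waveform in `samples` at sampling rate `sr`. The amplitude threshold and the
# minimum trough duration are illustrative, not values from the embodiment.
import numpy as np

def split_on_troughs(samples, sr=16000, amp_thresh=0.02, min_trough_s=0.3):
    """Return (start, end) sample indices of the sound segments Voice_m."""
    quiet = np.abs(samples) < amp_thresh          # True where the waveform is in a trough
    min_len = int(min_trough_s * sr)              # minimum trough length in samples
    segments, seg_start, run = [], 0, 0
    for i, q in enumerate(quiet):
        if q:
            run += 1
            if run == min_len:                    # trough long enough: close the segment before it
                end = i - min_len + 1
                if end > seg_start:
                    segments.append((seg_start, end))
            if run >= min_len:
                seg_start = i + 1                 # next segment starts after the trough
        else:
            run = 0
    if seg_start < len(samples):
        segments.append((seg_start, len(samples)))
    return segments
```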
Step S120: and performing voice recognition on the sound segments to obtain the voice segments in the sound segments.
Fig. 2 is a schematic flowchart of the speech segment screening method provided in the embodiment of the present application. As shown in fig. 2, the method includes the following steps:
step S201: and performing voice recognition on the voice fragment to obtain the characters or words in the voice fragment, and forming a voice sequence by all the characters or words in the voice fragment.
Each sound segment Voice_m is converted by speech recognition technology into a sequence S_m of characters and words; the process can be denoted TransV(Voice_m) = S_m.
For example, S1 is 'now begin', S2 is 'one', S3 is 0, S4 is 'Li …', …, and S10 is 'the meeting is over'. S3 = 0 indicates that no human voice was recognized in sound segment Voice_3.
Step S202: and counting the number of words in the voice sequence, and judging whether the number of words is larger than or equal to the product of the speech rate adjusting factor, the reference speech rate and the segment duration.
The number of characters and words in the speech sequence S_m is denoted W_m^c. In this example, W_1^c = 4 and W_3^c = 0.
The speech rate adjustment factor is denoted by α and may take a value in the range (0, 1). The reference speech rate, denoted S^R, is a reference value for the number of words a person speaks per unit time. The segment duration is denoted T_m.
The method then judges whether W_m^c satisfies the condition W_m^c ≥ α · S^R · T_m.
Step S203: and if the number of the words is larger than or equal to the product of the speech rate adjusting factor, the reference speech rate and the segment duration, judging the sound segment as the voice segment.
In this embodiment, S1, S4 and S10 satisfy the condition W_m^c ≥ α · S^R · T_m, while the remaining seven speech sequences do not. The sound segments Voice_m (m = 1, 4, 10) represented by S1, S4 and S10 are therefore judged to be speech segments Speech_i (i = 1, 2, 3).
After the voice content in the sound segments represented by S1, S2, S4 and S10 has been recognized by the speech recognition algorithm, the speech rate detection of step S202 screens them further. S2 is screened out because it does not match a normal speech rate: it may be environmental noise that was recognized by mistake, or speech that does not match the reference speech rate, for example private chatting between participants at a faster rate than normal conference speech. By screening out S2 and matching voices only for S1, S4 and S10, both the matching quality and the matching efficiency are improved.
As can be seen, step S120 converts each sound segment into a sequence S_m of characters and words by speech recognition and screens out the speech segments by applying the conditional constraint to S_m. Furthermore, the recognized words can be used as material for automatic minutes and summaries of the conference afterwards.
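The screening condition of steps S201 to S203 can be sketched as follows. The `transcribe` call stands in for whatever speech recognition engine performs TransV and is a hypothetical placeholder, as are the example values of the adjustment factor and reference speech rate.

```python
# Hedged sketch of the screening condition W_m^c >= alpha * S^R * T_m used in
# steps S201-S203. `transcribe` is a hypothetical placeholder for the speech
# recognition call (TransV); alpha and ref_rate are illustrative values.
def is_speech_segment(segment_audio, duration_s, alpha=0.5, ref_rate=3.0):
    """ref_rate: reference speech rate in words per second; alpha in (0, 1)."""
    text = transcribe(segment_audio)      # hypothetical ASR call returning recognized text
    word_count = len(text.split())        # W_m^c; for Chinese text, len(text) (characters) may fit better
    return word_count >= alpha * ref_rate * duration_s

# Example: a 10 s segment with only 4 recognized words gives 4 < 0.5 * 3 * 10 = 15,
# so it is screened out, as S2 is in the embodiment above.
```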
Step S130: and acquiring a video clip corresponding to the voice clip.
The video clip is denoted Video_i, where i is the number of the video clip corresponding to Speech_i. The video clips corresponding to the speech segments S1, S4 and S10 are obtained as the 1st, 4th and 10th minutes of the video, with i = 1, 2 and 3 respectively.
Step S140: and carrying out face detection on the video segment to obtain all predicted speakers of the voice segment.
Fig. 3 is a schematic flowchart of the predicted speaker acquisition method provided in the embodiment of the present application. As shown in fig. 3, the method includes the following steps:
step S401: and extracting a plurality of frames to be processed from the video clip to form a frame sequence to be processed.
All frames of video clip Video_i are numbered; the k-th frame is denoted F_i^k, where k is the frame number, 1 ≤ k ≤ K_i, the starting frame number of Video_i is 1, and the total number of frames is K_i.
Starting from the starting frame, one frame is extracted every preset number of frames; the starting frame and the extracted frames are taken as the frames to be processed, and the frame sequence to be processed is constructed from all the frames to be processed.
The numbers of the extracted frames satisfy k = k_s + j · k_D, where k_s and k_D are integers greater than or equal to 1 that respectively denote the start frame and the frame-extraction interval, j is an integer greater than or equal to 0, and k_s + j · k_D ≤ K_i. The frame sequence to be processed that is finally formed by the extracted frames is therefore {F_i^(k_s + j·k_D)}.
In this embodiment, the first frame is used as the start frame and one frame is extracted every 2 frames to obtain the frame sequence to be processed.
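Under the reading that "one frame every 2 frames" means an extraction interval of k_D = 3, the frame numbers to be processed can be generated as in the following sketch; the interval value is an assumption.

```python
# Sketch of the frame-extraction rule k = k_s + j * k_D of step S401.
# k_start = 1 and k_step = 3 correspond to one reading of "start at the first
# frame and extract one frame every 2 frames"; the exact interval is an assumption.
def frames_to_process(total_frames, k_start=1, k_step=3):
    """Return the extracted frame numbers k_s, k_s + k_D, ... up to K_i."""
    return list(range(k_start, total_frames + 1, k_step))

# frames_to_process(10) -> [1, 4, 7, 10]
```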
Step S402: and carrying out face detection on the frame sequence to be processed to obtain all faces in the frame to be processed, and constructing a face sequence according to all the faces of the video segments.
A face detection algorithm is applied to each extracted frame F_i^k to detect the faces it contains, and a face sequence Face_k^p is constructed, where Face_k^p denotes the p-th face of the k-th frame.
Step S403: and marking the face area corresponding to the face sequence by using a rectangular frame, and acquiring the marking coordinates of the rectangular frame and the central coordinates of the rectangular frame.
In each frame image, the face region obtained by face detection is marked with a rectangular frame. In the embodiment of the present application the rectangular frame is obtained with the classical Adaboost algorithm; it extends from the eyebrows at the top to the chin at the bottom and is trimmed at the sides of the face on the left and right. The mark coordinates of the rectangular frame are recorded as (x_k^p, y_k^p, w_k^p, h_k^p), which respectively denote the X coordinate of the pixel at the upper-left corner of the rectangular frame, the Y coordinate of the pixel at the upper-left corner, the width of the rectangular frame, and the height of the rectangular frame.
The center coordinates of the rectangular frame are recorded as (cx_k^p, cy_k^p), where the center X coordinate cx_k^p is one half of the width coordinate of the rectangular frame and the center Y coordinate cy_k^p is one half of the height coordinate of the rectangular frame.
The center coordinates of the rectangular frame represent the center of the face; different faces have rectangular regions of different sizes and center coordinates at different positions.
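One possible realization of the face detection and marking of steps S402 and S403 uses OpenCV's Haar cascade detector, which is Adaboost-based in the sense referred to above; the cascade file, the detector parameters and the use of absolute center coordinates are assumptions of this sketch rather than requirements of the embodiment.

```python
# One possible realization of steps S402-S403 with OpenCV's Haar-cascade face
# detector (an Adaboost-based detector, in the spirit of the algorithm named
# above). The cascade file and detector parameters are assumptions.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame_bgr):
    """Return mark coordinates (x, y, w, h) and center coordinates per face."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    results = []
    for (x, y, w, h) in faces:
        # The text states the center as half the width and half the height,
        # i.e. relative to the upper-left corner of the box; absolute frame
        # coordinates (x + w/2, y + h/2) are used here so that face positions
        # can be compared across the frame.
        results.append({"rect": (int(x), int(y), int(w), int(h)),
                        "center": (x + w / 2.0, y + h / 2.0)})
    return results
```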
Step S404: and clustering all face sequences of the video segments according to the center coordinates to obtain the face categories of the video segments.
A K-means clustering algorithm is run on the center coordinates (cx_k^p, cy_k^p) of all face sequences recognized in the frames to be processed of video clip Video_i. The center coordinates of the rectangular frames of the same person's face differ somewhat between frames, but usually only within a certain range, whereas the center coordinates of different persons' faces differ considerably. The K cluster centers obtained by the K-means algorithm therefore indicate that video clip Video_i contains Q predicted speakers, each recorded as P_q (q ≤ Q, Q = K).
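A compact sketch of the clustering of step S404 is given below, using scikit-learn's KMeans as one convenient implementation; the way K is chosen (for example, the number of faces detected in a single frame) is an assumption.

```python
# Sketch of the K-means clustering of step S404 using scikit-learn; how K is
# chosen (e.g. the number of faces visible in a single frame) is an assumption.
import numpy as np
from sklearn.cluster import KMeans

def assign_speaker_tags(face_centers, k):
    """face_centers: (n_faces, 2) array of center coordinates; returns one label P_q per face."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(np.asarray(face_centers))
    return km.labels_   # faces sharing a label belong to the same predicted speaker
```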
Step S405: and establishing a corresponding relation between the face sequence and the face category to obtain all predicted speakers of the voice segments.
Faces belonging to the same cluster center are assigned the same speaker tag P_q; the speaker tag may be a predicted speaker name. Each face Face_k^p corresponds to a unique predicted speaker P_q, while one predicted speaker P_q corresponds to the multiple faces of the same person in multiple frames. Once this correspondence has been established, P_q can be looked up from Face_k^p, and Face_k^p can likewise be found from P_q.
In step S140, the faces in each extracted frame are detected by a face detection algorithm to obtain the face sequence, and the face rectangles and center coordinates are recorded; the speaker tags are then obtained by running a clustering algorithm on the face center coordinates. This avoids distinguishing different predicted speakers with more complex methods such as face detection combined with face tracking, keeps the workload low and improves efficiency.
Step S150: and obtaining the hit information of each predicted speaker in the adjacent gray frames according to the pixel difference value of the adjacent gray frames in the video clip.
Only one of the Q predicted speakers is the real speaker, called the target speaker; the unique target speaker is obtained by performing a hit analysis on the predicted speakers. Fig. 4 is a schematic flowchart of the predicted speaker hit method provided in the embodiment of the present application. As shown in fig. 4, the method includes the following steps:
step S501: the sequence of frames to be processed is converted into a sequence of gray frames.
The frame sequence to be processed {F_i^(k_s + j·k_D)} is converted into a gray frame sequence.
Step S502: and calculating the absolute value of the gray difference value of each pixel point of the adjacent gray frames in the gray frame sequence.
For each frame in the gray frame sequence, the absolute value of the gray difference between each pixel of the current frame and the corresponding pixel of the previous frame is calculated.
Step S503: and judging whether the absolute value is larger than a preset threshold value or not.
The preset threshold is used to judge whether the face is moving. For example, if the absolute values of the gray differences of the pixels in the lip region of a face are all greater than the preset threshold in adjacent gray frames, the lip region is judged to be moving and the probability that the predicted speaker corresponding to that face is the target speaker increases; if the face keeps moving across many adjacent gray frames, the corresponding predicted speaker is very likely to be the target speaker. The size of the preset threshold is determined by the functional requirements.
Step S504: and if the absolute value is larger than the preset threshold value, marking the pixel point as 1.
Step S505: and if the absolute value is less than or equal to the preset threshold value, marking the pixel point as 0.
Step S506: and obtaining a frame difference image sequence corresponding to the frame sequence to be processed according to the marking values of the pixel points.
In the frame difference map sequence corresponding to the frame sequence to be processed, the value of each pixel is either 0 or 1.
Step S507: and counting the number of pixels with the mark value of 1 in each rectangular frame of each frame of the frame difference image sequence.
The rectangular frame represents the face region. The number of pixels with mark value 1 inside each detected rectangular frame, denoted N_k^p, is counted to determine whether the corresponding predicted speaker is speaking.
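Steps S501 to S507 can be sketched as follows; the gray-difference threshold is an illustrative value and OpenCV is used only for convenience.

```python
# Sketch of steps S501-S507: binarize the inter-frame gray difference and count
# the marked pixels inside a face rectangle. The threshold is an illustrative value.
import cv2
import numpy as np

def motion_pixels_in_rect(prev_frame_bgr, cur_frame_bgr, rect, diff_thresh=15):
    """rect = (x, y, w, h); returns the number of pixels marked 1 inside it."""
    g0 = cv2.cvtColor(prev_frame_bgr, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(cur_frame_bgr, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(g0, g1)                      # |gray difference| per pixel
    binary = (diff > diff_thresh).astype(np.uint8)  # mark 1 where the change exceeds the threshold
    x, y, w, h = rect
    return int(binary[y:y + h, x:x + w].sum())      # pixels with mark value 1 in the face region
```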
Step S508: and judging the predicted speaker hit corresponding to the rectangular frame once according to the condition that the number of the pixels is within a preset range.
When N_min ≤ N_k^p ≤ N_max, the predicted speaker corresponding to the rectangular frame is judged to be hit once, where N_min and N_max are respectively the minimum and maximum values of the preset range. When the number of pixels N_k^p is smaller than the minimum value N_min, the face region changes little between adjacent frames and the face is judged not to be speaking; the person may, for example, be listening to someone else speak. When the number of pixels N_k^p is greater than the maximum value N_max, the face region changes too much between adjacent frames and the face is likewise judged not to be speaking; the person may, for example, be laughing, which makes the pixels of the face region vary over a wide range.
In step S150, the frame difference map sequence corresponding to the gray frame sequence is calculated, the number of pixels with frame difference value 1 in each face region is counted, and a constraint condition for a hit is set; using the face region as the constraint improves the anti-interference capability of the algorithm.
Step S160: and counting the hit times of each predicted speaker in the video segment according to the hit information, wherein the predicted speaker with the largest hit time is the target speaker of the voice segment.
The hit analysis of step S150 makes it possible to determine who the target speaker is. Fig. 5 is a schematic flowchart of the target speaker acquisition method provided in the embodiment of the present application. As shown in fig. 5, the method includes the following steps:
step S601: and counting the hit times of each predicted speaker in all frames of the frame difference image sequence.
Over the entire video clip Video_i, the speaker who is actually speaking should receive the largest number of hits, so the number of hits of each predicted speaker is counted separately over all frames of the frame difference map sequence.
Step S602: and comparing the hit times of each predicted speaker, wherein the predicted speaker with the largest hit time is the target speaker of the voice segment.
The predicted speaker with the highest total number of hits is the target speaker of Speech_i. If two predicted speakers are tied for the largest number of hits, the target speaker may be further identified from those two manually.
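The hit decision of step S508 and the counting and comparison of steps S601 and S602 can be sketched together as follows; the preset range [n_min, n_max] uses illustrative values, and the tie case is deferred to manual identification as described above.

```python
# Sketch of the hit decision (step S508) and hit counting/comparison
# (steps S601-S602). n_min and n_max are illustrative preset-range values;
# a tie is deferred to manual identification, as described above.
from collections import Counter

def pick_target_speaker(counts_per_frame, n_min=50, n_max=5000):
    """counts_per_frame: iterable of (speaker_tag, moving_pixel_count) pairs."""
    hits = Counter(tag for tag, n in counts_per_frame if n_min <= n <= n_max)
    if not hits:
        return None                                 # no predicted speaker was hit
    ranked = hits.most_common(2) + [(None, 0)]
    best, runner_up = ranked[0], ranked[1]
    if best[1] == runner_up[1]:
        return None                                 # tie between two predicted speakers
    return best[0]                                  # target speaker of the speech segment
```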
As can be seen from the above embodiments, the multi-person scene voice matching method provided in the embodiments of the present application automatically binds each voice to its target speaker based on speech recognition, face detection and a clustering algorithm, without collecting voice feature information such as voiceprints from each person in advance. This greatly reduces the workload of subsequently matching voices to target speakers by hand, promotes the practical application of audio-visual perception technology, is particularly suitable for conference scenes, and can serve as a basis for subsequently extracting each participant's speech summary and viewpoints automatically.
Since the above embodiments are described with reference to one another, different embodiments share common portions, and identical or similar portions among the embodiments in this specification may be referred to mutually; they are not described again here.
It is noted that, in this specification, relational terms such as "first" and "second," and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a circuit structure, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such circuit structure, article, or apparatus. Without further limitation, the presence of an element identified by the phrase "comprising an … …" does not exclude the presence of other like elements in a circuit structure, article or device comprising the element.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
The above-described embodiments of the present application do not limit the scope of the present application.

Claims (7)

1. A method for matching human voice in a multi-person scene is characterized by comprising the following steps:
dividing audio to be matched into a plurality of sound segments;
performing voice recognition on the sound segments to obtain voice segments in the sound segments;
acquiring a video clip corresponding to the voice clip;
carrying out face detection on the video segment to obtain all predicted speakers of the voice segment;
obtaining hit information of each predicted speaker in the adjacent gray frames according to pixel difference values of the adjacent gray frames in the video clips;
and counting the number of hits of each predicted speaker in the video segment according to the hit information, wherein the predicted speaker with the largest number of hits is the target speaker of the voice segment.
2. The method for matching human voice in multi-person scene according to claim 1, wherein performing voice recognition on the sound segments to obtain the voice segments in the sound segments comprises:
performing voice recognition on the sound fragment to obtain characters or words in the sound fragment, and forming all the characters or words of the sound fragment into a voice sequence;
counting the number of words in the voice sequence, and judging whether the number of words is larger than or equal to the product of the speech rate adjusting factor, the reference speech rate and the segment duration;
and if the number of the words is larger than or equal to the product of the speech rate adjusting factor, the reference speech rate and the segment duration, judging the sound segment to be a voice segment.
3. The method for matching human voice in multi-person scene as claimed in claim 1, wherein the step of performing face detection on the video segment to obtain all predicted speakers of the voice segment comprises:
extracting a plurality of frames to be processed from the video clip to form a frame sequence to be processed;
performing face detection on the frame sequence to be processed to obtain all faces in the frame to be processed, and constructing a face sequence according to all the faces of the video segments;
marking a face area corresponding to the face sequence by using a rectangular frame, and acquiring a marking coordinate of the rectangular frame and a central coordinate of the rectangular frame;
clustering all face sequences of the video segments according to the central coordinates to obtain face categories of the video segments, wherein each category of the face categories corresponds to a predicted speaker;
and establishing a corresponding relation between the face sequence and the face category to obtain all predicted speakers of the voice segment.
4. The multi-person scene human voice matching method according to claim 3, wherein extracting a plurality of frames to be processed from the video segment to form a sequence of frames to be processed comprises: taking a first frame of the video segment as a starting frame, extracting one frame from the starting frame at intervals of a preset number of frames, taking the starting frame and the extracted frame as frames to be processed, and constructing a frame sequence to be processed according to all the frames to be processed.
5. The method for matching human voice in a multi-person scene according to claim 3, wherein the labeled coordinates of the rectangular frame comprise an X coordinate of a pixel point at the upper left corner of the rectangular frame, a Y coordinate of a pixel point at the upper left corner of the rectangular frame, a width coordinate of the rectangular frame, and a height coordinate of the rectangular frame, the center coordinates of the rectangular frame comprise a center X coordinate and a center Y coordinate, the center X coordinate is one half of the width coordinate of the rectangular frame, and the center Y coordinate is one half of the height coordinate of the rectangular frame.
6. The method as claimed in claim 3, wherein obtaining hit information of each predicted speaker in the adjacent gray frames according to pixel difference values of the adjacent gray frames in the video segment comprises:
converting the frame sequence to be processed into a gray frame sequence;
calculating the absolute value of the gray difference value of each pixel point of the adjacent gray frames in the gray frame sequence;
judging whether the absolute value is larger than a preset threshold value or not;
if the absolute value is larger than a preset threshold value, marking the pixel point as 1;
if the absolute value is less than or equal to a preset threshold value, marking the pixel point as 0;
obtaining a frame difference image sequence corresponding to the frame sequence to be processed according to the marking value of the pixel point;
counting the number of pixels with the mark value of 1 in each rectangular frame of each frame of the frame difference image sequence;
and if the number of the pixels is within a preset range, judging that the predicted speaker corresponding to the rectangular frame is hit once.
7. The method for matching human voice in a multi-person scene as claimed in claim 6, wherein counting the number of hits of each predicted speaker in the video segment, the predicted speaker with the largest number of hits being the target speaker of the voice segment, comprises:
counting the number of hits of each predicted speaker in all frames of the frame difference image sequence;
and comparing the number of hits of each predicted speaker, wherein the predicted speaker with the largest number of hits is the target speaker of the voice segment.
CN201910918342.4A 2019-09-26 2019-09-26 Multi-person scene human voice matching method Active CN110648667B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910918342.4A CN110648667B (en) 2019-09-26 2019-09-26 Multi-person scene human voice matching method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910918342.4A CN110648667B (en) 2019-09-26 2019-09-26 Multi-person scene human voice matching method

Publications (2)

Publication Number Publication Date
CN110648667A true CN110648667A (en) 2020-01-03
CN110648667B CN110648667B (en) 2022-04-08

Family

ID=68992864

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910918342.4A Active CN110648667B (en) 2019-09-26 2019-09-26 Multi-person scene human voice matching method

Country Status (1)

Country Link
CN (1) CN110648667B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111899743A (en) * 2020-07-31 2020-11-06 斑马网络技术有限公司 Method and device for acquiring target sound, electronic equipment and storage medium
CN113450773A (en) * 2021-05-11 2021-09-28 多益网络有限公司 Video recording manuscript generation method and device, storage medium and electronic equipment
CN114299944A (en) * 2021-12-08 2022-04-08 天翼爱音乐文化科技有限公司 Video processing method, system, device and storage medium
CN116708055A (en) * 2023-06-06 2023-09-05 深圳市艾姆诗电商股份有限公司 Intelligent multimedia audiovisual image processing method, system and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103581608A (en) * 2012-07-20 2014-02-12 Polycom通讯技术(北京)有限公司 Spokesman detecting system, spokesman detecting method and audio/video conference system
CN108399923A (en) * 2018-02-01 2018-08-14 深圳市鹰硕技术有限公司 More human hairs call the turn spokesman's recognition methods and device
US20180247651A1 (en) * 2015-03-19 2018-08-30 Samsung Electronics Co., Ltd. Method and device for detecting voice activity based on image information
CN108763897A (en) * 2018-05-22 2018-11-06 平安科技(深圳)有限公司 Method of calibration, terminal device and the medium of identity legitimacy
CN109410954A (en) * 2018-11-09 2019-03-01 杨岳川 A kind of unsupervised more Speaker Identification device and method based on audio-video
CN110119711A (en) * 2019-05-14 2019-08-13 北京奇艺世纪科技有限公司 A kind of method, apparatus and electronic equipment obtaining video data personage segment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103581608A (en) * 2012-07-20 2014-02-12 Polycom通讯技术(北京)有限公司 Spokesman detecting system, spokesman detecting method and audio/video conference system
US20180247651A1 (en) * 2015-03-19 2018-08-30 Samsung Electronics Co., Ltd. Method and device for detecting voice activity based on image information
CN108399923A (en) * 2018-02-01 2018-08-14 深圳市鹰硕技术有限公司 More human hairs call the turn spokesman's recognition methods and device
CN108763897A (en) * 2018-05-22 2018-11-06 平安科技(深圳)有限公司 Method of calibration, terminal device and the medium of identity legitimacy
CN109410954A (en) * 2018-11-09 2019-03-01 杨岳川 A kind of unsupervised more Speaker Identification device and method based on audio-video
CN110119711A (en) * 2019-05-14 2019-08-13 北京奇艺世纪科技有限公司 A kind of method, apparatus and electronic equipment obtaining video data personage segment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SPYRIDON SIATRAS ET AL.: "《Visual Lip Activity Detection and Speaker Detection Using Mouth Region Intensities》", 《 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 》 *
潘鹏: "Multi-Speaker Recognition Based on Audio-Video Information Fusion in a Conference Room Environment", China Master's Theses Full-text Database (Master), Information Science and Technology *
韩冰: "Digital Audio Processing", 31 October 2018, Xidian University Press *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111899743A (en) * 2020-07-31 2020-11-06 斑马网络技术有限公司 Method and device for acquiring target sound, electronic equipment and storage medium
CN113450773A (en) * 2021-05-11 2021-09-28 多益网络有限公司 Video recording manuscript generation method and device, storage medium and electronic equipment
CN114299944A (en) * 2021-12-08 2022-04-08 天翼爱音乐文化科技有限公司 Video processing method, system, device and storage medium
CN116708055A (en) * 2023-06-06 2023-09-05 深圳市艾姆诗电商股份有限公司 Intelligent multimedia audiovisual image processing method, system and storage medium
CN116708055B (en) * 2023-06-06 2024-02-20 深圳市艾姆诗电商股份有限公司 Intelligent multimedia audiovisual image processing method, system and storage medium

Also Published As

Publication number Publication date
CN110648667B (en) 2022-04-08

Similar Documents

Publication Publication Date Title
CN110648667B (en) Multi-person scene human voice matching method
US10878824B2 (en) Speech-to-text generation using video-speech matching from a primary speaker
CN105405439B (en) Speech playing method and device
WO2020237855A1 (en) Sound separation method and apparatus, and computer readable storage medium
US10304458B1 (en) Systems and methods for transcribing videos using speaker identification
CN111128223B (en) Text information-based auxiliary speaker separation method and related device
CN103856689B (en) Character dialogue subtitle extraction method oriented to news video
CN110853646B (en) Conference speaking role distinguishing method, device, equipment and readable storage medium
US7305128B2 (en) Anchor person detection for television news segmentation based on audiovisual features
CN112088402A (en) Joint neural network for speaker recognition
WO2019140161A1 (en) Systems and methods for decomposing a video stream into face streams
US20030236663A1 (en) Mega speaker identification (ID) system and corresponding methods therefor
CN112183334B (en) Video depth relation analysis method based on multi-mode feature fusion
CN105007395A (en) Privacy processing method for continuously recording video
CN111785275A (en) Voice recognition method and device
CN112037788B (en) Voice correction fusion method
CN111341350A (en) Man-machine interaction control method and system, intelligent robot and storage medium
CN111091840A (en) Method for establishing gender identification model and gender identification method
Brutti et al. Online cross-modal adaptation for audio–visual person identification with wearable cameras
Tapu et al. DEEP-HEAR: A multimodal subtitle positioning system dedicated to deaf and hearing-impaired people
CN101827224A (en) Detection method of anchor shot in news video
CN114022923A (en) Intelligent collecting and editing system
CN112992148A (en) Method and device for recognizing voice in video
US20220335752A1 (en) Emotion recognition and notification system
Vajaria et al. Exploring co-occurence between speech and body movement for audio-guided video localization

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant