CN110648667A - Multi-person scene human voice matching method - Google Patents

Multi-person scene human voice matching method

Info

Publication number
CN110648667A
CN110648667A (application CN201910918342.4A)
Authority
CN
China
Prior art keywords
voice
frame
speaker
sequence
predicted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910918342.4A
Other languages
Chinese (zh)
Other versions
CN110648667B (en)
Inventor
唐立军
杨家全
周年荣
张林山
李浩涛
杨洋
冯勇
严玉廷
李孟阳
罗恩博
梁俊宇
袁兴宇
李响
何婕
栾思平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Electric Power Research Institute of Yunnan Power System Ltd
Original Assignee
Electric Power Research Institute of Yunnan Power System Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Electric Power Research Institute of Yunnan Power System Ltd filed Critical Electric Power Research Institute of Yunnan Power System Ltd
Priority to CN201910918342.4A priority Critical patent/CN110648667B/en
Publication of CN110648667A publication Critical patent/CN110648667A/en
Application granted granted Critical
Publication of CN110648667B publication Critical patent/CN110648667B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/0308 Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the present application provides a method for matching human voices in a multi-person scene, which comprises the following steps: dividing the audio to be matched into a plurality of sound segments; performing speech recognition on the sound segments to obtain the speech segments among them; acquiring the video clip corresponding to each speech segment; performing face detection on the video clip to obtain all predicted speakers of the speech segment; obtaining hit information for each predicted speaker in adjacent gray frames according to the pixel difference values of adjacent gray frames in the video clip; and counting the number of hits of each predicted speaker in the video clip according to the hit information, the predicted speaker with the largest number of hits being the target speaker of the speech segment. The method automatically binds each voice to its target speaker, greatly reduces the workload of subsequently matching voices to speakers by hand, and promotes the practical application of audio-visual perception technology.

Description

Multi-person scene human voice matching method
Technical Field
The application relates to the technical field of voice matching, in particular to a method for matching voices in a multi-person scene.
Background
With the continuous development of natural language processing technology, the speech recognition function that converts sound into text keeps improving. In some multi-person conversation scenes, however, such as multi-person conference minutes and interview summaries, the identity of the speaker must be recognized in addition to converting sound into text, so that each voice can be matched to its speaker and the conference minutes or interview summary can be recorded completely.
In the related art, different speakers can be distinguished by voiceprint recognition. However, voiceprint recognition requires collecting a section of each speaker's voice in advance to extract that speaker's voice features as the basis for recognition, which is not feasible in some multi-person conversation scenes. For audio and video that need voice matching, the voice content is therefore mostly matched to specific speakers by manual listening, distinguishing and sorting; the workload of manual matching is large, mishearing and misremembering occur easily, and the matching effect is poor.
Disclosure of Invention
The present application provides a method for matching human voices in a multi-person scene, aiming to solve the problem of automatically matching voices to speakers in such scenes.
The application provides a method for matching voices in a multi-person scene, which comprises the following steps:
dividing audio to be matched into a plurality of sound segments;
performing voice recognition on the sound segments to obtain voice segments in the sound segments;
acquiring a video clip corresponding to the voice clip;
carrying out face detection on the video segment to obtain all predicted speakers of the voice segment;
obtaining hit information of each predicted speaker in the adjacent gray frames according to pixel difference values of the adjacent gray frames in the video clips;
and counting the number of hits of each predicted speaker in the video segment according to the hit information, wherein the predicted speaker with the largest number of hits is the target speaker of the voice segment.
Optionally, performing speech recognition on the sound segment to obtain a speech segment in the sound segment, including:
performing voice recognition on the sound fragment to obtain characters or words in the sound fragment, and forming all the characters or words of the sound fragment into a voice sequence;
counting the number of words in the voice sequence, and judging whether the number of words is larger than or equal to the product of the speech rate adjusting factor, the reference speech rate and the segment duration;
and if the number of the words is larger than or equal to the product of the speech rate adjusting factor, the reference speech rate and the segment duration, judging the sound segment to be a voice segment.
Optionally, performing face detection on the video segment to obtain all predicted speakers of the voice segment, including:
extracting a plurality of frames to be processed from the video clip to form a frame sequence to be processed;
performing face detection on the frame sequence to be processed to obtain all faces in the frame to be processed, and constructing a face sequence according to all the faces of the video segments;
marking a face area corresponding to the face sequence by using a rectangular frame, and acquiring a marking coordinate of the rectangular frame and a central coordinate of the rectangular frame;
clustering all face sequences of the video segments according to the central coordinates to obtain face categories of the video segments, wherein each category of the face categories corresponds to a predicted speaker;
and establishing a corresponding relation between the face sequence and the face category to obtain all predicted speakers of the voice segment.
Optionally, extracting a plurality of frames to be processed from the video segment to form a sequence of frames to be processed includes: taking a first frame of the video segment as a starting frame, extracting one frame from the starting frame at intervals of a preset number of frames, taking the starting frame and the extracted frame as frames to be processed, and constructing a frame sequence to be processed according to all the frames to be processed.
Optionally, the mark coordinates of the rectangular frame include an X coordinate of a pixel point at the upper left corner of the rectangular frame, a Y coordinate of a pixel point at the upper left corner of the rectangular frame, a width coordinate of the rectangular frame, and a height coordinate of the rectangular frame, the center coordinates of the rectangular frame include a center X coordinate and a center Y coordinate, the center X coordinate is one half of the width coordinate of the rectangular frame, and the center Y coordinate is one half of the height coordinate of the rectangular frame.
Optionally, obtaining hit information of each predicted speaker in the adjacent gray-scale frame according to a pixel difference value of the adjacent gray-scale frame in the video segment, includes:
converting the frame sequence to be processed into a gray frame sequence;
calculating the absolute value of the gray difference value of each pixel point of the adjacent gray frames in the gray frame sequence;
judging whether the absolute value is larger than a preset threshold value or not;
if the absolute value is larger than a preset threshold value, marking the pixel point as 1;
if the absolute value is less than or equal to a preset threshold value, marking the pixel point as 0;
obtaining a frame difference image sequence corresponding to the frame sequence to be processed according to the marking value of the pixel point;
counting the number of pixels with the mark value of 1 in each rectangular frame of each frame of the frame difference image sequence;
and if the number of the pixels is within a preset range, judging that the predicted speaker corresponding to the rectangular frame is hit once.
Optionally, counting the number of hits of each predicted speaker in the video segment, where the predicted speaker with the largest number of hits is the target speaker of the voice segment, includes:
counting the number of hits of each predicted speaker in all frames of the frame difference image sequence;
and comparing the number of hits of each predicted speaker, wherein the predicted speaker with the largest number of hits is the target speaker of the voice segment.
The method for matching human voices in a multi-person scene has the following advantages:
the multi-person scene voice matching method provided by the embodiment of the application is based on voice recognition, face detection and clustering algorithm, the voice is automatically bound to the affiliated target speaker, voice characteristic information such as voiceprints of each person does not need to be collected in advance, the workload of subsequent manual matching of the voice and the target speaker is greatly reduced, and the practicability of the visual and auditory perception technology is promoted. The method is particularly suitable for meeting scenes and can be used as a basis for automatically extracting the speech abstract and the viewpoint of each participant in the follow-up process.
Drawings
In order to explain the technical solution of the present application more clearly, the drawings needed in the embodiments are briefly described below; those skilled in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a method for matching voices in a multi-person scene according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of a method for screening a speech segment according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a predicted speaker acquisition method according to an embodiment of the present disclosure;
fig. 4 is a schematic flowchart of a predicted speaker hit method according to an embodiment of the present disclosure;
fig. 5 is a flowchart illustrating a target speaker obtaining method according to an embodiment of the present application.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic flowchart of the method for matching human voices in a multi-person scene provided in an embodiment of the present application. As shown in fig. 1, the method includes the following steps:
step S110: the audio to be matched is divided into a plurality of sound segments.
The audio to be matched can be an audio clip extracted from a section of video clip, and can also be an audio clip corresponding to the video clip.
The audio to be matched is segmented according to the constraint condition that each voice segment only belongs to a unique person, so that a unique speaker of each voice segment can be obtained in subsequent processing.
In this embodiment, the audio to be matched is divided at the troughs of the sound waveform. For example, the audio corresponding to a 10-minute conference video may be used as the audio to be matched; it is divided into a plurality of sound segments Voice_m, 1 ≤ m ≤ M (M denotes the total number of sound segments), at trough regions where the sound wave amplitude is smaller than a preset amplitude threshold and the duration is longer than a preset time threshold.
This division ensures that at most one person speaks in any sound segment Voice_m. A sound segment in which someone speaks is called a speech segment Speech_i, 0 ≤ i ≤ m. The speaker corresponding to each sound segment Voice_m is called the target speaker of Voice_m.
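As a minimal illustration of this division step, the following Python sketch splits a mono waveform at troughs whose amplitude stays below a threshold for a minimum duration. The sampling rate, amplitude threshold and minimum trough duration are illustrative assumptions; the embodiment does not prescribe specific values.

```python
# Minimal sketch of the trough-based division of step S110, assuming a mono
# waveform in `samples` at sampling rate `sr`. The amplitude threshold and the
# minimum trough duration are illustrative, not values from the embodiment.
import numpy as np

def split_on_troughs(samples, sr=16000, amp_thresh=0.02, min_trough_s=0.3):
    """Return (start, end) sample indices of the sound segments Voice_m."""
    quiet = np.abs(samples) < amp_thresh          # True where the waveform is in a trough
    min_len = int(min_trough_s * sr)              # minimum trough length in samples
    segments, seg_start, run = [], 0, 0
    for i, q in enumerate(quiet):
        if q:
            run += 1
            if run == min_len:                    # trough long enough: close the segment before it
                end = i - min_len + 1
                if end > seg_start:
                    segments.append((seg_start, end))
            if run >= min_len:
                seg_start = i + 1                 # next segment starts after the trough
        else:
            run = 0
    if seg_start < len(samples):
        segments.append((seg_start, len(samples)))
    return segments
```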
Step S120: and performing voice recognition on the sound segments to obtain the voice segments in the sound segments.
Fig. 2 is a schematic flowchart of the speech segment screening method provided in the embodiment of the present application. As shown in fig. 2, the method includes the following steps:
step S201: and performing voice recognition on the voice fragment to obtain the characters or words in the voice fragment, and forming a voice sequence by all the characters or words in the voice fragment.
Each sound segment Voice_m is converted by speech recognition technology into a sequence S_m of characters and words; the process can be denoted TransV(Voice_m) = S_m.
For example, S1 is 'now begin', S2 is 'one', S3 is 0, S4 is 'Li …', …, and S10 is 'the meeting is over'. S3 = 0 indicates that no human voice was recognized in sound segment Voice_3.
Step S202: and counting the number of words in the voice sequence, and judging whether the number of words is larger than or equal to the product of the speech rate adjusting factor, the reference speech rate and the segment duration.
The number of characters and words in the speech sequence S_m is denoted W_m^c. In this example, W_1^c = 4 and W_3^c = 0.
The speech rate adjustment factor is denoted by α and may take a value in the range (0, 1). The reference speech rate, denoted S^R, is a reference value for the number of words a person speaks per unit time. The segment duration is denoted T_m.
The method then judges whether W_m^c satisfies the condition W_m^c ≥ α · S^R · T_m.
Step S203: and if the number of the words is larger than or equal to the product of the speech rate adjusting factor, the reference speech rate and the segment duration, judging the sound segment as the voice segment.
In this embodiment, S1, S4 and S10 satisfy the condition W_m^c ≥ α · S^R · T_m, while the remaining seven speech sequences do not. The sound segments Voice_m (m = 1, 4, 10) represented by S1, S4 and S10 are therefore judged to be speech segments Speech_i (i = 1, 2, 3).
After the voice content in the sound segments represented by S1, S2, S4 and S10 has been recognized by the speech recognition algorithm, the speech rate detection of step S202 screens them further. S2 is screened out because it does not match a normal speech rate: it may be environmental noise that was recognized by mistake, or speech that does not match the reference speech rate, for example private chatting between participants at a faster rate than normal conference speech. By screening out S2 and matching voices only for S1, S4 and S10, both the matching quality and the matching efficiency are improved.
As can be seen, step S120 converts each sound segment into a sequence S_m of characters and words by speech recognition and screens out the speech segments by applying the conditional constraint to S_m. Furthermore, the recognized words can be used as material for automatic minutes and summaries of the conference afterwards.
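The screening condition of steps S201 to S203 can be sketched as follows. The `transcribe` call stands in for whatever speech recognition engine performs TransV and is a hypothetical placeholder, as are the example values of the adjustment factor and reference speech rate.

```python
# Hedged sketch of the screening condition W_m^c >= alpha * S^R * T_m used in
# steps S201-S203. `transcribe` is a hypothetical placeholder for the speech
# recognition call (TransV); alpha and ref_rate are illustrative values.
def is_speech_segment(segment_audio, duration_s, alpha=0.5, ref_rate=3.0):
    """ref_rate: reference speech rate in words per second; alpha in (0, 1)."""
    text = transcribe(segment_audio)      # hypothetical ASR call returning recognized text
    word_count = len(text.split())        # W_m^c; for Chinese text, len(text) (characters) may fit better
    return word_count >= alpha * ref_rate * duration_s

# Example: a 10 s segment with only 4 recognized words gives 4 < 0.5 * 3 * 10 = 15,
# so it is screened out, as S2 is in the embodiment above.
```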
Step S130: and acquiring a video clip corresponding to the voice clip.
The video clip is denoted Video_i, where i is the number of the video clip corresponding to Speech_i. The video clips corresponding to the speech segments S1, S4 and S10 are obtained as the 1st, 4th and 10th minutes of the video, with i = 1, 2 and 3 respectively.
Step S140: and carrying out face detection on the video segment to obtain all predicted speakers of the voice segment.
Fig. 3 is a schematic flowchart of the predicted speaker acquisition method provided in the embodiment of the present application. As shown in fig. 3, the method includes the following steps:
step S401: and extracting a plurality of frames to be processed from the video clip to form a frame sequence to be processed.
All frames of video clip Video_i are numbered; the k-th frame is denoted F_i^k, where k is the frame number, 1 ≤ k ≤ K_i, the starting frame number of Video_i is 1, and the total number of frames is K_i.
Starting from the starting frame, one frame is extracted every preset number of frames; the starting frame and the extracted frames are taken as the frames to be processed, and the frame sequence to be processed is constructed from all the frames to be processed.
The numbers of the extracted frames satisfy k = k_s + j · k_D, where k_s and k_D are integers greater than or equal to 1 that respectively denote the start frame and the frame-extraction interval, j is an integer greater than or equal to 0, and k_s + j · k_D ≤ K_i. The frame sequence to be processed that is finally formed by the extracted frames is therefore {F_i^(k_s + j·k_D)}.
In this embodiment, the first frame is used as the start frame and one frame is extracted every 2 frames to obtain the frame sequence to be processed.
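Under the reading that "one frame every 2 frames" means an extraction interval of k_D = 3, the frame numbers to be processed can be generated as in the following sketch; the interval value is an assumption.

```python
# Sketch of the frame-extraction rule k = k_s + j * k_D of step S401.
# k_start = 1 and k_step = 3 correspond to one reading of "start at the first
# frame and extract one frame every 2 frames"; the exact interval is an assumption.
def frames_to_process(total_frames, k_start=1, k_step=3):
    """Return the extracted frame numbers k_s, k_s + k_D, ... up to K_i."""
    return list(range(k_start, total_frames + 1, k_step))

# frames_to_process(10) -> [1, 4, 7, 10]
```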
Step S402: and carrying out face detection on the frame sequence to be processed to obtain all faces in the frame to be processed, and constructing a face sequence according to all the faces of the video segments.
A face detection algorithm is applied to each extracted frame F_i^k to detect the faces it contains, and a face sequence Face_k^p is constructed, where Face_k^p denotes the p-th face of the k-th frame.
Step S403: and marking the face area corresponding to the face sequence by using a rectangular frame, and acquiring the marking coordinates of the rectangular frame and the central coordinates of the rectangular frame.
In each frame image, the face region obtained by face detection is marked with a rectangular frame. In the embodiment of the present application the rectangular frame is obtained with the classical Adaboost algorithm; it extends from the eyebrows at the top to the chin at the bottom and is trimmed at the sides of the face on the left and right. The mark coordinates of the rectangular frame are recorded as (x_k^p, y_k^p, w_k^p, h_k^p), which respectively denote the X coordinate of the pixel at the upper-left corner of the rectangular frame, the Y coordinate of the pixel at the upper-left corner, the width of the rectangular frame, and the height of the rectangular frame.
The center coordinates of the rectangular frame are recorded as (cx_k^p, cy_k^p), where the center X coordinate cx_k^p is one half of the width coordinate of the rectangular frame and the center Y coordinate cy_k^p is one half of the height coordinate of the rectangular frame.
The center coordinates of the rectangular frame represent the center of the face; different faces have rectangular regions of different sizes and center coordinates at different positions.
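One possible realization of the face detection and marking of steps S402 and S403 uses OpenCV's Haar cascade detector, which is Adaboost-based in the sense referred to above; the cascade file, the detector parameters and the use of absolute center coordinates are assumptions of this sketch rather than requirements of the embodiment.

```python
# One possible realization of steps S402-S403 with OpenCV's Haar-cascade face
# detector (an Adaboost-based detector, in the spirit of the algorithm named
# above). The cascade file and detector parameters are assumptions.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame_bgr):
    """Return mark coordinates (x, y, w, h) and center coordinates per face."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    results = []
    for (x, y, w, h) in faces:
        # The text states the center as half the width and half the height,
        # i.e. relative to the upper-left corner of the box; absolute frame
        # coordinates (x + w/2, y + h/2) are used here so that face positions
        # can be compared across the frame.
        results.append({"rect": (int(x), int(y), int(w), int(h)),
                        "center": (x + w / 2.0, y + h / 2.0)})
    return results
```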
Step S404: and clustering all face sequences of the video segments according to the center coordinates to obtain the face categories of the video segments.
A K-means clustering algorithm is run on the center coordinates (cx_k^p, cy_k^p) of all face sequences recognized in the frames to be processed of video clip Video_i. The center coordinates of the rectangular frames of the same person's face differ somewhat between frames, but usually only within a certain range, whereas the center coordinates of different persons' faces differ considerably. The K cluster centers obtained by the K-means algorithm therefore indicate that video clip Video_i contains Q predicted speakers, each recorded as P_q (q ≤ Q, Q = K).
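A compact sketch of the clustering of step S404 is given below, using scikit-learn's KMeans as one convenient implementation; the way K is chosen (for example, the number of faces detected in a single frame) is an assumption.

```python
# Sketch of the K-means clustering of step S404 using scikit-learn; how K is
# chosen (e.g. the number of faces visible in a single frame) is an assumption.
import numpy as np
from sklearn.cluster import KMeans

def assign_speaker_tags(face_centers, k):
    """face_centers: (n_faces, 2) array of center coordinates; returns one label P_q per face."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(np.asarray(face_centers))
    return km.labels_   # faces sharing a label belong to the same predicted speaker
```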
Step S405: and establishing a corresponding relation between the face sequence and the face category to obtain all predicted speakers of the voice segments.
Faces belonging to the same cluster center are assigned the same speaker tag P_q; the speaker tag may be a predicted speaker name. Each face Face_k^p corresponds to a unique predicted speaker P_q, while one predicted speaker P_q corresponds to the multiple faces of the same person in multiple frames. Once this correspondence has been established, P_q can be looked up from Face_k^p, and Face_k^p can likewise be found from P_q.
In step S140, the faces in each extracted frame are detected by a face detection algorithm to obtain the face sequence, and the face rectangles and center coordinates are recorded; the speaker tags are then obtained by running a clustering algorithm on the face center coordinates. This avoids distinguishing different predicted speakers with more complex methods such as face detection combined with face tracking, keeps the workload low and improves efficiency.
Step S150: and obtaining the hit information of each predicted speaker in the adjacent gray frames according to the pixel difference value of the adjacent gray frames in the video clip.
Only one of the Q predicted speakers is the real speaker, called the target speaker; the unique target speaker is obtained by performing a hit analysis on the predicted speakers. Fig. 4 is a schematic flowchart of the predicted speaker hit method provided in the embodiment of the present application. As shown in fig. 4, the method includes the following steps:
step S501: the sequence of frames to be processed is converted into a sequence of gray frames.
The frame sequence to be processed {F_i^(k_s + j·k_D)} is converted into a gray frame sequence.
Step S502: and calculating the absolute value of the gray difference value of each pixel point of the adjacent gray frames in the gray frame sequence.
For each frame in the gray frame sequence, the absolute value of the gray difference between each pixel of the current frame and the corresponding pixel of the previous frame is calculated.
Step S503: and judging whether the absolute value is larger than a preset threshold value or not.
The preset threshold is used to judge whether the face is moving. For example, if the absolute values of the gray differences of the pixels in the lip region of a face are all greater than the preset threshold in adjacent gray frames, the lip region is judged to be moving and the probability that the predicted speaker corresponding to that face is the target speaker increases; if the face keeps moving across many adjacent gray frames, the corresponding predicted speaker is very likely to be the target speaker. The size of the preset threshold is determined by the functional requirements.
Step S504: and if the absolute value is larger than the preset threshold value, marking the pixel point as 1.
Step S505: and if the absolute value is less than or equal to the preset threshold value, marking the pixel point as 0.
Step S506: and obtaining a frame difference image sequence corresponding to the frame sequence to be processed according to the marking values of the pixel points.
In the frame difference map sequence corresponding to the frame sequence to be processed, the value of each pixel is either 0 or 1.
Step S507: and counting the number of pixels with the mark value of 1 in each rectangular frame of each frame of the frame difference image sequence.
The rectangular frame represents the face region. The number of pixels with mark value 1 inside each detected rectangular frame, denoted N_k^p, is counted to determine whether the corresponding predicted speaker is speaking.
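Steps S501 to S507 can be sketched as follows; the gray-difference threshold is an illustrative value and OpenCV is used only for convenience.

```python
# Sketch of steps S501-S507: binarize the inter-frame gray difference and count
# the marked pixels inside a face rectangle. The threshold is an illustrative value.
import cv2
import numpy as np

def motion_pixels_in_rect(prev_frame_bgr, cur_frame_bgr, rect, diff_thresh=15):
    """rect = (x, y, w, h); returns the number of pixels marked 1 inside it."""
    g0 = cv2.cvtColor(prev_frame_bgr, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(cur_frame_bgr, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(g0, g1)                      # |gray difference| per pixel
    binary = (diff > diff_thresh).astype(np.uint8)  # mark 1 where the change exceeds the threshold
    x, y, w, h = rect
    return int(binary[y:y + h, x:x + w].sum())      # pixels with mark value 1 in the face region
```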
Step S508: and judging the predicted speaker hit corresponding to the rectangular frame once according to the condition that the number of the pixels is within a preset range.
When N_min ≤ N_k^p ≤ N_max, the predicted speaker corresponding to the rectangular frame is judged to be hit once, where N_min and N_max are respectively the minimum and maximum values of the preset range. When the number of pixels N_k^p is smaller than the minimum value N_min, the face region changes little between adjacent frames and the face is judged not to be speaking; the person may, for example, be listening to someone else speak. When the number of pixels N_k^p is greater than the maximum value N_max, the face region changes too much between adjacent frames and the face is likewise judged not to be speaking; the person may, for example, be laughing, which makes the pixels of the face region vary over a wide range.
In step S150, the frame difference map sequence corresponding to the gray frame sequence is calculated, the number of pixels with frame difference value 1 in each face region is counted, and a constraint condition for a hit is set; using the face region as the constraint improves the anti-interference capability of the algorithm.
Step S160: and counting the hit times of each predicted speaker in the video segment according to the hit information, wherein the predicted speaker with the largest hit time is the target speaker of the voice segment.
The hit analysis of step S150 makes it possible to determine who the target speaker is. Fig. 5 is a schematic flowchart of the target speaker acquisition method provided in the embodiment of the present application. As shown in fig. 5, the method includes the following steps:
step S601: and counting the hit times of each predicted speaker in all frames of the frame difference image sequence.
Over the entire video clip Video_i, the speaker who is actually speaking should receive the largest number of hits, so the number of hits of each predicted speaker is counted separately over all frames of the frame difference map sequence.
Step S602: and comparing the hit times of each predicted speaker, wherein the predicted speaker with the largest hit time is the target speaker of the voice segment.
The predicted speaker with the highest total number of hits is the target speaker of Speech_i. If two predicted speakers are tied for the largest number of hits, the target speaker may be further identified from those two manually.
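The hit decision of step S508 and the counting and comparison of steps S601 and S602 can be sketched together as follows; the preset range [n_min, n_max] uses illustrative values, and the tie case is deferred to manual identification as described above.

```python
# Sketch of the hit decision (step S508) and hit counting/comparison
# (steps S601-S602). n_min and n_max are illustrative preset-range values;
# a tie is deferred to manual identification, as described above.
from collections import Counter

def pick_target_speaker(counts_per_frame, n_min=50, n_max=5000):
    """counts_per_frame: iterable of (speaker_tag, moving_pixel_count) pairs."""
    hits = Counter(tag for tag, n in counts_per_frame if n_min <= n <= n_max)
    if not hits:
        return None                                 # no predicted speaker was hit
    ranked = hits.most_common(2) + [(None, 0)]
    best, runner_up = ranked[0], ranked[1]
    if best[1] == runner_up[1]:
        return None                                 # tie between two predicted speakers
    return best[0]                                  # target speaker of the speech segment
```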
As can be seen from the above embodiments, the multi-person scene voice matching method provided in the embodiments of the present application automatically binds each voice to its target speaker based on speech recognition, face detection and a clustering algorithm, without collecting voice feature information such as voiceprints from each person in advance. This greatly reduces the workload of subsequently matching voices to target speakers by hand, promotes the practical application of audio-visual perception technology, is particularly suitable for conference scenes, and can serve as a basis for subsequently extracting each participant's speech summary and viewpoints automatically.
Since the above embodiments are described with reference to one another, different embodiments share common portions, and identical or similar portions among the embodiments in this specification may be referred to mutually; they are not described again here.
It is noted that, in this specification, relational terms such as "first" and "second," and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a circuit structure, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such circuit structure, article, or apparatus. Without further limitation, the presence of an element identified by the phrase "comprising an … …" does not exclude the presence of other like elements in a circuit structure, article or device comprising the element.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
The above-described embodiments of the present application do not limit the scope of the present application.

Claims (7)

1. A method for matching human voice in a multi-person scene is characterized by comprising the following steps:
dividing audio to be matched into a plurality of sound segments;
performing voice recognition on the sound segments to obtain voice segments in the sound segments;
acquiring a video clip corresponding to the voice clip;
carrying out face detection on the video segment to obtain all predicted speakers of the voice segment;
obtaining hit information of each predicted speaker in the adjacent gray frames according to pixel difference values of the adjacent gray frames in the video clips;
and counting the number of hits of each predicted speaker in the video segment according to the hit information, wherein the predicted speaker with the largest number of hits is the target speaker of the voice segment.
2. The method for matching human voice in multi-person scene according to claim 1, wherein performing voice recognition on the sound segments to obtain the voice segments in the sound segments comprises:
performing voice recognition on the sound fragment to obtain characters or words in the sound fragment, and forming all the characters or words of the sound fragment into a voice sequence;
counting the number of words in the voice sequence, and judging whether the number of words is larger than or equal to the product of the speech rate adjusting factor, the reference speech rate and the segment duration;
and if the number of the words is larger than or equal to the product of the speech rate adjusting factor, the reference speech rate and the segment duration, judging the sound segment to be a voice segment.
3. The method for matching human voice in multi-person scene as claimed in claim 1, wherein the step of performing face detection on the video segment to obtain all predicted speakers of the voice segment comprises:
extracting a plurality of frames to be processed from the video clip to form a frame sequence to be processed;
performing face detection on the frame sequence to be processed to obtain all faces in the frame to be processed, and constructing a face sequence according to all the faces of the video segments;
marking a face area corresponding to the face sequence by using a rectangular frame, and acquiring a marking coordinate of the rectangular frame and a central coordinate of the rectangular frame;
clustering all face sequences of the video segments according to the central coordinates to obtain face categories of the video segments, wherein each category of the face categories corresponds to a predicted speaker;
and establishing a corresponding relation between the face sequence and the face category to obtain all predicted speakers of the voice segment.
4. The multi-person scene human voice matching method according to claim 3, wherein extracting a plurality of frames to be processed from the video segment to form a sequence of frames to be processed comprises: taking a first frame of the video segment as a starting frame, extracting one frame from the starting frame at intervals of a preset number of frames, taking the starting frame and the extracted frame as frames to be processed, and constructing a frame sequence to be processed according to all the frames to be processed.
5. The method for matching human voice in a multi-person scene according to claim 3, wherein the labeled coordinates of the rectangular frame comprise an X coordinate of a pixel point at the upper left corner of the rectangular frame, a Y coordinate of a pixel point at the upper left corner of the rectangular frame, a width coordinate of the rectangular frame, and a height coordinate of the rectangular frame, the center coordinates of the rectangular frame comprise a center X coordinate and a center Y coordinate, the center X coordinate is one half of the width coordinate of the rectangular frame, and the center Y coordinate is one half of the height coordinate of the rectangular frame.
6. The method as claimed in claim 3, wherein obtaining hit information of each predicted speaker in the adjacent gray frames according to pixel difference values of the adjacent gray frames in the video segment comprises:
converting the frame sequence to be processed into a gray frame sequence;
calculating the absolute value of the gray difference value of each pixel point of the adjacent gray frames in the gray frame sequence;
judging whether the absolute value is larger than a preset threshold value or not;
if the absolute value is larger than a preset threshold value, marking the pixel point as 1;
if the absolute value is less than or equal to a preset threshold value, marking the pixel point as 0;
obtaining a frame difference image sequence corresponding to the frame sequence to be processed according to the marking value of the pixel point;
counting the number of pixels with the mark value of 1 in each rectangular frame of each frame of the frame difference image sequence;
and if the number of the pixels is within a preset range, judging that the predicted speaker corresponding to the rectangular frame is hit once.
7. The method for matching human voice in a multi-person scene as claimed in claim 6, wherein counting the number of hits of each predicted speaker in the video segment, the predicted speaker with the largest number of hits being the target speaker of the voice segment, comprises:
counting the number of hits of each predicted speaker in all frames of the frame difference image sequence;
and comparing the number of hits of each predicted speaker, wherein the predicted speaker with the largest number of hits is the target speaker of the voice segment.
CN201910918342.4A 2019-09-26 2019-09-26 Multi-person scene human voice matching method Active CN110648667B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910918342.4A CN110648667B (en) 2019-09-26 2019-09-26 Multi-person scene human voice matching method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910918342.4A CN110648667B (en) 2019-09-26 2019-09-26 Multi-person scene human voice matching method

Publications (2)

Publication Number Publication Date
CN110648667A true CN110648667A (en) 2020-01-03
CN110648667B CN110648667B (en) 2022-04-08

Family

ID=68992864

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910918342.4A Active CN110648667B (en) 2019-09-26 2019-09-26 Multi-person scene human voice matching method

Country Status (1)

Country Link
CN (1) CN110648667B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111899743A (en) * 2020-07-31 2020-11-06 斑马网络技术有限公司 Method and device for acquiring target sound, electronic equipment and storage medium
CN113450773A (en) * 2021-05-11 2021-09-28 多益网络有限公司 Video recording manuscript generation method and device, storage medium and electronic equipment
CN114299944A (en) * 2021-12-08 2022-04-08 天翼爱音乐文化科技有限公司 Video processing method, system, device and storage medium
CN116708055A (en) * 2023-06-06 2023-09-05 深圳市艾姆诗电商股份有限公司 Intelligent multimedia audiovisual image processing method, system and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103581608A (en) * 2012-07-20 2014-02-12 Polycom通讯技术(北京)有限公司 Spokesman detecting system, spokesman detecting method and audio/video conference system
CN108399923A (en) * 2018-02-01 2018-08-14 深圳市鹰硕技术有限公司 More human hairs call the turn spokesman's recognition methods and device
US20180247651A1 (en) * 2015-03-19 2018-08-30 Samsung Electronics Co., Ltd. Method and device for detecting voice activity based on image information
CN108763897A (en) * 2018-05-22 2018-11-06 平安科技(深圳)有限公司 Method of calibration, terminal device and the medium of identity legitimacy
CN109410954A (en) * 2018-11-09 2019-03-01 杨岳川 A kind of unsupervised more Speaker Identification device and method based on audio-video
CN110119711A (en) * 2019-05-14 2019-08-13 北京奇艺世纪科技有限公司 A kind of method, apparatus and electronic equipment obtaining video data personage segment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103581608A (en) * 2012-07-20 2014-02-12 Polycom通讯技术(北京)有限公司 Spokesman detecting system, spokesman detecting method and audio/video conference system
US20180247651A1 (en) * 2015-03-19 2018-08-30 Samsung Electronics Co., Ltd. Method and device for detecting voice activity based on image information
CN108399923A (en) * 2018-02-01 2018-08-14 深圳市鹰硕技术有限公司 More human hairs call the turn spokesman's recognition methods and device
CN108763897A (en) * 2018-05-22 2018-11-06 平安科技(深圳)有限公司 Method of calibration, terminal device and the medium of identity legitimacy
CN109410954A (en) * 2018-11-09 2019-03-01 杨岳川 A kind of unsupervised more Speaker Identification device and method based on audio-video
CN110119711A (en) * 2019-05-14 2019-08-13 北京奇艺世纪科技有限公司 A kind of method, apparatus and electronic equipment obtaining video data personage segment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SPYRIDON SIATRAS ET AL.: "《Visual Lip Activity Detection and Speaker Detection Using Mouth Region Intensities》", 《 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 》 *
潘鹏: "Multi-Speaker Recognition Based on Audio-Video Information Fusion in a Conference Room Environment", China Master's Theses Full-text Database (Master), Information Science and Technology *
韩冰: "Digital Audio Processing", 31 October 2018, Xidian University Press *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111899743A (en) * 2020-07-31 2020-11-06 斑马网络技术有限公司 Method and device for acquiring target sound, electronic equipment and storage medium
CN113450773A (en) * 2021-05-11 2021-09-28 多益网络有限公司 Video recording manuscript generation method and device, storage medium and electronic equipment
CN114299944A (en) * 2021-12-08 2022-04-08 天翼爱音乐文化科技有限公司 Video processing method, system, device and storage medium
CN116708055A (en) * 2023-06-06 2023-09-05 深圳市艾姆诗电商股份有限公司 Intelligent multimedia audiovisual image processing method, system and storage medium
CN116708055B (en) * 2023-06-06 2024-02-20 深圳市艾姆诗电商股份有限公司 Intelligent multimedia audiovisual image processing method, system and storage medium

Also Published As

Publication number Publication date
CN110648667B (en) 2022-04-08

Similar Documents

Publication Publication Date Title
CN110648667B (en) Multi-person scene human voice matching method
US10878824B2 (en) Speech-to-text generation using video-speech matching from a primary speaker
CN105405439B (en) Speech playing method and device
WO2020237855A1 (en) Sound separation method and apparatus, and computer readable storage medium
US10304458B1 (en) Systems and methods for transcribing videos using speaker identification
CN111128223B (en) Text information-based auxiliary speaker separation method and related device
CN103856689B (en) Character dialogue subtitle extraction method oriented to news video
CN110853646B (en) Conference speaking role distinguishing method, device, equipment and readable storage medium
US7305128B2 (en) Anchor person detection for television news segmentation based on audiovisual features
CN112088402A (en) Joint neural network for speaker recognition
WO2019140161A1 (en) Systems and methods for decomposing a video stream into face streams
US20030236663A1 (en) Mega speaker identification (ID) system and corresponding methods therefor
CN112183334B (en) Video depth relation analysis method based on multi-mode feature fusion
CN105007395A (en) Privacy processing method for continuously recording video
CN111785275A (en) Voice recognition method and device
CN112037788B (en) Voice correction fusion method
CN111341350A (en) Man-machine interaction control method and system, intelligent robot and storage medium
CN111091840A (en) Method for establishing gender identification model and gender identification method
Brutti et al. Online cross-modal adaptation for audio–visual person identification with wearable cameras
Tapu et al. DEEP-HEAR: A multimodal subtitle positioning system dedicated to deaf and hearing-impaired people
CN101827224A (en) Detection method of anchor shot in news video
CN114022923A (en) Intelligent collecting and editing system
CN112992148A (en) Method and device for recognizing voice in video
US20220335752A1 (en) Emotion recognition and notification system
Vajaria et al. Exploring co-occurence between speech and body movement for audio-guided video localization

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant