WO2021120190A1 - Data processing method and apparatus, electronic device, and storage medium - Google Patents


Info

Publication number
WO2021120190A1
Authority
WO
WIPO (PCT)
Prior art keywords
person
target
data
video data
audio data
Prior art date
Application number
PCT/CN2019/127087
Other languages
French (fr)
Chinese (zh)
Inventor
Zhao Liang (赵亮)
Original Assignee
Shenzhen Huantai Technology Co., Ltd. (深圳市欢太科技有限公司)
Guangdong Oppo Mobile Telecommunications Corp., Ltd. (Oppo广东移动通信有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Huantai Technology Co., Ltd. (深圳市欢太科技有限公司) and Guangdong Oppo Mobile Telecommunications Corp., Ltd. (Oppo广东移动通信有限公司)
Priority to PCT/CN2019/127087 priority Critical patent/WO2021120190A1/en
Priority to CN201980100983.7A priority patent/CN114556469A/en
Publication of WO2021120190A1 publication Critical patent/WO2021120190A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems

Definitions

  • This application relates to simultaneous interpretation technology, and in particular to a data processing method, apparatus, electronic device, and storage medium.
  • In a simultaneous interpretation scenario, when a speaker speaks, the first terminal uses a voice collection module (such as a microphone or microphone array) to collect the spoken content in real time, i.e., the audio data.
  • A communication connection can be established between the first terminal and a server for simultaneous interpretation; the first terminal sends the collected audio data to the server, which can thus obtain the audio data in real time.
  • At the same time, the second terminal uses a video collection module (such as a camera) to shoot video of the speaker in real time, i.e., to collect the video data.
  • A communication connection can likewise be established between the second terminal and the server; the second terminal sends the collected video data to the server, which can obtain the video data in real time.
  • With the first terminal collecting audio data and the second terminal collecting video data, the server can obtain the target data.
  • The image recognition model may be obtained by pre-training a specific neural network; it may be used to recognize video data and determine the person in the video data who corresponds to the target person's facial image, i.e., the speaker in the video data.
  • In an embodiment, the method may further include the following.
  • The voice collection module may be a multi-channel microphone array, so that sound source localization can be used to determine the position of the sound source, i.e., the position of the target person (the speaker) in the real scene. Specifically, determining the sound source location corresponding to the audio data includes:
  • The voiceprint database is queried according to the target voiceprint feature, and the speaker corresponding to the audio data is determined.
  • The video collection module may be a camera; collecting video data with it includes shooting video of a specific location in the corresponding real scene through the camera, the captured video containing at least one speaker.
  • The target person may move around, and the mark for the target person (such as a preset icon or an outline frame around the person) should change as the target person's position changes. Here, target tracking can be used to follow the target person across the image frames of subsequent video data, the changed target position can be determined immediately, and the mark can be added accordingly, finally achieving the effect that the mark follows the change of the target person's position.
  • A mark is added to the played video data for the target speaker according to the target position, and the text corresponding to what the speaker said is presented in the mark.
  • Step 509: Synthesize speech according to the target-language text, and play the synthesized speech.
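The "mark follows the target" behaviour described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the per-frame target positions, box size, and caption are hypothetical toy values that a real system would obtain from a target-tracking model.

```python
# Sketch: re-anchor the mark (an outline box with caption text) to the
# tracked target position in every frame, so the mark follows the speaker.

def mark_for_frame(target_pos, box_size=(120, 160)):
    """Return the outline-box rectangle (x, y, w, h) centred on the target."""
    x, y = target_pos
    w, h = box_size
    return (x - w // 2, y - h // 2, w, h)

def render_marks(tracked_positions, caption):
    """Produce one mark per frame, following the tracked target position."""
    return [{"frame": i, "box": mark_for_frame(p), "text": caption}
            for i, p in enumerate(tracked_positions)]

# Toy tracked positions for three consecutive frames.
marks = render_marks([(300, 200), (320, 205), (360, 210)], "Hello everyone")
```

As the toy positions drift rightward, each frame's box is recomputed around the new centre, which is the effect the embodiment describes.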

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A data processing method and apparatus, a server and a storage medium. The method comprises: acquiring target data (301), the target data comprising video data and audio data; using the target data to perform a person recognition operation on a target person, the target person corresponding to a speaker involved in the target data (302); and adding into the video data a mark for the target person, the mark and recognized text determined on the basis of the audio data being displayed while the video data is played back (303).

Description

Data processing method, apparatus, electronic device, and storage medium

Technical Field

This application relates to simultaneous interpretation technology, and in particular to a data processing method, apparatus, electronic device, and storage medium.

Background

With the continuous development and maturation of artificial intelligence (AI) technology, products that apply AI to common everyday problems keep emerging. Among them, machine simultaneous interpretation (also known as machine SI or AI simultaneous interpretation) combines technologies such as automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS), and is widely used in conferences, talk shows, and similar scenarios, replacing or partially replacing human interpreters to achieve simultaneous interpretation (SI).

In related machine SI systems, speech is recognized automatically, machine translation converts the recognized source-language text into target-language text, the translated result is displayed directly on a screen, and the target-language text can also be converted into speech and broadcast. However, because the speaker's speech is merely displayed and played synchronously, viewers cannot tell who is speaking and find it difficult to understand the content in light of the speaker's identity.
Summary of the Invention

To solve the above technical problems, embodiments of this application provide a data processing method, apparatus, electronic device, and storage medium.

An embodiment of this application provides a data processing method, including:

acquiring target data, the target data including video data and audio data;

using the target data to perform a person recognition operation for a target person, the target person corresponding to the speaker in the target data; and

adding a mark for the target person to the video data, the mark and the recognized text determined based on the audio data being presented when the video data is played.

An embodiment of this application further provides a data processing apparatus, including:

an acquiring unit configured to acquire target data, the target data including video data and audio data;

a first processing unit configured to use the target data to perform a person recognition operation for a target person, the target person corresponding to the speaker in the target data; and

a second processing unit configured to add a mark for the target person to the video data, the mark and the recognized text determined based on the audio data being presented when the video data is played.

An embodiment of this application further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements the steps of any of the foregoing data processing methods.

An embodiment of this application further provides a storage medium storing computer instructions which, when executed by a processor, implement the steps of any of the foregoing data processing methods.

With the data processing method, apparatus, electronic device, and storage medium provided in the embodiments of this application, target data including video data and audio data is acquired; the target data is used to perform a person recognition operation for a target person, the target person corresponding to the speaker in the target data; and a mark for the target person is added to the video data, the mark and the recognized text determined based on the audio data being presented when the video data is played. In this way, the target person corresponding to the recognized text can be determined, and what the target person says (i.e., the recognized text) is marked against that person, making it easy for the user to match the speaker (the target person) with what is said and to understand the content in light of the speaker's identity, thereby accurately helping the user understand the content and improving the user experience.
Brief Description of the Drawings

Figure 1 is a schematic diagram of the system architecture in which a related-art simultaneous interpretation method is applied;

Figure 2 is a schematic flowchart of a related-art simultaneous interpretation method;

Figure 3 is a schematic flowchart of a data processing method according to an embodiment of this application;

Figure 4 is a schematic diagram of a mark according to an embodiment of this application;

Figure 5 is another schematic flowchart of a data processing method according to an embodiment of this application;

Figure 6 is a schematic diagram of the composition of a data processing apparatus according to an embodiment of this application;

Figure 7 is a schematic diagram of the composition of an electronic device according to an embodiment of this application.

Detailed Description

The application is described in further detail below with reference to the drawings and specific embodiments.
Figure 1 is a schematic diagram of the system architecture in which a related-art simultaneous interpretation method is applied. As shown in Figure 1, the system may include a machine SI server, a speech processing server, audience mobile terminals, a personal computer (PC) client, and a display screen.

In practical applications, a lecturer can give a conference presentation through the PC client. During the presentation, the PC client collects the lecturer's audio data and sends it to the machine SI server, which recognizes the audio data through the speech processing server to obtain a recognition result. The machine SI server can send the recognition result to the PC client, which casts it onto the display screen; it can also send the recognition result to the audience mobile terminals (specifically, the recognition result in whichever language each user requires) for display, so that the lecturer's speech is translated into the language the user needs and shown to the user.

The recognition result may include at least one of the following: recognized text in the same language as the audio data (denoted the first recognized text), translated text in another language obtained by translating the first recognized text (denoted the second recognized text), and audio data generated based on the second recognized text.

With reference to the simultaneous interpretation method shown in Figure 2, the following describes in detail how the system collects audio data, recognizes it to obtain a recognition result, and sends the recognition result to the PC client.
The PC client may be provided with a voice collection module (such as a microphone) that collects the lecturer's audio data during a speech and sends it to the machine SI server.

The machine SI server performs speech recognition on the audio data through the speech processing server to obtain source-language text, i.e., the first recognized text above, and performs machine translation on the source-language text to obtain target-language text, i.e., the second recognized text above.

The machine SI server casts the recognition result onto the display screen for display through the PC client; here, the recognition result may include at least one of the first recognized text and the second recognized text.

The machine SI server may also send the recognition result to the audience mobile terminals for display on their screens.

The machine SI server may also synthesize speech from the second recognized text through the speech processing server and play the synthesized audio data through the earphones of the audience mobile terminals.

The above simultaneous interpretation method can realize speech recognition and translation. However, in multi-speaker scenarios such as interviews and meetings, each speaker has a different background and role (e.g., interviewer or interviewee), and what is said often reflects the speaker's own viewpoint and position. If the interpreted speech content could be clearly matched to its speaker, combining the speaker's role and position would greatly help listeners understand the content. The above method, however, does not provide the correspondence between speakers and speech content; the audience must analyze and determine that correspondence themselves, which makes understanding more difficult.
Based on this, in the various embodiments of this application, target data is acquired, the target data including video data and audio data; the target data is used to perform a person recognition operation for a target person, the target person corresponding to the speaker in the target data; and a mark for the target person is added to the video data, the mark and the recognized text determined based on the audio data (which may be the first or second recognized text above) being presented when the video data is played. In this way, the target person (i.e., the speaker) corresponding to the recognized text can be determined, and what the target person says (i.e., the recognized text) is marked against that person, so that the user can match the target person with what is said and understand the content in light of the target person's identity, thereby accurately helping the user understand the content and improving the user experience.
An embodiment of this application provides a data processing method. Figure 3 is a schematic flowchart of the data processing method of this embodiment; as shown in Figure 3, the method includes:

Step 301: Acquire target data; the target data includes video data and audio data.

Here, the video data may be presented while the audio data is played; that is, in a simultaneous interpretation scene, the audio data and the video data may be played at the same time.

Step 302: Use the target data to perform a person recognition operation for a target person;

that is, use the target data to perform the corresponding person recognition operation to determine the target person corresponding to the target data and the target person's position in the video data,

where the target person corresponds to the speaker in the target data.

Step 303: Add a mark for the target person to the video data,

where the mark and the recognized text determined based on the audio data are presented when the video data is played.

Specifically, this means that while the video data is being played, the mark and the recognized text determined based on the audio data are presented in the video data.

That is, the data processing method can be applied to simultaneous interpretation scenarios, and is especially suitable for multi-speaker simultaneous interpretation scenarios such as interviews and multi-person meetings.
Specifically, in a simultaneous interpretation scenario, when a speaker speaks, the first terminal (the PC client shown in Figure 1) uses a voice collection module (such as a microphone or microphone array) to collect the spoken content in real time, obtaining the audio data. A communication connection can be established between the first terminal and a server for simultaneous interpretation; the first terminal sends the collected audio data to that server, which can thus obtain the audio data in real time. At the same time, the second terminal uses a video collection module (such as a camera) to shoot video of the speaker in real time, i.e., to collect the video data. A communication connection can likewise be established between the second terminal and the server; the second terminal sends the collected video data to the server, which can obtain the video data in real time. With the first terminal collecting audio data and the second terminal collecting video data, the server can obtain the target data.
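How the server might assemble target data from the two real-time streams can be sketched as follows. The stream representation (timestamped audio chunks and video frames) and the nearest-timestamp pairing rule are assumptions for illustration only; the application does not prescribe a transport or alignment scheme.

```python
import bisect

def align_streams(audio_chunks, video_frames):
    """Pair each timestamped audio chunk with the video frame closest in
    time, yielding one 'target data' record per audio chunk.

    audio_chunks: list of (timestamp, audio_payload), sorted by time
    video_frames: list of (timestamp, frame_payload), sorted by time
    """
    times = [t for t, _ in video_frames]
    target_data = []
    for a_time, a_payload in audio_chunks:
        i = bisect.bisect_left(times, a_time)
        # Consider the frame just before and just after the audio timestamp,
        # and keep whichever is nearer.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        j = min(candidates, key=lambda k: abs(times[k] - a_time))
        target_data.append({"audio": a_payload, "video": video_frames[j][1]})
    return target_data

# Toy streams: two audio chunks, three video frames.
pairs = align_streams([(0.0, "a0"), (1.0, "a1")],
                      [(0.1, "f0"), (0.9, "f1"), (1.4, "f2")])
```

Nearest-timestamp pairing is only one possible policy; a production system would also have to handle clock skew between the two terminals and network jitter.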
The simultaneous interpretation scenario may adopt the system architecture shown in Figure 1. The data processing method of the embodiments of this application may be applied to an electronic device, which may be a device newly added to the architecture of Figure 1, or an improved version of a device already in that architecture (such as the machine SI server, the speech processing server, or an audience mobile terminal in Figure 1) capable of implementing the method of the embodiments. The electronic device may be a server, a terminal held by a user, or the like.

Specifically, in practical applications, the electronic device may be a server that acquires the target data, processes it with the data processing method provided in the embodiments of this application, and obtains a processing result. The processing result may be displayed by a terminal held by a user through the terminal's human-computer interaction interface, or cast by the server onto a display screen for presentation. The processing result may include the video data with the marks added.

The electronic device may also be a server that has, or is connected to, a human-computer interaction interface, in which case the processing result may be displayed on that interface.

Here, the server may be a server newly added to the architecture of Figure 1 to implement the method of this application, or an improved version of the speech processing server or machine SI server in that architecture.

The electronic device may also be a terminal held by a user, which acquires the target data, processes it with the data processing method provided in the embodiments of this application to obtain a processing result, and displays the processing result through its own human-computer interaction interface.

Here, the terminal held by the user may be a terminal newly added to the architecture of Figure 1 that can implement the method of this application, or an improved version of the audience mobile terminal in that architecture. The terminal held by the user may be a PC, a mobile phone, or the like.

It should be noted that the server or the terminal held by the user may also be provided with, or connected to, a voice collection module and a video collection module, and obtain the target data by collecting audio data and video data through those modules.
In the embodiments of this application, the data processing method is applied to simultaneous interpretation scenarios such as interviews and meetings, and the audio data and the video data relate to the same interview or the same meeting.

As the interview or meeting proceeds, the audio data keeps changing, and the recognized text changes with it. In a simultaneous interpretation scene involving multiple participants, the target person (i.e., the speaker) may also change, so the recognized text corresponding to the target person keeps changing as the audio data changes.

In practical applications, the person recognition operation is used to determine the target person corresponding to the target data and the target person's position in the video data.
Specifically, the person recognition operation includes:

identifying the target person according to the audio data; and

identifying the target person's position according to the target data;

and adding the mark for the target person to the video data includes:

adding the mark for the target person to the video data according to the target position.
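The two identification steps above can be sketched end to end. This is a toy illustration only: the feature vectors, cosine-similarity matching, and the per-frame face-detection results are hypothetical stand-ins for the trained voiceprint and image recognition models the embodiments describe.

```python
# Sketch: identify the speaker from the audio via a voiceprint database,
# then look up that person's detected face position in a video frame.

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def identify_speaker(audio_feature, voiceprint_db):
    """Return the person ID whose stored voiceprint best matches the audio."""
    return max(voiceprint_db,
               key=lambda pid: cosine(audio_feature, voiceprint_db[pid]))

def locate_person(person_id, frame_faces):
    """frame_faces maps person ID -> (x, y) face position in the frame."""
    return frame_faces.get(person_id)

# Toy enrolled voiceprints and toy face-detection results for one frame.
voiceprint_db = {"p1": [0.9, 0.1, 0.0], "p2": [0.1, 0.8, 0.3]}
frame_faces = {"p1": (120, 80), "p2": (420, 90)}

speaker = identify_speaker([0.85, 0.15, 0.05], voiceprint_db)
position = locate_person(speaker, frame_faces)
```

The returned position is the target position at which the mark would then be drawn.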
In practical applications, the target person and the target position can be obtained by performing person recognition on the video data and on the audio data collected through a microphone.

Specifically, identifying the target person according to the audio data includes:

determining the target voiceprint feature corresponding to the audio data, and querying a voiceprint database according to the target voiceprint feature to determine the target person corresponding to the audio data, where the voiceprint database includes at least one voiceprint model and the person corresponding to each voiceprint model; and

identifying the target person's position according to the target data includes:

determining the facial image corresponding to the target person from an image database, performing image recognition on at least one frame of the video data according to the facial image, and determining the target person's position in the video data.
Here, the audio data can be collected through a microphone.

Here, the voiceprint database includes at least one voiceprint model and the person information corresponding to each voiceprint model;

the image database includes at least one person's facial image and the person information corresponding to each facial image;

the person information includes an identity document (ID), and the same person uses the same ID in both the voiceprint database and the image database. Thus, by querying the image database according to the determined target person, the facial image corresponding to the target person can be found.
Here, the electronic device may use a preset voiceprint recognition model to perform voiceprint recognition on the audio data and obtain the corresponding target voiceprint feature, then query the voiceprint database according to the target voiceprint feature to determine the speaker corresponding to the audio data, i.e., the target person.

The voiceprint recognition model may be obtained by pre-training a specific neural network and is used to determine the speaker corresponding to audio data.

Here, the electronic device may use a preset image recognition model to perform image recognition on at least one frame of the video data and determine which person in the video data corresponds to the target person's facial image, i.e., determine the speaker, that is, the target person.

The image recognition model may be obtained by pre-training a specific neural network; it can be used to recognize video data and determine the person in the video data who corresponds to the target person's facial image, i.e., the speaker in the video data.

In practical applications, in order to determine the target person in the video data, an image database can be obtained in advance, so that the target person can be determined from the images in the video data.
Based on this, in an embodiment, the method may further include:
collecting a facial image of at least one person, together with the person corresponding to each of the collected facial images (recorded by the person's ID);
performing image recognition on each facial image to extract each person's facial image features;
storing each person's ID in an image database in association with the corresponding facial image features.
The image database may thus store each person's ID and the corresponding facial image features.
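The enrollment and lookup flow described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the feature vectors are hypothetical stand-ins for what a pre-trained face recognition model would extract, and faces are compared with cosine similarity.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Image database: person ID -> facial image features (enrollment step).
image_database = {
    "person_A": [0.9, 0.1, 0.3],  # hypothetical toy feature vectors; a real
    "person_B": [0.2, 0.8, 0.5],  # system would extract them with a model
}

def identify_person(query_features, database, threshold=0.8):
    """Query the database; return the best-matching person ID, or None."""
    best_id, best_sim = None, threshold
    for person_id, features in database.items():
        sim = cosine_similarity(query_features, features)
        if sim > best_sim:
            best_id, best_sim = person_id, sim
    return best_id
```

A facial image cropped from a frame of the video data would first pass through the feature extractor; `identify_person` then resolves the speaker's ID, mirroring the ID-keyed lookup the embodiment describes.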
Here, a first spatial coordinate system may be preset for the video data; the position of the target person in the first spatial coordinate system, determined from the target person's position in the corresponding frame image of the video data, serves as the target position.
Specifically, each frame of the video data is an image of the same shape (e.g., square), and the first spatial coordinate system has a mapping relationship with the image. Based on the target person's position in the image, the target position of the target person in the first spatial coordinate system can be determined; that is, the target position of the target person in the video data can be understood as the position of the target person in the first spatial coordinate system.
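As an illustration of the frame-to-coordinate-system mapping, assume (hypothetically; the application does not fix this choice) that the first spatial coordinate system is simply the normalized image plane. A pixel position in a frame then maps to it by dividing out the frame size.

```python
def to_first_coordinate_system(pixel_x, pixel_y, frame_width, frame_height):
    """Map a pixel position in a frame image to the (assumed normalized)
    first spatial coordinate system, where both axes run from 0.0 to 1.0."""
    return (pixel_x / frame_width, pixel_y / frame_height)

# A target person detected at pixel (960, 540) in a 1920x1080 frame
# maps to the center of the first spatial coordinate system.
target_position = to_first_coordinate_system(960, 540, 1920, 1080)
```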
In practical applications, in order to determine the target person corresponding to the audio data, a voiceprint database may be obtained in advance, so that the target person can be determined from the target voiceprint feature corresponding to the audio data.
Based on this, in an embodiment, the method further includes:
collecting the voice of at least one person, together with the ID corresponding to each of the at least one person;
performing voiceprint recognition on the voice of each person to determine a voiceprint model of each person's voice;
storing each person's ID in a voiceprint database in association with the corresponding voiceprint model.
The voiceprint database may thus store each person's ID and the corresponding voiceprint model.
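The voiceprint enrollment above can be sketched in the same spirit. As a loud assumption, each person's voiceprint model is reduced here to the mean of toy per-utterance feature vectors, and matching uses Euclidean distance; a real system would use a trained speaker-recognition model.

```python
import math

def build_voiceprint_model(utterance_features):
    """Average per-utterance feature vectors into one voiceprint model."""
    n = len(utterance_features)
    dims = len(utterance_features[0])
    return [sum(u[d] for u in utterance_features) / n for d in range(dims)]

# Voiceprint database: person ID -> voiceprint model (enrollment step).
voiceprint_database = {
    "person_A": build_voiceprint_model([[1.0, 0.0], [0.8, 0.2]]),
    "person_B": build_voiceprint_model([[0.0, 1.0], [0.2, 0.8]]),
}

def match_speaker(query_features, database):
    """Return the person ID whose voiceprint model is nearest to the query."""
    def distance(model):
        return math.sqrt(sum((q - m) ** 2 for q, m in zip(query_features, model)))
    return min(database, key=lambda pid: distance(database[pid]))
```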
In practical applications, the target person and the target position can be obtained by performing person recognition on the audio data collected by a microphone array.
Specifically, identifying the target person according to the audio data includes:
determining the target voiceprint feature corresponding to the audio data;
querying a voiceprint database with the target voiceprint feature to determine the target person corresponding to the audio data, the voiceprint database including at least one voiceprint feature and the person corresponding to each of the at least one voiceprint feature.
Identifying the target position of the target person according to the target data includes:
determining the sound source position corresponding to the audio data;
determining the target position of the target person in the video data according to the sound source position.
Here, the sound source position may be determined in combination with the position of the voice collection module that collects the audio data.
The voice collection module may be a multi-channel microphone array, so that sound source localization can be used to determine the sound source position, i.e., the position of the target person (the speaker) in the real scene. Specifically, determining the sound source position corresponding to the audio data includes:
acquiring the sound data of each channel of the (multi-channel) microphone array;
performing a position calculation based on the sound data of each channel to obtain the sound source position.
Here, the localization algorithm used in the position calculation based on the sound data of each channel (i.e., the sound source localization) may be any localization algorithm, which is not limited here; for example, a multi-element array localization algorithm may be used.
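One minimal localization sketch, under the assumption of a two-microphone array (the application leaves the algorithm open): estimate the time difference of arrival (TDOA) between the two channels by brute-force cross-correlation. Given the microphone spacing and the speed of sound, a bearing could then be derived from the recovered lag.

```python
def estimate_tdoa(ch1, ch2, max_lag):
    """Return the lag (in samples) at which ch2 best correlates with ch1;
    a positive lag means ch2 is a delayed copy of ch1."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, x in enumerate(ch1):
            j = i + lag
            if 0 <= j < len(ch2):
                score += x * ch2[j]
        if score > best_score:
            best_score, best_lag = score, lag
    return best_lag

# Channel 2 receives the same pulse two samples later than channel 1.
channel_1 = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0]
channel_2 = [0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
lag = estimate_tdoa(channel_1, channel_2, max_lag=4)
```

With more than two channels, pairwise lags of this kind are what a multi-element array localization algorithm would combine into a source position.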
Specifically, determining the target position of the target person in the video data according to the sound source position includes:
determining, based on a position mapping relationship, the target position in the video data that corresponds to the sound source position; the position mapping relationship characterizes the position mapping between the real scene and the scene in the video data.
In practical applications, in order to determine the target person's corresponding position in the video data (i.e., the target position) from the target person's position in the real scene (i.e., the sound source position), a position mapping relationship may be constructed in advance.
Based on this, in an embodiment, the method further includes:
constructing a second spatial coordinate system according to the real scene corresponding to the video data;
establishing a correspondence between the second spatial coordinate system and the first spatial coordinate system (i.e., the first spatial coordinate system constructed for the video data as described above), so as to obtain the position mapping relationship.
The sound source position characterizes the position of the target person in the real scene; determining, based on the position mapping relationship, the position in the video data that corresponds to the sound source position amounts to determining the target position of the target person in the video data.
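A position mapping of this kind can be sketched as a calibrated linear transform between the two coordinate systems. The scale and offset values below are hypothetical calibration constants, not taken from the application; in practice they would come from calibrating the camera against the real scene.

```python
def make_position_mapping(scale_x, scale_y, offset_x, offset_y):
    """Build a mapping from second (real-scene) coordinates to first
    (video) coordinates as a simple affine transform."""
    def mapping(real_x, real_y):
        return (real_x * scale_x + offset_x, real_y * scale_y + offset_y)
    return mapping

# Hypothetical calibration: 1 meter in the scene spans 0.2 of the
# normalized video frame, with the scene origin at the frame center.
real_to_video = make_position_mapping(0.2, 0.2, 0.5, 0.5)
target_position = real_to_video(1.0, -1.0)  # sound source at (1 m, -1 m)
```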
In practical applications, the target person and the target position can be obtained by performing person recognition on the video data and on the audio data collected through the microphone array.
Specifically, the person recognition operation includes:
identifying the target person and the target position of the target person according to the video data and the audio data;
and adding a mark for the target person to the video data includes:
adding a mark for the target person to the video data according to the target position.
Here, identifying the target person and the target position of the target person according to the video data and the audio data includes:
performing image recognition on at least one frame of the video data to determine at least one first person and the person position of each of the at least one first person in the video data;
determining the target voiceprint feature corresponding to the audio data, querying a voiceprint database with the target voiceprint feature to determine the second person corresponding to the audio data, and determining, according to the audio data, the sound source position corresponding to the audio data;
mapping the person position of the at least one first person and the sound source position into a preset spatial coordinate system;
determining a first person position and a first sound source position that, once mapped into the preset spatial coordinate system, satisfy a preset distance condition; determining the target person from the first person corresponding to the first person position and the second person corresponding to the first sound source position; and taking the position of the target person in the preset spatial coordinate system as the target position.
Here, the first person characterizes a person presented in the video data, and the second person characterizes the speaker corresponding to the audio data.
Here, performing image recognition on at least one frame of the video data to determine the first person and the person position of the first person in the video data may refer to the person recognition operation performed with the video data described above, and is not repeated here.
Likewise, determining the target voiceprint feature corresponding to the audio data, querying the voiceprint database with the target voiceprint feature to determine the second person, and determining the sound source position according to the audio data may refer to the person recognition operation performed with the audio data described above, and is not repeated here.
That is, in the embodiments of this application, the person recognition result obtained from the video data (i.e., the first person and the first person's person position in the video data) is combined with the person recognition result obtained from the audio data (i.e., the second person and the sound source position corresponding to the second person) to determine the target person and the target position. By combining these two recognition results, the error of person recognition can be reduced and the recognition accuracy improved (specifically, both the accuracy of identifying the target person and the accuracy of identifying the target position).
Here, the first person position and the first sound source position that, once mapped into the preset spatial coordinate system, satisfy the preset distance condition include:
a first person position and a first sound source position that coincide in the preset spatial coordinate system, or a first person position and a first sound source position whose mutual distance is smaller than a preset distance threshold.
That is, after the person position of the at least one first person and the sound source position are mapped into the same coordinate system, a first person position (one of the person positions of the at least one first person) and a first sound source position at the same (or a nearby) location are determined. If only one sound source position was calculated, the first sound source position is that calculated position; if different localization algorithms yielded different sound source positions, the first sound source position is one of those positions.
The first person and the second person at the same (or a nearby) location are then used to determine the target person. Here, "nearby" means the distance between the person position and the sound source position is smaller than the preset distance threshold.
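The matching step can be sketched as a nearest-pair search under a distance threshold. The coordinates below are illustrative values in an assumed shared coordinate system.

```python
import math

def find_matching_pair(person_positions, source_position, max_distance):
    """Return the index of the person position closest to the sound source
    position, provided the distance is below max_distance; otherwise None."""
    best_idx, best_dist = None, max_distance
    for idx, (px, py) in enumerate(person_positions):
        dist = math.hypot(px - source_position[0], py - source_position[1])
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

# Two first persons mapped into the preset coordinate system; the sound
# source position lies close to the second one, which is therefore matched.
person_positions = [(0.2, 0.2), (0.7, 0.6)]
matched = find_matching_pair(person_positions, (0.72, 0.58), max_distance=0.1)
```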
Here, if the distance between a person position and a sound source position mapped into the preset spatial coordinate system does not satisfy the preset condition (i.e., the positions neither coincide nor lie close to each other), this may be caused by the played audio data and video data being out of synchronization; therefore, the audio data and the video data may first be synchronized. Specifically, the audio data may correspond to a first time axis and the video data to a second time axis, and the method may further include: aligning the audio data and the video data according to the first time axis and the second time axis, so that the audio data and the video data are synchronized. For the synchronized audio data and video data, after determining the first person and the corresponding person position, and the second person and the corresponding sound source position, the person position and the sound source position are mapped into the same coordinate system (e.g., the preset spatial coordinate system), and one person is selected as the target person from the first person and the second person mapped to the same (or a nearby) location in the preset spatial coordinate system.
In addition, each position in the preset spatial coordinate system may further correspond to time information, the time information characterizing when a person spoke at the corresponding position. Accordingly, for a first person position and a first sound source position that satisfy the preset distance condition in the preset spatial coordinate system, it may further be determined whether the time corresponding to the first person position and the time corresponding to the first sound source position coincide. That is, the method may include: for the audio data and the video data, after determining the first person and the corresponding person position, and the second person and the corresponding sound source position, mapping the person position and the sound source position into the preset spatial coordinate system, and selecting as the target person one of the first person and the second person that are mapped to the same (or a nearby) location and whose corresponding times coincide. This can further reduce the error of person recognition and improve the accuracy of identifying the target person and the target position.
Here, the preset spatial coordinate system may be a newly created spatial coordinate system, in which case mapping the person position and the sound source position into the preset spatial coordinate system means mapping the person position and the sound source position into the newly created spatial coordinate system.
It should be noted that the newly created spatial coordinate system has a correspondence with the real scene, so that the sound source position can be mapped into it, and a correspondence with the scene in the video data, so that the person position can be mapped into it. The position of the target person in the preset spatial coordinate system is taken as the target position, which also corresponds to the target person's position in the video data.
The preset spatial coordinate system may also be the first spatial coordinate system set for the video data as described above, in which case mapping the person position and the sound source position into the preset spatial coordinate system means mapping the sound source position into the first spatial coordinate system. Correspondingly, taking the position of the target person in the preset spatial coordinate system as the target position means taking the position of the target person in the first spatial coordinate system as the target position.
Specifically, determining the target person from the first person corresponding to the first person position and the second person corresponding to the first sound source position includes:
if the first person corresponding to the first person position is the same as the second person corresponding to the first sound source position, taking the first person corresponding to the first person position (i.e., the second person corresponding to the first sound source position) as the target person;
if the first person corresponding to the first person position differs from the second person corresponding to the first sound source position, obtaining a first weight and a second weight, the first weight characterizing the credibility of a target person determined from the video data and the second weight characterizing the credibility of a target person determined from the audio data; weighting the recognition result of the first person corresponding to the first person position by the first weight, and weighting the recognition result of the second person corresponding to the first sound source position by the second weight; and selecting, according to the weighted results, one of the first person corresponding to the first person position and the second person corresponding to the first sound source position as the target person.
The recognition result of the first person corresponding to the first person position may include a person ID and a result value (characterizing the likelihood of being the corresponding person);
the recognition result of the second person corresponding to the first sound source position may likewise include a person ID and a result value.
Here, the first weight and the second weight may be predetermined by developers (for example, by running multiple tests of the accuracy of image recognition and voiceprint recognition in advance and determining the weights from the test results) and stored in the electronic device.
Here, determining whether the first person and the second person are the same means determining whether the person ID of the first person and the person ID of the second person are the same.
For example, if both the first person and the second person are person A, person A is directly taken as the target person.
If the first person is person A and the second person is person B, the first weight and the second weight need to be obtained; the recognition result of person A (including person A's ID and the corresponding result value) is weighted by the first weight, and the recognition result of person B (including person B's ID and the corresponding result value) is weighted by the second weight; the weighted result of person A is then compared with the weighted result of person B. If person A's weighted result is greater than person B's, person A is taken as the target person; otherwise, person B is taken as the target person.
Here, it is considered that image recognition on at least one frame of the video data may yield multiple recognition results for a certain person, and voiceprint recognition on the audio data may likewise yield multiple recognition results.
Correspondingly, weighting the first person by the first weight and the second person by the second weight, and selecting one of the first person and the second person as the target person according to the weighted results, includes:
weighting each of at least two first recognition results by the first weight to obtain a weighted result for each first recognition result;
weighting each of at least two second recognition results by the second weight to obtain a weighted result for each second recognition result;
if the at least two first recognition results and the at least two second recognition results do not share any person (i.e., none of the person IDs in the at least two first recognition results matches any person ID in the at least two second recognition results), selecting, according to the weighted result of each first recognition result and of each second recognition result, the person with the largest weighted result from the at least two first recognition results and the at least two second recognition results as the target person;
if the at least two first recognition results and the at least two second recognition results do share a person (i.e., some person ID in the at least two first recognition results matches a person ID in the at least two second recognition results), adding the weighted results for the same person to obtain a weighted result for each person, and selecting, according to each person's weighted result, the person with the largest weighted result from the at least two first recognition results and the at least two second recognition results as the target person.
The first recognition result above is a result obtained for a certain person by performing image recognition on at least one frame of the video data; each first recognition result includes a person ID and a corresponding result value.
The second recognition result refers to a result obtained by performing voiceprint recognition on the audio data; each second recognition result includes a person ID and a corresponding result value.
For example, image recognition on at least one frame of the video data may yield multiple recognition results for a certain person: persons A and B are recognized, with person A being the target person with likelihood a1% (one first recognition result) and person B with likelihood b% (another first recognition result); here, a1% + b% may equal 1.
Voiceprint recognition on the audio data may likewise yield multiple recognition results: persons A and C are recognized, with person A being the target person with likelihood a2% (one second recognition result) and person C with likelihood c% (another second recognition result); here, a2% + c% may equal 1.
Suppose the first weight is x, the second weight is y, and x + y = 1.
The weighted results for each person's recognition results are then:
person A: a1% * x + a2% * y;
person B: b% * x;
person C: c% * y.
The person with the largest weighted result among persons A, B, and C is selected as the target person.
The likelihood above characterizes the recognition result. Taking the recognition result obtained by applying the image recognition model to a person image (specifically, a facial image) as an example: the image recognition model recognizes a certain person image and determines that the person in the image is person A, with likelihood a1%; here, a1% characterizes the similarity between the person image and the facial image of person A used by the image recognition model.
Taking the recognition result obtained by applying the voiceprint recognition model to a voice as an example: the voiceprint recognition model recognizes the audio data and determines that the person corresponding to the audio data is person C, with likelihood c%; here, c% characterizes the similarity between the voiceprint feature corresponding to the audio data and the voiceprint model of person C used by the voiceprint recognition model.
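The weighting scheme above can be sketched directly in code. The likelihoods and weights below are arbitrary example values standing in for a1%, b%, a2%, c%, x, and y.

```python
def fuse_recognition_results(first_results, second_results, x, y):
    """Weight image-based results by x and voiceprint-based results by y,
    sum the weighted scores per person ID, and return the best person."""
    scores = {}
    for person_id, likelihood in first_results.items():
        scores[person_id] = scores.get(person_id, 0.0) + likelihood * x
    for person_id, likelihood in second_results.items():
        scores[person_id] = scores.get(person_id, 0.0) + likelihood * y
    return max(scores, key=scores.get)

# Image recognition: A with 0.7 (a1%), B with 0.3 (b%);
# voiceprint recognition: A with 0.6 (a2%), C with 0.4 (c%).
first_results = {"A": 0.7, "B": 0.3}
second_results = {"A": 0.6, "C": 0.4}
target = fuse_recognition_results(first_results, second_results, x=0.5, y=0.5)
```

Person A's combined score (a1% * x + a2% * y = 0.65) exceeds B's (0.15) and C's (0.2), exercising the branch where the two modalities share a person and their weighted results are added.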
In practical applications, in order to let the user intuitively understand what the speaker is saying, a correspondence between the speech content and the corresponding speaker needs to be established and displayed.
Based on this, in step 303, adding a mark for the target person to the video data includes one of the following:
presenting, at a position in the video data corresponding to the target position (e.g., at the upper-left or upper-right corner of the target position), a mark in the form of a floating window or a fixed window, the recognized text being added inside the mark;
adding, at a position in the video data corresponding to the target position (e.g., above, below, or at the upper-left or upper-right corner of the target position), a preset mark for the target person, the preset mark including one of a preset icon, a preset image, and an outline frame for the target person, the recognized text being presented at a preset position of the video data.
Here, the recognized text is obtained by performing text recognition on the audio data corresponding to the target person, and changes as the audio data corresponding to the target person changes.
Here, the floating window characterizes a window that is movable within the video data being played; it should be noted that the movable range of the floating window may be limited to a certain range around the target person, so that even after the floating window is moved, the user can still associate the target person with the recognized text.
The fixed window characterizes a window whose position relative to the target person is fixed in the video data being played.
In the embodiments of this application, the mark may be presented in the form of any window, for example any of the marks in FIG. 4. The arrow of the window points to the corresponding speaker, i.e., the target person, and the window is used to present the content of the simultaneous interpretation, i.e., the recognized text.
Here, the recognized text may be text obtained based on the audio data; the language of the recognized text may be the same as the language of the audio data, or may differ from it, i.e., the recognized text may also be text in another language obtained by translating text in the language of the audio data.
Here, the preset icon may be a graphic mark such as a triangle, a five-pointed star, or a circle;
the preset image may be, for example, a facial image of the target person or a person identity image (e.g., an image characterizing an identity such as the host or the interviewee);
the outline frame may be drawn with dashed or solid lines; for example, the outline frame may be a dashed frame drawn around the target person.
Where a preset mark is added for the target person, the recognized text may be presented at a preset position of the video data; the preset position for presenting the recognized text may be set in advance, for example below the video.
Here, the mark may also be added to the displayed recognized text. Specifically, adding a mark for the target person to the video data includes:
presenting the recognized text at a preset position in the video data; and
presenting the identity of the target person at the head of the recognized text, where the identity may adopt at least one of the following: a facial image, a person's name, or a person's role (such as host or interviewee).
Through the recognized text and the identity of the person presented with it, the user can likewise associate the target person with what the target person said.
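As an illustration of limiting a floating caption window's movable range to an area around the target person, the following Python sketch clamps a window position to a fixed per-axis range around the speaker. The function name, coordinate convention, and the 80-pixel range are hypothetical assumptions for illustration only, not details from the embodiments.

```python
def clamp_window(window_xy, target_xy, max_offset=80):
    """Clamp a floating caption window so it stays within max_offset
    pixels of its speaker on each axis, keeping the association visible."""
    wx, wy = window_xy
    tx, ty = target_xy
    wx = min(max(wx, tx - max_offset), tx + max_offset)
    wy = min(max(wy, ty - max_offset), ty + max_offset)
    return wx, wy

# a window dragged far to the upper right is pulled back near the speaker
print(clamp_window((500, 50), (300, 200)))  # → (380, 120)
```

A renderer would apply this after every user drag, so the caption can be repositioned but never detaches visually from its speaker.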
It should be understood that the order in which the steps are described in the above embodiments (such as determining the first person, determining the second person, determining the person position corresponding to the first person, determining the sound source position corresponding to the second person, and so on) does not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and does not constitute any limitation on the implementation of the embodiments of this application.
The data processing method provided by the embodiments of this application acquires target data, where the target data includes video data and audio data; performs a person recognition operation for a target person using the target data, where the target person corresponds to the speaker in the target data; and adds a mark for the target person to the video data, where the mark and the recognized text determined from the audio data are presented when the video data is played. In this way, the target person corresponding to the recognized text can be determined and marked together with the recognized text, making it easy for the user to associate the target person with what the target person said and to understand that content in light of the target person's identity, thereby accurately helping the user understand the content and improving the user experience.
FIG. 5 is a schematic flowchart of another data processing method provided by an embodiment of this application. As shown in FIG. 5, the data processing method includes:
Step 501: the audio collection module collects audio data.
Step 502: perform voiceprint recognition on the audio data and determine the first speaker according to the identified target voiceprint feature; and perform sound source localization on the audio data to determine the sound source position.
Here, determining the first speaker according to the recognized voiceprint includes:
querying a voiceprint database according to the target voiceprint feature, and determining the first speaker corresponding to the audio data.
Here, the voiceprint database may be the voiceprint database of the method shown in FIG. 3; the voiceprint database includes at least one voiceprint model and the person information corresponding to each voiceprint model. The voiceprint database may be constructed by the same process as in the method of FIG. 3, which is not repeated here.
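A minimal sketch of the database lookup in step 502, assuming each voiceprint model is stored as a fixed-length embedding vector and matching is done by cosine similarity. The embedding values, the 0.7 threshold, and the function names are illustrative assumptions, not details from the embodiments.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def identify_speaker(target_embedding, voiceprint_db, threshold=0.7):
    """Return the enrolled person whose voiceprint model best matches the
    target voiceprint feature, or None if no match clears the threshold."""
    best_person, best_score = None, threshold
    for person, enrolled in voiceprint_db.items():
        score = cosine_similarity(target_embedding, enrolled)
        if score > best_score:
            best_person, best_score = person, score
    return best_person

db = {"host": [0.9, 0.1, 0.0], "interviewee": [0.1, 0.9, 0.1]}
print(identify_speaker([0.85, 0.15, 0.05], db))  # → host
```

In practice the embeddings would come from a trained speaker-verification model; the lookup logic stays the same.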
Here, the audio collection module may be a microphone array; performing sound source localization on the audio data includes calculating the sound source position from the multi-channel audio data of the microphone array. The sound source position is the position of the first speaker in the real scene.
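One common way to estimate a sound source direction from multi-channel microphone-array data is the time difference of arrival (TDOA) between microphone pairs. The sketch below, for a single two-microphone pair, is a simplified illustration of the idea; the patent does not specify a localization algorithm, so the formula and values here are assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 °C

def doa_from_tdoa(tdoa_s, mic_spacing_m):
    """Direction of arrival (degrees from broadside) for a two-microphone
    pair, from the inter-microphone time difference of arrival in seconds."""
    # far-field model: sin(theta) = tdoa * c / d; clamp for numerical safety
    s = max(-1.0, min(1.0, tdoa_s * SPEED_OF_SOUND / mic_spacing_m))
    return math.degrees(math.asin(s))

# sound arriving 0.2 ms earlier at one microphone of a 10 cm pair
print(doa_from_tdoa(0.0002, 0.10))
```

A real array would estimate the TDOA itself (e.g. by cross-correlating the channels) and combine several pairs to obtain a position rather than just a direction.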
Here, the first speaker corresponds to the second person in the method of FIG. 3; for the recognition operation of step 502, reference may be made to the person recognition operation on audio data in the method shown in FIG. 3, which is not repeated here.
Step 503: the video collection module collects video data.
Here, the video collection module may be a camera, and collecting video data includes shooting, through the camera, video of a specific location in the corresponding real scene; the captured video contains at least one speaker.
Step 504: perform face recognition on at least one frame of the video data, and determine at least one second speaker and the person position corresponding to each second speaker.
Here, a preset image recognition model may be used to perform image recognition on at least one frame of the video data to determine at least one second speaker, and to determine the person position of each second speaker in the video data.
Specifically, a first spatial coordinate system may be preset for the video data; according to the position of a second speaker in the video data, that speaker's position in the first spatial coordinate system can be determined and used as the person position.
Each frame of the video data is an image of the same shape (such as a square), and the first spatial coordinate system has a mapping relationship with the image; based on the position of the second speaker in the image, the position of that speaker in the first spatial coordinate system can be determined.
Here, by continuously recognizing each frame of the video data, face recognition and tracking can be realized (that is, the second speakers and their corresponding person positions are determined in real time).
It should be noted that after image recognition (specifically, face recognition) is used to identify a person and the person's position, target tracking may also be used to follow the person, so that the person position can still be updated even when the face image is lost (for example, when the person lowers or turns their head). Specifically, after the target person and the target position are determined in some frames of the video data, the target person may move, and the mark corresponding to the target person (such as a preset icon or an outline frame around the target person) should move along with the target position. Here, target tracking is used to follow the target person in subsequent image frames of the video data, determine the changed target position in real time, and add the mark for the target person accordingly, so that the mark moves with the changes in the target person's target position.
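The tracking step described above can be illustrated with a toy nearest-neighbor track update: when a face detection appears near the last known position, the track follows it; when the face is momentarily lost (head lowered or turned), the last position is held. Real systems use dedicated trackers (e.g. correlation-filter or Kalman-filter-based association); this sketch and its parameter values are only assumptions.

```python
import math

def _dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def update_track(track_pos, detections, max_jump=50):
    """Move the track to the nearest face detection within max_jump pixels;
    hold the last known position when no usable detection exists."""
    candidates = [d for d in detections if _dist(track_pos, d) <= max_jump]
    if candidates:
        return min(candidates, key=lambda d: _dist(track_pos, d))
    return track_pos

track = (100, 100)
track = update_track(track, [(110, 105), (400, 400)])  # follows the nearby face
track = update_track(track, [])                        # face lost: position held
print(track)  # → (110, 105)
```

Because the mark is drawn at the track position rather than the raw detection, it keeps following the speaker across brief detection failures.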
Here, the second speaker corresponds to the first person in the method of FIG. 3; for the recognition operation of step 504, reference may be made to the person recognition operation on video data in the method shown in FIG. 3, which is not repeated here.
Step 505: determine the target speaker and the target position according to the first speaker, the sound source position, the second speakers, and the person positions.
Here, the target speaker is determined by aligning the sound source position with the person positions; by further performing weighted fusion with the speaker information obtained through voiceprint recognition (that is, the speaker ID), a more accurate and stable speaker ID can be obtained.
Determining the target speaker and the target position according to the first speaker, the sound source position, the second speakers, and the person positions includes:
mapping the person positions corresponding to the at least two speakers and the sound source position corresponding to the first speaker to a preset spatial coordinate system; and
determining a first person position and a first sound source position that, when mapped into the preset spatial coordinate system, meet a preset distance condition; determining the target speaker using the first speaker corresponding to the first sound source position and the second speaker corresponding to the first person position; and taking the position of the target speaker in the preset spatial coordinate system as the target position.
Here, determining the target speaker using the first speaker corresponding to the first sound source position and the second speaker corresponding to the first person position includes: if the first speaker corresponding to the first sound source position is the same as the second speaker corresponding to the first person position, taking that first speaker as the target speaker; if they are not the same, obtaining a first weight and a second weight, where the first weight represents the credibility of the target speaker determined from the video data and the second weight represents the credibility of the target speaker determined from the audio data; weighting the recognition result of the first speaker corresponding to the first sound source position by the second weight, and weighting the recognition result of the second speaker corresponding to the first person position by the first weight; and selecting, according to the weighted results, one of the two speakers as the target speaker.
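The disagreement case above amounts to a weighted vote between the two modalities. A minimal sketch, assuming scalar credibility weights (the 0.6/0.4 defaults and the function name are illustrative, not from the embodiments):

```python
def fuse_speaker_ids(video_id, audio_id, w_video=0.6, w_audio=0.4):
    """Fuse the identities from face recognition and voiceprint recognition.
    If they agree, the common identity is returned; otherwise the identity
    from the modality with the larger credibility weight wins."""
    if video_id == audio_id:
        return video_id
    return video_id if w_video >= w_audio else audio_id

print(fuse_speaker_ids("host", "host"))                                  # → host
print(fuse_speaker_ids("host", "interviewee", w_video=0.3, w_audio=0.7)) # → interviewee
```

The weights could themselves come from per-modality confidence scores (e.g. face-match quality under occlusion versus voiceprint similarity in noise), which is what makes the fused ID more stable than either modality alone.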
Here, the position of the target speaker in the preset spatial coordinate system is the target position. The preset spatial coordinate system may be a newly created spatial coordinate system, in which case mapping the person positions and the sound source position to the preset spatial coordinate system means mapping the person positions and the sound source position into this new coordinate system. The newly created coordinate system has a correspondence with the scene in the video data, so the person positions can be mapped into it; it also has a correspondence with the real scene, so the sound source position can be mapped into it as well.
The preset spatial coordinate system may also be the coordinate system constructed for the video data (i.e., the first spatial coordinate system described above), in which case a person position directly represents the position of the second speaker in the preset spatial coordinate system. Mapping the sound source position to the preset spatial coordinate system then includes: based on a position mapping relationship, determining the position in the video data that corresponds to the sound source position, i.e., the position corresponding to the first speaker.
Here, for the operation of step 505, reference may be made to the person recognition operation performed using the video data and the audio data in FIG. 3, which is not repeated here.
Step 506: perform speech recognition on the audio data to obtain source-language text.
Step 507: perform machine translation on the source-language text to obtain target-language text.
Step 508: display the target-language text on the screen.
Here, displaying the target-language text on the screen includes:
adding, according to the target position, a mark for the target speaker to the video data being played, with the text corresponding to what the speaker said presented inside the mark.
Here, the mark is presented as a window when the video data is played; the recognized text determined from the audio data is presented inside the window.
During playback of the video data, the text corresponding to what each speaker said is displayed for that speaker, so that the content of each speaker is shown in the form of a dialogue.
The window may be a floating window or a fixed window, and it has clear directivity indicating the speaker of the corresponding content; such a directional dialogue window may be as shown in FIG. 4.
Step 509: synthesize speech from the target-language text, and play the synthesized speech.
Here, synthesizing and playing speech from the target-language text includes: performing speech synthesis on the target-language text (that is, the text in another language obtained by translating the transcription of the audio data) to obtain the synthesized speech, which is the translated speech corresponding to the audio data; when the video data is played, the translated speech can be played at the same time.
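Steps 506 to 509 form a recognize → translate → display → synthesize pipeline. The sketch below wires the stages together with stub functions standing in for real ASR, machine-translation, and TTS engines; all function names, the data shapes, and the toy dictionary are hypothetical.

```python
def recognize_speech(audio_chunk):
    # stand-in for a real speech recognizer (step 506)
    return audio_chunk["transcript"]

def translate(text, target_lang):
    # stand-in for a real machine-translation engine (step 507)
    toy_dictionary = {("你好", "en"): "hello"}
    return toy_dictionary.get((text, target_lang), text)

def synthesize(text, target_lang):
    # stand-in for a real TTS engine (step 509); returns a fake audio handle
    return f"<speech:{target_lang}:{text}>"

def interpret(audio_chunk, target_lang="en"):
    """Run one audio chunk through the full simultaneous-interpretation chain."""
    source_text = recognize_speech(audio_chunk)
    target_text = translate(source_text, target_lang)  # shown on screen (step 508)
    return target_text, synthesize(target_text, target_lang)

text, speech = interpret({"transcript": "你好"})
print(text)  # → hello
```

In a live system each stage would run incrementally per audio chunk, with the displayed text attached to the mark of the identified speaker.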
The data processing method provided by the embodiments of this application combines audio data and video data to perform person recognition, determines the speaker (i.e., the target person), suppresses recognition noise, and improves recognition accuracy and the robustness of a system using the method, thereby reducing application restrictions and broadening the scope of application. In addition, by showing users the correspondence between the speaker and the simultaneous-interpretation content (i.e., what the speaker said), it reduces the difficulty of understanding the content and improves the audience's experience through a dialogue-style dynamic presentation.
The data processing method of the embodiments of this application can be applied to an electronic device, specifically the electronic device to which the method shown in FIG. 3 is applied; in practice, for the related operations of the electronic device, reference may be made to the description of the method in FIG. 3, which is not repeated here.
It should be noted that the audio collection module and the video collection module may be modules connected to, or provided in, the server; of course, they may also be other independent modules that collect the corresponding data and send it to the server, which then performs the corresponding data processing.
To implement the data processing method of the embodiments of this application, an embodiment of this application further provides a data processing device. FIG. 6 is a schematic diagram of the composition of a data processing device according to an embodiment of this application; as shown in FIG. 6, the data processing device includes:
an acquisition unit 61, configured to acquire target data, where the target data includes video data and audio data;
a first processing unit 62, configured to use the target data to perform a person recognition operation for a target person, where the target person corresponds to the speaker in the target data; and
a second processing unit 63, configured to add a mark for the target person to the video data, where the mark and the recognized text determined from the audio data are presented when the video data is played.
In one embodiment, the first processing unit 62 is configured to recognize the target person according to the audio data, and to recognize the target position of the target person according to the target data;
and the second processing unit 63 is configured to add the mark for the target person to the video data according to the target position.
In one embodiment, the first processing unit 62 is configured to determine the target voiceprint feature corresponding to the audio data;
to query a voiceprint database according to the target voiceprint feature and determine the target person corresponding to the audio data, where the voiceprint database includes at least one voiceprint model and the person information corresponding to each voiceprint model; and
to determine the facial image corresponding to the target person from an image database, perform image recognition on at least one frame of the video data according to the facial image, and determine the target position of the target person in the video data, where the image database includes facial images of at least one person and the person information corresponding to each facial image.
In one embodiment, the first processing unit 62 is configured to determine the target voiceprint feature corresponding to the audio data;
to query a voiceprint database according to the target voiceprint feature and determine the target person corresponding to the audio data, where the voiceprint database includes at least one voiceprint model and the person information corresponding to each voiceprint model;
to determine the sound source position corresponding to the audio data; and
to determine, based on a position mapping relationship, the target position in the video data corresponding to the sound source position, where the position mapping relationship represents the position mapping between the real scene and the scene in the video data.
In one embodiment, the first processing unit 62 is configured to recognize the target person and the target position of the target person according to the video data and the audio data;
and the second processing unit 63 is configured to add the mark for the target person to the video data according to the target position.
In one embodiment, the first processing unit 62 is configured to perform image recognition on at least one frame of the video data, and determine at least one first person in the video data and the person position of each first person in the video data;
to determine the target voiceprint feature corresponding to the audio data, query a voiceprint database according to the target voiceprint feature to determine the second person corresponding to the audio data, and determine, according to the audio data, the sound source position corresponding to the audio data;
to map the person position corresponding to each first person and the sound source position to a preset spatial coordinate system, where the preset spatial coordinate system has a position mapping relationship with the real scene and with the scene in the video data; and
to determine a first person position and a first sound source position that, when mapped into the preset spatial coordinate system, meet a preset distance condition, determine the target person using the first person corresponding to the first person position and the second person corresponding to the first sound source position, and take the position of the target person in the preset spatial coordinate system as the target position;
where the voiceprint database includes at least one voiceprint model and the person information corresponding to each voiceprint model, and the image database includes facial images of at least one person and the person information corresponding to each facial image.
Here, the audio data is collected through a microphone array.
In one embodiment, the first processing unit 62 is configured to: if the first person corresponding to the first person position is the same as the second person corresponding to the first sound source position, take that first person as the target person;
and if they are not the same, obtain a first weight and a second weight, where the first weight represents the credibility of the target person determined from the video data and the second weight represents the credibility of the target person determined from the audio data; weight the recognition result of the first person corresponding to the first person position by the first weight, and weight the recognition result of the second person corresponding to the first sound source position by the second weight; and select, according to the weighted results, one of the two persons as the target person.
In one embodiment, the first processing unit 62 is configured to obtain the sound data of each channel of the microphone array, and to perform position calculation according to the sound data of each channel to obtain the sound source position.
In one embodiment, the second processing unit 63 is configured to add the mark for the target person to the video data in one of the following ways:
presenting the mark in the form of a floating window or a fixed window at the position corresponding to the target position in the video data, with the recognized text added inside the mark; or
adding a preset mark for the target person at the position corresponding to the target position in the video data, where the preset mark includes one of the following: a preset icon, a preset image, or an outline frame around the target person, and the recognized text is presented at a preset position in the video data.
In practice, the first processing unit 62 and the second processing unit 63 may each be implemented by a processor in the electronic device (such as a server or terminal), for example a central processing unit (CPU), a digital signal processor (DSP), a microcontroller unit (MCU), or a field-programmable gate array (FPGA). The acquisition unit 61 may be implemented by a communication interface in the electronic device.
It should be noted that when the device provided by the above embodiment performs data processing, the division into the above program modules is only an example; in practice, the above processing may be assigned to different program modules as needed, that is, the internal structure of the terminal may be divided into different program modules to complete all or part of the processing described above. In addition, the device provided by the above embodiment and the data processing method embodiments belong to the same concept; for the specific implementation process, see the method embodiments, which are not repeated here.
Based on the hardware implementation of the above device, an embodiment of this application further provides an electronic device. FIG. 7 is a schematic diagram of the hardware composition of the electronic device according to an embodiment of this application. As shown in FIG. 7, the electronic device 70 includes a memory 73, a processor 72, and a computer program stored in the memory 73 and runnable on the processor 72; when the processor 72 of the electronic device executes the program, the method provided by one or more of the technical solutions on the electronic device side is implemented.
Specifically, when the processor 72 of the electronic device 70 executes the program, the following is implemented: acquiring target data, where the target data includes video data and audio data; using the target data to perform a person recognition operation for a target person, where the target person corresponds to the speaker in the target data; and adding a mark for the target person to the video data, where the mark and the recognized text determined from the audio data are presented when the video data is played.
It should be noted that the specific steps implemented when the processor 72 of the electronic device 70 executes the program have been described in detail above and are not repeated here.
It can be understood that the electronic device further includes a communication interface 71, and the components of the electronic device are coupled together through a bus system 74. It can be understood that the bus system 74 is configured to implement connection and communication between these components; in addition to a data bus, the bus system 74 also includes a power bus, a control bus, and a status signal bus.
It can be understood that the memory 73 in this embodiment may be a volatile memory or a non-volatile memory, and may also include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a ferromagnetic random access memory (FRAM), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM); the magnetic surface memory may be a magnetic disk memory or a magnetic tape memory. The volatile memory may be a random access memory (RAM), which is used as an external cache.
By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), synchronous static random access memory (SSRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), SyncLink dynamic random access memory (SLDRAM), and direct Rambus random access memory (DRRAM). The memories described in the embodiments of this application are intended to include, but are not limited to, these and any other suitable types of memory.
The methods disclosed in the foregoing embodiments of the present application may be applied to the processor 72, or implemented by the processor 72. The processor 72 may be an integrated circuit chip with signal processing capability. During implementation, the steps of the foregoing methods may be completed by an integrated logic circuit of hardware in the processor 72 or by instructions in the form of software. The processor 72 may be a general-purpose processor, a DSP, another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 72 may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of the present application may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium, and the storage medium is located in the memory; the processor 72 reads the information in the memory and completes the steps of the foregoing methods in combination with its hardware.
The embodiments of the present application further provide a storage medium, specifically a computer storage medium, and more specifically a computer-readable storage medium. Computer instructions, that is, a computer program, are stored thereon; when the computer instructions are executed by a processor, the method provided by one or more of the foregoing technical solutions on the electronic device side is implemented.
In the several embodiments provided in this application, it should be understood that the disclosed method and smart device may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of the units is only a logical function division, and there may be other division manners in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may all be integrated into one second processing unit, or each unit may serve as a separate unit, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
A person of ordinary skill in the art can understand that all or part of the steps of the foregoing method embodiments may be implemented by program instructions and related hardware. The foregoing program may be stored in a computer-readable storage medium, and when executed, the program performs the steps of the foregoing method embodiments. The foregoing storage medium includes various media that can store program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
Alternatively, if the foregoing integrated unit of the present application is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application, in essence, or the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present application. The foregoing storage medium includes various media that can store program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
It should be noted that "first", "second", and the like are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence.
In addition, the technical solutions described in the embodiments of the present application may be combined arbitrarily, provided there is no conflict.
The above are only specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application.

Claims (12)

  1. A data processing method, characterized by comprising:
    acquiring target data, the target data comprising video data and audio data;
    performing, by using the target data, a person recognition operation for a target person, the target person corresponding to a speaker in the target data; and
    adding, to the video data, a mark for the target person, wherein the mark and recognized text determined based on the audio data are presented when the video data is played.
  2. The method according to claim 1, wherein
    the person recognition operation comprises:
    identifying the target person according to the audio data; and
    identifying a target position of the target person according to the target data; and
    the adding, to the video data, a mark for the target person comprises:
    adding the mark for the target person to the video data according to the target position.
  3. The method according to claim 2, wherein
    the identifying the target person according to the audio data comprises:
    determining a target voiceprint feature corresponding to the audio data; and
    querying a voiceprint database according to the target voiceprint feature to determine the target person corresponding to the audio data, the voiceprint database comprising at least one voiceprint model and person information corresponding to the at least one voiceprint model; and
    the identifying a target position of the target person according to the target data comprises:
    determining a facial image corresponding to the target person from an image database, performing image recognition on at least one frame of image in the video data according to the facial image, and determining the target position of the target person in the video data, the image database comprising a facial image of at least one person and person information corresponding to the facial image of the at least one person.
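As an illustrative sketch only (not part of the claims), the voiceprint lookup in claim 3 can be pictured as a best-match search over enrolled voiceprint embeddings. The embedding vectors, the `identify_speaker` helper, and the 0.7 similarity threshold below are all hypothetical choices for illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def identify_speaker(target_embedding, voiceprint_db, threshold=0.7):
    """Return the person whose enrolled voiceprint best matches the
    target embedding, or None if no match clears the threshold."""
    best_person, best_score = None, threshold
    for person, enrolled in voiceprint_db.items():
        score = cosine_similarity(target_embedding, enrolled)
        if score > best_score:
            best_person, best_score = person, score
    return best_person

# Toy database: one enrolled voiceprint model per person.
db = {"Alice": [0.9, 0.1, 0.0], "Bob": [0.1, 0.9, 0.2]}
print(identify_speaker([0.88, 0.12, 0.05], db))  # Alice
```

A real system would extract the embeddings from audio with a trained speaker model; the lookup structure, however, stays this simple.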
  4. The method according to claim 2, wherein
    the identifying the target person according to the audio data comprises:
    determining a target voiceprint feature corresponding to the audio data; and
    querying a voiceprint database according to the target voiceprint feature to determine the target person corresponding to the audio data, the voiceprint database comprising at least one voiceprint model and person information corresponding to the at least one voiceprint model; and
    the identifying a target position of the target person according to the target data comprises:
    determining a sound source position corresponding to the audio data; and
    determining, based on a position mapping relationship, that the sound source position corresponds to the target position in the video data, the position mapping relationship representing a position mapping between a real scene and a scene in the video data.
  5. The method according to claim 1, wherein the person recognition operation comprises:
    identifying the target person and a target position of the target person according to the video data and the audio data; and
    the adding, to the video data, a mark for the target person comprises:
    adding the mark for the target person to the video data according to the target position.
  6. The method according to claim 5, wherein the identifying the target person and a target position of the target person according to the video data and the audio data comprises:
    performing image recognition on at least one frame of image in the video data to determine at least one first person in the video data and a person position of the at least one first person in the video data;
    determining a target voiceprint feature corresponding to the audio data; querying a voiceprint database according to the target voiceprint feature to determine a second person corresponding to the audio data; and determining, according to the audio data, a sound source position corresponding to the audio data;
    mapping the person position corresponding to the at least one first person and the sound source position into a preset spatial coordinate system, wherein a position mapping relationship exists between the preset spatial coordinate system and a real scene, and a position mapping relationship exists between the preset spatial coordinate system and a scene in the video data; and
    determining a first person position and a first sound source position that are mapped into the preset spatial coordinate system and satisfy a preset distance condition, determining the target person by using the first person corresponding to the first person position and the second person corresponding to the first sound source position, and using a position of the target person in the preset spatial coordinate system as the target position;
    wherein the voiceprint database comprises at least one voiceprint model and person information corresponding to the at least one voiceprint model; and
    the image database comprises a facial image of at least one person and person information corresponding to the facial image of the at least one person.
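The association in claim 6, mapping detected persons and the localized sound source into one coordinate system and pairing them under a distance condition, can be sketched as a nearest-neighbor match. The 2-D coordinates, the `associate_speaker` helper, and the `max_dist` threshold below are illustrative assumptions, not part of the claims:

```python
import math

def associate_speaker(face_positions, source_pos, max_dist=0.5):
    """Pair the sound-source position with the nearest detected face
    in a shared coordinate system, if it is close enough.

    face_positions: {person_name: (x, y)} already mapped into the
    common frame; source_pos: (x, y) of the localized sound source.
    Returns (person_name, position), or None if no face lies within
    max_dist of the source (the "preset distance condition").
    """
    best, best_d = None, max_dist
    for person, pos in face_positions.items():
        d = math.dist(pos, source_pos)
        if d <= best_d:
            best, best_d = (person, pos), d
    return best

faces = {"A": (0.2, 0.1), "B": (1.5, 0.9)}
print(associate_speaker(faces, (1.4, 1.0)))  # ('B', (1.5, 0.9))
```

In practice the mapping into the common frame would come from camera calibration and the microphone-array geometry; the pairing step itself reduces to this distance test.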
  7. The method according to claim 6, wherein the determining the target person by using the first person corresponding to the first person position and the second person corresponding to the first sound source position comprises:
    if the first person corresponding to the first person position is the same as the second person corresponding to the first sound source position, using the first person corresponding to the first person position as the target person; and
    if the first person corresponding to the first person position is different from the second person corresponding to the first sound source position, acquiring a first weight and a second weight, the first weight representing a credibility of the target person determined according to the video data, and the second weight representing a credibility of the target person determined according to the audio data; weighting the recognition result of the first person corresponding to the first person position according to the first weight, and weighting the recognition result of the second person corresponding to the first sound source position according to the second weight; and selecting, according to the weighting results, one of the first person corresponding to the first person position and the second person corresponding to the first sound source position as the target person.
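The weighted decision in claim 7 can be sketched as comparing the weighted confidences of the two modalities when they disagree. The specific weights and the `fuse_identity` helper below are hypothetical, chosen only to illustrate the selection rule:

```python
def fuse_identity(visual, audio, w_visual=0.6, w_audio=0.4):
    """visual/audio: (person_name, confidence) from each modality.
    If both modalities name the same person, return that person;
    otherwise pick the modality whose weighted confidence is higher.
    """
    v_person, v_conf = visual
    a_person, a_conf = audio
    if v_person == a_person:
        return v_person
    # Disagreement: weight each recognition result and compare.
    return v_person if w_visual * v_conf >= w_audio * a_conf else a_person

print(fuse_identity(("Alice", 0.9), ("Alice", 0.8)))  # Alice (agreement)
print(fuse_identity(("Alice", 0.5), ("Bob", 0.95)))   # Bob (0.30 < 0.38)
```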
  8. The method according to claim 4 or 6, wherein the determining the sound source position corresponding to the audio data comprises:
    acquiring sound data of each channel in a microphone array; and
    performing position calculation according to the sound data of each channel to obtain the sound source position.
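One common way to realize the position calculation of claim 8 is to estimate the time difference of arrival (TDOA) between channels by cross-correlation; the claim does not prescribe a specific algorithm, so the brute-force correlation below is only an illustrative sketch:

```python
def estimate_delay(sig_a, sig_b, max_lag):
    """Estimate the lag (in samples) of sig_b relative to sig_a by
    maximizing the cross-correlation over [-max_lag, max_lag]."""
    best_lag, best_corr = 0, float("-inf")
    n = len(sig_a)
    for lag in range(-max_lag, max_lag + 1):
        corr = sum(sig_a[i] * sig_b[i + lag]
                   for i in range(n)
                   if 0 <= i + lag < n)
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag

# A pulse that reaches microphone B three samples after microphone A:
a = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
b = [0, 0, 0, 0, 0, 1, 2, 1, 0, 0]
print(estimate_delay(a, b, 5))  # 3
```

Combined with the sampling rate, the speed of sound, and the microphone spacing, the estimated lag yields a bearing to the source; robust implementations typically use variants such as GCC-PHAT rather than raw correlation.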
  9. The method according to claim 1, wherein the adding, to the video data, a mark for the target person comprises one of the following:
    presenting the mark in the form of a floating window or a fixed window at a position in the video data corresponding to the target position, the recognized text being added within the mark; or
    adding a preset mark for the target person at a position in the video data corresponding to the target position, the preset mark comprising one of a preset icon, a preset image, and an outline frame for the target person, and the recognized text being presented at a preset position of the video data.
  10. A data processing apparatus, characterized by comprising:
    an acquiring unit configured to acquire target data, the target data comprising video data and audio data;
    a first processing unit configured to perform, by using the target data, a person recognition operation for a target person, the target person corresponding to a speaker in the target data; and
    a second processing unit configured to add, to the video data, a mark for the target person, wherein the mark and recognized text determined based on the audio data are presented when the video data is played.
  11. An electronic device, characterized by comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 9.
  12. A storage medium having computer instructions stored thereon, characterized in that the instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 9.
PCT/CN2019/127087 2019-12-20 2019-12-20 Data processing method and apparatus, electronic device, and storage medium WO2021120190A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2019/127087 WO2021120190A1 (en) 2019-12-20 2019-12-20 Data processing method and apparatus, electronic device, and storage medium
CN201980100983.7A CN114556469A (en) 2019-12-20 2019-12-20 Data processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/127087 WO2021120190A1 (en) 2019-12-20 2019-12-20 Data processing method and apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2021120190A1 true WO2021120190A1 (en) 2021-06-24

Family

ID=76478195

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/127087 WO2021120190A1 (en) 2019-12-20 2019-12-20 Data processing method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN114556469A (en)
WO (1) WO2021120190A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512348A (en) * 2016-01-28 2016-04-20 北京旷视科技有限公司 Method and device for processing videos and related audios and retrieving method and device
CN106328146A (en) * 2016-08-22 2017-01-11 广东小天才科技有限公司 Video subtitle generation method and apparatus
CN106340294A (en) * 2016-09-29 2017-01-18 安徽声讯信息技术有限公司 Synchronous translation-based news live streaming subtitle on-line production system
US20180174587A1 (en) * 2016-12-16 2018-06-21 Kyocera Document Solution Inc. Audio transcription system
CN110035326A (en) * 2019-04-04 2019-07-19 北京字节跳动网络技术有限公司 Subtitle generation, the video retrieval method based on subtitle, device and electronic equipment

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113573096A (en) * 2021-07-05 2021-10-29 维沃移动通信(杭州)有限公司 Video processing method, video processing device, electronic equipment and medium
WO2023006001A1 (en) * 2021-07-29 2023-02-02 华为技术有限公司 Video processing method and electronic device
CN114299944A (en) * 2021-12-08 2022-04-08 天翼爱音乐文化科技有限公司 Video processing method, system, device and storage medium
CN114299944B (en) * 2021-12-08 2023-03-24 天翼爱音乐文化科技有限公司 Video processing method, system, device and storage medium
CN115022733A (en) * 2022-06-17 2022-09-06 中国平安人寿保险股份有限公司 Abstract video generation method and device, computer equipment and storage medium
CN115022733B (en) * 2022-06-17 2023-09-15 中国平安人寿保险股份有限公司 Digest video generation method, digest video generation device, computer device and storage medium

Also Published As

Publication number Publication date
CN114556469A (en) 2022-05-27

Similar Documents

Publication Publication Date Title
Czyzewski et al. An audio-visual corpus for multimodal automatic speech recognition
US10621991B2 (en) Joint neural network for speaker recognition
WO2021120190A1 (en) Data processing method and apparatus, electronic device, and storage medium
US11152006B2 (en) Voice identification enrollment
US20190215464A1 (en) Systems and methods for decomposing a video stream into face streams
WO2020119032A1 (en) Biometric feature-based sound source tracking method, apparatus, device, and storage medium
CN112037791B (en) Conference summary transcription method, apparatus and storage medium
CN112088315A (en) Multi-mode speech positioning
US11527242B2 (en) Lip-language identification method and apparatus, and augmented reality (AR) device and storage medium which identifies an object based on an azimuth angle associated with the AR field of view
Minotto et al. Multimodal multi-channel on-line speaker diarization using sensor fusion through SVM
US10922570B1 (en) Entering of human face information into database
CN111050201A (en) Data processing method and device, electronic equipment and storage medium
CN109560941A (en) Minutes method, apparatus, intelligent terminal and storage medium
CN112653902A (en) Speaker recognition method and device and electronic equipment
WO2022179453A1 (en) Sound recording method and related device
TWM594323U (en) Intelligent meeting record system
WO2019227552A1 (en) Behavior recognition-based speech positioning method and device
CN114513622A (en) Speaker detection method, speaker detection apparatus, storage medium, and program product
McCowan et al. Towards computer understanding of human interactions
JP7400364B2 (en) Speech recognition system and information processing method
CN112908336A (en) Role separation method for voice processing device and voice processing device thereof
Cabañas-Molero et al. Multimodal speaker diarization for meetings using volume-evaluated SRP-PHAT and video analysis
CN111161710A (en) Simultaneous interpretation method and device, electronic equipment and storage medium
WO2019150708A1 (en) Information processing device, information processing system, information processing method, and program
Stiefelhagen et al. Audio-visual perception of a lecturer in a smart seminar room

Legal Events

Date Code Title Description
121: EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19956206; Country of ref document: EP; Kind code of ref document: A1)
NENP: Non-entry into the national phase (Ref country code: DE)
32PN: EP: public notification in the EP bulletin as the address of the addressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 21.11.2022))
122: EP: PCT application non-entry in the European phase (Ref document number: 19956206; Country of ref document: EP; Kind code of ref document: A1)