CN106782545A

CN106782545A - System and method for converting audio and video data into a written record

Info

Publication number: CN106782545A
Application number: CN201611170040.6A
Authority: CN
Inventors: 李纯冬
Original assignee: Guangzhou Shiyuan Electronics Technology Co Ltd; Guangzhou Shirui Electronics Co Ltd
Current assignee: Guangzhou Shiyuan Electronics Technology Co Ltd; Guangzhou Shirui Electronics Co Ltd
Priority date: 2016-12-16
Filing date: 2016-12-16
Publication date: 2017-05-31
Anticipated expiration: 2036-12-16
Also published as: WO2018107605A1; CN106782545B

Abstract

The present invention relates to a system and method for converting audio and video data into a written record. The system includes a data collection part, a data recognition part, and a data organization part. The data collection part includes an audio collection module and a video collection module. The data recognition part includes a speech and voiceprint recognition module and a face and expression recognition module. The data organization part generates the written record from the text information, the identified start time, the identity label of the current speaker, and the emotion of the current speaker. The invention preserves the whole audio and video session in greater detail and more completely, closer to what actually happened. Because the audio and video data are converted to text format for storage, the cost of storage and transmission is greatly reduced, and the record is easier to review later: conference content can be browsed and located more quickly.

Description

System and method for converting audio and video data into a written record

Technical field

The present invention relates to data processing technology, and in particular to a system and method for converting audio and video data into a written record.

Background art

When an audio/video conference is held, the conference content is usually recorded by capturing video data with a camera and audio data with a microphone, or by capturing audio data with a microphone alone. The audio and video data, or the audio data, are saved as a multimedia file and stored on a storage device; the conference content can then be watched or listened to by playing the multimedia file. Alternatively, a dedicated note-taker can record the conference content using an input device such as a computer, or by hand.

When audio and video data are captured with devices such as cameras and microphones, the audio/video files must be stored on a storage device. They occupy a large amount of storage space, so the cost is relatively high. Moreover, when the conference content is later reviewed by playing the multimedia file, it is not possible to browse quickly or jump to a specific topic, so reviewing takes more time, content may be missed, and efficiency is low. Recording the conference content as notes does make it easier to browse quickly and locate a specific topic, but it places high demands on the note-taker: if the writing speed is far below the pace of the discussion, omissions and errors occur easily, the content is not sufficiently complete or detailed, and a later reading of the record cannot reconstruct the scene at the time.

The prior art discloses a method for bidirectional, reversible speech conversion and subtitling of audio-visual material between Chinese and foreign languages. The method recognizes the speech in the audio-visual material as text, translates the text into a specified foreign language, superimposes it on the picture as subtitles, and stores or synchronously outputs it together with the original speech, so that the material carries subtitles in the specified foreign language. However, this method only recognizes speech as text and displays the translated text on the picture as subtitles; it does not organize the subtitles by speaker into content that is closer to the actual situation.

Summary of the invention

To solve the technical problems described above, namely that recorded conference content is not complete or detailed enough and that a later reading of the record cannot reconstruct the scene at the time, the present invention provides a system and method for converting audio and video data into a written record. The technical solution is as follows.

A system for converting audio and video data into a written record includes a data collection part, a data recognition part, and a data organization part.

The data collection part includes an audio collection module and a video collection module.

The audio collection module is used to capture the audio data of the current speaker and to record the start time of the speech.

The video collection module is used to capture images of the current speaker and to record the start time of the speech.

The data recognition part includes a speech and voiceprint recognition module and a face and expression recognition module.

The speech and voiceprint recognition module processes the audio data captured by the audio collection module to convert it into text information in text format, and also processes that audio data to identify the current speaker and assign an identity label to each speaker.

The face and expression recognition module uses expression recognition technology to process the images captured by the video collection module, recognizing the expression of the current speaker to obtain the speaker's emotion; it also processes those images to identify the current speaker and assign an identity label to each speaker.

The data organization part generates the written record from the text information, the identified start time, the identity label of the current speaker, and the emotion of the current speaker.

Further, the written record is a written record in dialogue form.

Further, the system also includes a data supplement and correction part, which is used to supplement and correct the written record.

Further, the audio collection module is a microphone, and/or the video collection module is a camera.

Further, the system also includes a storage module, which stores a data table recording the identity label, voiceprint feature data, and facial feature data of each speaker.

A method for converting audio and video data into a written record includes the following steps:

Step S21, data collection: capture the audio data of the current speaker and images of the current speaker, and record the start time of the speech.

Step S22, data recognition: process the captured audio data to convert it into text information in text format; process the captured images to recognize the expression of the current speaker and obtain the speaker's emotion; process the captured audio data and/or the captured images to identify the current speaker, and assign an identity label to each speaker.

Step S23, data organization: generate the written record from the text information, the identified start time, the identity label of the current speaker, and the emotion of the current speaker.

Further, the written record is a written record in dialogue form.

Further, the method also includes step S24, data supplement and correction: supplement and correct the written record.

Further, the identity label is stored in a storage module in association with the speaker's voiceprint feature data and/or facial feature data. Before an identity label is assigned to a speaker, the storage module is first searched for a stored identity label that matches the speaker; if none is found, a new identity label is assigned to the speaker.
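The look-up-before-assign rule just described can be sketched as follows. This is only an illustrative sketch: the patent does not say how a match is decided, so the cosine-similarity comparison, the 0.9 threshold, the feature vectors as plain lists of floats, and the names SpeakerEntry, find_or_assign and "Speaker N" are all assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SpeakerEntry:
    """One row of the stored table: label plus optional feature vectors."""
    label: str
    voiceprint: Optional[List[float]] = None
    face: Optional[List[float]] = None


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_or_assign(table, voiceprint=None, face=None, threshold=0.9):
    """Return the label of a matching stored speaker, or register a new one."""
    for entry in table:
        if (voiceprint is not None and entry.voiceprint is not None
                and cosine(voiceprint, entry.voiceprint) >= threshold):
            return entry.label
        if (face is not None and entry.face is not None
                and cosine(face, entry.face) >= threshold):
            return entry.label
    # No stored speaker matched: assign a new identity label (hypothetical naming).
    label = f"Speaker {len(table) + 1}"
    table.append(SpeakerEntry(label, voiceprint, face))
    return label
```

A real system would replace cosine and the threshold with whatever comparison its voiceprint or face recognizer provides; the point of the sketch is only the order of operations, search first and assign only on a miss.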

Step S30, preparation: start the microphone and camera, create a speaker list, and create the file address for saving the text. Each entry in the speaker list includes a unique identity label, the speaker's voiceprint feature data, and the speaker's facial feature data.

Step S31, data capture: when a speaker starts speaking, the microphone captures the speech of the current speaker to obtain that speaker's audio data, and the start time of the audio data is recorded; at the same time, the camera captures images of the current speaker to obtain that speaker's video data, and the start time of the video data is recorded.

Step S32: analyze the audio data obtained by the microphone using voiceprint recognition technology and perform voiceprint feature recognition; analyze the video data obtained by the camera using face recognition technology and perform facial feature recognition.

Step S33: determine whether the voiceprint feature data and the facial feature data were successfully recognized. If the voiceprint feature data and/or the facial feature data were successfully recognized, go to step S34; if neither was recognized, go to step S35.

Step S34: determine whether the speaker list already contains a speaker matching the voiceprint feature data and/or the facial feature data. If so, go to step S35 and, at the same time, complete that speaker's information; if not, add a new entry to the speaker list, save the new speaker's identity label, voiceprint feature data, and facial feature data, and go to step S35.

Step S35: analyze the audio data using speech recognition technology and convert the audio data into text.

Step S36: determine whether the audio data was successfully converted into text. If so, go to step S37; if not, return to step S35 and continue the conversion.

Step S37: analyze the video data using expression recognition technology to obtain the emotion of the current speaker.

Step S38: obtain the current date and time.

Step S39: organize the date, the time, the identity label of the speaker matched by voiceprint and/or facial feature data, the text from speech recognition, and the emotion from expression recognition into a written record in dialogue form, and save it at the created file address.

Step S40: return to step S31 and continue, saving the content spoken by different speakers, until the conversation ends.
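The control flow of steps S30 to S40 can be sketched roughly as below. The recognizers are passed in as plain callables because the patent treats them as known technology and names no library; run_session, the "Speaker N" and "Unknown speaker" labels, and the output line layout are hypothetical. One deliberate deviation is flagged in the code: the text returns to step S35 until conversion succeeds, whereas the sketch assumes a bounded number of retries so that it always terminates.

```python
from datetime import datetime


def run_session(utterances, recognize_speaker, speech_to_text, recognize_emotion,
                max_retries=3, now=datetime.now):
    """Process captured (audio, video) pairs one by one and return the record lines."""
    speakers = {}   # S30: speaker list, reduced here to feature key -> identity label
    record = []     # S30: stands in for the file created to hold the text
    for audio, video in utterances:                  # S31: capture data
        key = recognize_speaker(audio, video)        # S32: voiceprint / face recognition
        if key is not None:                          # S33: was anything recognized?
            # S34: reuse the stored label, or add a new entry if there is no match.
            label = speakers.setdefault(key, f"Speaker {len(speakers) + 1}")
        else:
            label = "Unknown speaker"                # assumption: placeholder label
        text = None
        for _ in range(max_retries):                 # S35/S36: convert, retry on failure
            text = speech_to_text(audio)
            if text:
                break
        if not text:
            continue                                 # assumption: skip after max_retries
        emotion = recognize_emotion(video)           # S37: expression recognition
        stamp = now()                                # S38: current date and time
        mood = f" ({emotion})" if emotion else ""
        record.append(f'{stamp:%Y-%m-%d %H:%M:%S} {label}{mood}: "{text}"')  # S39
    return record                                    # S40: loop ends with the conversation
```

Injecting the recognizers keeps the sketch runnable with simple stand-in functions and makes it clear that only the sequencing, not the recognition itself, is being illustrated.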

Beneficial effects of the present invention:

The invention preserves the whole audio and video session in greater detail and more completely, closer to what actually happened. Because the audio and video data are converted to text format for storage, the cost of storage and transmission is greatly reduced, and the record is easier to review later: conference content can be browsed and located more quickly.

The invention uses voiceprint recognition and face recognition to distinguish the different participants, and arranges and organizes the text content obtained through speech recognition and expression recognition into a written record in dialogue form.

The invention provides an interface through which users can conveniently supplement and correct the record, which ensures the correctness of the written record and improves its readability.

Brief description of the drawings

Fig. 1 is a structural block diagram of the system proposed by the present invention for converting audio and video data into a written record;

Fig. 2 is a schematic structural diagram of the system proposed by the present invention for converting audio and video data into a written record;

Fig. 3 is a first flow chart of the method proposed by the present invention for converting audio and video data into a written record;

Fig. 4 is a second flow chart of the method proposed by the present invention for converting audio and video data into a written record.

Detailed description

To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in more detail below with reference to specific embodiments and the accompanying drawings. The embodiments use a video conference recording as an example, but those skilled in the art will appreciate that the method can be applied to any video footage and is not limited to the drawings and the embodiments below.

The invention relies on well-known technologies such as speech recognition, voiceprint recognition, face recognition, and expression recognition to obtain the necessary data. These recognition technologies are relatively mature and can achieve "text-independent voiceprint recognition", "face tracking", "facial action recognition", "expression change recognition", and so on; they are not described or explained in detail here. Based on the data they produce, the system and method proposed by the invention organize audio and video data into a complete and vivid written record in dialogue form.

Embodiment 1:

As shown in Fig. 1 and Fig. 2, the system proposed by the present invention for converting audio and video data into a written record includes a data collection part, a data recognition part, a data organization part, and a data supplement and correction part.

The data collection part includes data acquisition devices such as a microphone and a camera.

The microphone captures the audio data of the participant who is currently speaking. When a participant starts speaking, the microphone collects that participant's audio data, and the intensity of the collected audio is used to judge whether the participant is speaking or pausing. If a pause exceeds a certain duration (for example, 3 s), the participant's speech is considered to have ended. The start time and end time of the audio data are recorded, and the audio data is sent to the data recognition part together with the start time (optionally also the end time) and the device identifier of the microphone. The device identifier is transmitted so that the different conference parties can be distinguished in a multi-party conference.
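The intensity-based judgement of speaking versus pausing might look like the sketch below. It assumes the audio has already been reduced to one intensity value per fixed-length frame and that a fixed threshold separates speech from silence; neither detail is given in the text, and the function name segment_speech is hypothetical. Only the rule itself (a pause longer than about 3 s ends the speech) comes from the description.

```python
def segment_speech(intensities, frame_seconds, threshold, max_pause_seconds=3.0):
    """Split per-frame intensities into (start, end) times of individual speeches."""
    segments = []
    start = None        # start time of the speech in progress, if any
    last_voiced = None  # end time of the most recent frame at or above the threshold
    for i, level in enumerate(intensities):
        t = i * frame_seconds
        if level >= threshold:
            if start is None:
                start = t                      # speech begins: record the start time
            last_voiced = t + frame_seconds
        elif start is not None and t + frame_seconds - last_voiced > max_pause_seconds:
            segments.append((start, last_voiced))  # pause too long: speech has ended
            start = None
    if start is not None:
        segments.append((start, last_voiced))  # close a speech still open at the end
    return segments
```

Short pauses inside a speech do not split it; only a silence strictly longer than max_pause_seconds closes the segment, which mirrors the "exceeds a certain duration" wording.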

The camera captures images of the participant who is currently speaking. When a participant starts speaking, the camera collects images of that participant, and the collected images are used to judge whether the participant is speaking or pausing. If a pause exceeds a certain duration (for example, 3 s), the participant's speech is considered to have ended. The start time and end time of the speech are recorded, and the collected images are sent to the data recognition part together with the start time (optionally also the end time) and the device identifier of the camera. The device identifier is transmitted so that the different conference parties can be distinguished in a multi-party conference.

To keep the times recorded by the microphone and the camera consistent, the microphone and the camera use the same standard time.

The data recognition part includes a speech and voiceprint recognition module and a face and expression recognition module.

The speech and voiceprint recognition module receives the audio data captured by the microphone, the start time (optionally also the end time), and the device identifier of the microphone. It processes the captured audio data using speech recognition technology to convert the audio-format data into text information in text format, and processes the captured audio data using voiceprint recognition technology to identify the participant who is currently speaking.

The face and expression recognition module receives the images captured by the camera, the start time (optionally also the end time), and the device identifier of the camera. It processes the captured images using expression recognition technology to recognize the expression of the participant who is currently speaking and obtain that participant's emotion at the time, and processes the captured images using face recognition technology to identify the participant who is currently speaking.

Voiceprint recognition technology yields the voiceprint feature data of the participant who is currently speaking, and face recognition technology yields that participant's facial feature data. The speech and voiceprint recognition module can therefore identify and distinguish the participants by their voiceprint feature data, and the face and expression recognition module can do so by their facial feature data. In addition, the speech and voiceprint recognition module can use the device identifier of the microphone, and the face and expression recognition module can use the device identifier of the camera, to identify and distinguish the different conference parties.

During processing, the speech and voiceprint recognition module and the face and expression recognition module assign a unique identity label to each participant who speaks. For example, in a single-party conference, labels such as "participant A", "participant B", and "participant C" can be assigned; in a multi-party conference, labels such as "participant A1", "participant B2", and "participant C1" can be assigned, where the first character ("A", "B", "C") denotes the conference party and the second character ("1", "2") denotes a particular participant within that party. To make it easier for the data organization part to assemble the written record, the two modules should assign the same identity label to the same participant; for example, the labels identified by the two modules can be unified according to the received start time.
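As a rough illustration of the labelling scheme and of unifying the two modules' labels by start time, consider the sketch below. The function names, the zero-based indices, and the representation of each module's output as a dict from start time to label are assumptions for the example; letters also run out after 26 parties or participants, which a real system would need to handle.

```python
import string


def make_label(party_index, member_index, multi_party):
    """Build labels such as 'Participant A' (one party) or 'Participant B2' (several)."""
    if multi_party:
        # First character: conference party; second character: participant in that party.
        return f"Participant {string.ascii_uppercase[party_index]}{member_index + 1}"
    return f"Participant {string.ascii_uppercase[member_index]}"


def unify_labels(voice_results, face_results):
    """Map each face-module label to the voice-module label with the same start time.

    Both arguments are dicts of start time -> label assigned by that module.
    """
    mapping = {}
    for start, face_label in face_results.items():
        if start in voice_results:
            mapping[face_label] = voice_results[start]
    return mapping
```

Exact equality of start times is workable here only because the microphone and camera are described as sharing the same standard time; with less precise clocks a tolerance window would be needed.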

For ease of lookup and management, a data table recording each participant's identity label, voiceprint feature data, and facial feature data can be set up to hold the information of the participants who speak. The data table is stored in a storage module (not shown in the figures). The voiceprint feature data and facial feature data can be obtained through voiceprint recognition and face recognition technology. For example, with face recognition based on locating the points of a 21-point model, the features of a face can be described by these key points, and the accuracy can reach 96%.

The data organization part takes the data obtained by the data recognition part, such as the text information, the start time, the participant's identity label, and the participant's emotion, organizes it in a given format, and generates and saves a written record containing the start time, the participant's identity label, the participant's emotion, and the text information.

For ease of reading, the content of each participant is recorded on a new line.

The format can be, for example:

【Date】

【Hour:Minute:Second】 【Participant】 (【Emotion】): "【Text】"

【Date】

【Hour:Minute:Second】 【Participant】 (【Emotion】): "【Text】"

……

In the format above, the symbol "【】" denotes content obtained from the parts described earlier;

The symbol "()" denotes optional content, which is added only when the data is available;

The symbol ":" indicates that what follows is the content of the speech.

For example:

November 15, 2016

09:24:12 Participant A: "Let's discuss problem a first"

November 15, 2016

09:24:16 Participant B (angry): "Problem a has already been discussed; we should discuss problem b"

……
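A single entry in this format could be produced along the following lines. The function name format_entry is hypothetical, and the ISO-style date (2016-11-15) is an assumption chosen to keep the example locale-independent; the optional emotion in parentheses follows the rule that it is added only when the data is available.

```python
from datetime import datetime


def format_entry(when, participant, text, emotion=None):
    """Return one two-line entry: the date, then time, participant, emotion, text."""
    mood = f" ({emotion})" if emotion else ""   # optional part, only if data exists
    return f'{when:%Y-%m-%d}\n{when:%H:%M:%S} {participant}{mood}: "{text}"'
```

For instance, calling format_entry with a 09:24:16 timestamp, "Participant B", and the emotion "angry" yields a date line followed by a line beginning `09:24:16 Participant B (angry):`.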

To simplify the record, entries with the same date can be merged. For example, the record above can be merged into:

November 15, 2016

09:24:12 Participant A: "Let's discuss problem a first"

……
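Merging entries that share a date amounts to grouping consecutive entries by day and printing the date once per group. The sketch below assumes the entries are already in time order and are held as tuples; merge_by_date and the tuple layout are hypothetical, and the ISO date format is again an assumption.

```python
from datetime import datetime
from itertools import groupby


def merge_by_date(entries):
    """entries: time-ordered list of (datetime, participant, emotion_or_None, text)."""
    lines = []
    # groupby merges only consecutive items, so the input must be sorted by time.
    for day, group in groupby(entries, key=lambda e: e[0].date()):
        lines.append(day.isoformat())            # the date appears once per day
        for when, who, emotion, text in group:
            mood = f" ({emotion})" if emotion else ""
            lines.append(f'{when:%H:%M:%S} {who}{mood}: "{text}"')
    return "\n".join(lines)
```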

The data supplement and correction part is used to supplement and correct the written record saved by the data organization part, in order to improve the readability of the written record and ensure its correctness. For example, it provides an interface and prompts that make it convenient for users to supplement and correct the saved written record, and it records the name of the person making each supplement or correction, the time it was made, and its content, for later reference.
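One possible way to keep the who, when, and what of each correction is a small revision log stored alongside the record, as sketched below. The class and field names (Record, Revision, amend) are hypothetical; the patent only states that the name, time, and content of each supplement or correction are recorded.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Revision:
    editor: str      # name of the person who made the change
    when: datetime   # time the change was made
    line_no: int     # which line of the record was changed
    old: str         # content before the change
    new: str         # content after the change


@dataclass
class Record:
    lines: List[str]
    history: List[Revision] = field(default_factory=list)

    def amend(self, line_no, new_text, editor, when=None):
        """Replace one line and log who changed it, when, and what changed."""
        when = when or datetime.now()
        self.history.append(Revision(editor, when, line_no, self.lines[line_no], new_text))
        self.lines[line_no] = new_text
```

Keeping the old text in each revision means the original machine-generated line can still be consulted after a correction.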

Supplements can take the following form, for example: an input box is provided in which the user enters information such as the theme of the conference, a summary of the issues, and the conclusions of the conference, so that others can quickly understand the conference content.

Corrections can include, for example, fixing text errors in the written record or replacing certain information. For example, if three participants were recognized as speaking over the whole conference, a prompt box pops up asking the user whether the identity labels "participant A, B, C" should be replaced with the participants' real names.
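Replacing identity labels with real names is essentially a guarded text substitution. The sketch below is one way to do it, assuming the record is a plain string and the user has supplied a label-to-name mapping; replace_labels is a hypothetical name, and the names in any mapping are of course supplied by the user, not by the system.

```python
import re


def replace_labels(record_text, real_names):
    """Replace identity labels with real names; labels without a name are kept."""
    if not real_names:
        return record_text
    # Longest labels first, and \b on both sides, so that 'Participant A'
    # never matches inside 'Participant A1'.
    labels = sorted(real_names, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, labels)) + r")\b")
    return pattern.sub(lambda m: real_names[m.group(1)], record_text)
```

The word-boundary guard matters in the multi-party case, where single-party style labels would otherwise be prefixes of labels such as "Participant A1".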

Embodiment 2:

The present invention also proposes a method for converting audio and video data into a written record. The flow chart of the method is shown in Fig. 3, and the method includes the following steps:

Step S21, data collection:

When a participant starts speaking, the microphone collects that participant's audio data, and the intensity of the collected audio is used to judge whether the participant is speaking or pausing. If a pause exceeds a certain duration (for example, 3 s), the participant's speech is considered to have ended. The start time and end time of the audio data are recorded, and the audio data is passed to the data recognition step together with the start time (optionally also the end time) and the device identifier of the microphone. The device identifier is transmitted so that the different conference parties can be distinguished in a multi-party conference.

The camera captures images of the participant who is currently speaking. When a participant starts speaking, the camera collects images of that participant, and the collected images are used to judge whether the participant is speaking or pausing. If a pause exceeds a certain duration (for example, 3 s), the participant's speech is considered to have ended. The start time and end time of the speech are recorded, and the collected images are passed to the data recognition step together with the start time (optionally also the end time) and the device identifier of the camera. The device identifier is transmitted so that the different conference parties can be distinguished in a multi-party conference.

Step S22, data recognition:

The audio data captured by the microphone, the start time (optionally also the end time), and the device identifier of the microphone are received. The captured audio data is processed using speech recognition technology to convert the audio-format data into text information in text format, and is processed using voiceprint recognition technology to identify the participant who is currently speaking.

The images captured by the camera, the start time (optionally also the end time), and the device identifier of the camera are received. The captured images are processed using expression recognition technology to recognize the expression of the participant who is currently speaking and obtain that participant's emotion at the time, and are processed using face recognition technology to identify the participant who is currently speaking.

During processing, the speech and voiceprint recognition module and the face and expression recognition module assign a unique identity label to each participant who speaks. For example, in a single-party conference, labels such as "participant A", "participant B", and "participant C" can be assigned; in a multi-party conference, labels such as "participant A1", "participant B2", and "participant C1" can be assigned, where the first character ("A", "B", "C") denotes the conference party and the second character ("1", "2") denotes a particular participant within that party. To make it easier to organize the text information, the two modules should assign the same identity label to the same participant; for example, the labels identified by the two modules can be unified according to the received start time.

For ease of lookup and management, a data table recording each participant's identity label, voiceprint feature data, and facial feature data can be set up to hold the information of the participants who speak, ensuring that each participant's identity label is unique and that the voiceprint feature data and facial feature data correspond consistently to the identity label of the right participant. The voiceprint feature data and facial feature data can be obtained through voiceprint recognition and face recognition technology. For example, with face recognition based on locating the points of a 21-point model, the features of a face can be described by these key points, and the accuracy can reach 96%.

Step S23, data organization:

The data obtained in the data recognition step, such as the text information, the start time, the participant's identity label, and the participant's emotion, is organized in a given format, and a written record containing the start time, the participant's identity label, the participant's emotion, and the text information is generated and saved in dialogue form.

The format can be, for example:

【Date】

【Hour:Minute:Second】 【Participant】 (【Emotion】): "【Text】"

……

In the format above, the symbol "【】" denotes content obtained from the preceding steps;

The symbol ":" indicates that what follows is the content of the speech.

For example:

November 15, 2016

09:24:12 Participant A: "Let's discuss problem a first"

November 15, 2016

……

To simplify the record and save storage space, entries with the same date can be merged. For example, the record above can be merged into:

November 15, 2016

09:24:12 Participant A: "Let's discuss problem a first"

……

Step S24, data supplement and correction:

The written record saved in the data organization step is supplemented and corrected, in order to improve its readability and ensure its correctness. For example, an interface and prompts are provided that make it convenient for users to supplement and correct the saved written record, and the name of the person making each supplement or correction, the time it was made, and its content are recorded for later reference.

Embodiment 3:

The present invention also proposes a method for converting audio and video data into a written record. The flow chart of the method is shown in Fig. 4, and the method includes the following steps:

Step S30, preparation:

Start the microphone and camera, create a participant list, and create the file address for saving the text. The participant list includes each participant's unique identity label, as well as the participant's voiceprint feature data and facial feature data, which are collected later.

Each participant is given a unique identity label. For example, in a single-party conference, labels such as "participant A", "participant B", and "participant C" can be assigned; in a multi-party conference, labels such as "participant A1", "participant B2", and "participant C1" can be assigned, where the first character ("A", "B", "C") denotes the conference party and the second character ("1", "2") denotes a particular participant within that party.

Step S31: when a participant starts speaking, the microphone captures the speech of the current participant to obtain that participant's audio data, and the start time of the audio data is recorded; at the same time, the camera captures images of the current participant to obtain that participant's video data, and the start time of the video data is recorded.

Step S32: analyze the audio data obtained by the microphone using voiceprint recognition technology and perform voiceprint feature recognition; analyze the video data obtained by the camera using face recognition technology and perform facial feature recognition.

Step S34: determine whether the participant list already contains a participant matching the voiceprint feature data and/or the facial feature data. If so, go to step S35 and, at the same time, complete that participant's information (that is, if the participant list holds only voiceprint feature data and no facial feature data for the participant, add the facial feature data; if it holds only facial feature data and no voiceprint feature data, add the voiceprint feature data). If not, add a new entry to the participant list, save the new participant's identity label, voiceprint feature data, and facial feature data, and go to step S35.
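The completion rule in step S34 (fill in whichever feature the list is missing for a matched participant) can be sketched as follows. The list-of-dicts layout, the pluggable match function that defaults to plain equality, and the "Participant N" labels are assumptions for illustration; a real matcher would compare feature data approximately rather than exactly.

```python
def update_participant_list(participants, voiceprint=None, face=None,
                            match=lambda a, b: a == b):
    """participants: list of dicts with keys 'label', 'voiceprint', 'face'."""
    for p in participants:
        voice_hit = (voiceprint is not None and p["voiceprint"] is not None
                     and match(voiceprint, p["voiceprint"]))
        face_hit = (face is not None and p["face"] is not None
                    and match(face, p["face"]))
        if voice_hit or face_hit:
            # Matched an existing participant: supplement whichever feature is missing.
            if p["voiceprint"] is None:
                p["voiceprint"] = voiceprint
            if p["face"] is None:
                p["face"] = face
            return p["label"]
    # No match: add a new entry with a new identity label (hypothetical naming).
    label = f"Participant {len(participants) + 1}"
    participants.append({"label": label, "voiceprint": voiceprint, "face": face})
    return label
```

After a participant has been seen once with only a voiceprint and once with both features, the entry holds both, so later utterances can be matched by either one.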

Step S37：Video data is processed using Expression Recognition technical Analysis, the mood of the participant of current speech is obtained；

Step S38：Obtain current date and time；

Step S39：By identity label, the voice of the participant of date, time, matching vocal print and/or face feature data The word of identification, the mood of Expression Recognition are organized into the writing record of dialogic operation, and are stored in the file address of establishment；

The form for example can be：

【Date】

【Hour Minute Second】【Participant】(【Mood】)：“【Text】”

……

Symbol "：" represent the content for being followed by speech.

For example：

November 15, 2016

09:24:12 Participant A：“Let's take a vote first”

……
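
As a purely illustrative sketch, and not part of the disclosed embodiment, the record format above could be produced as follows in Python. The function names `format_date_header`, `format_entry` and `render_transcript` are hypothetical, as is the choice to print ASCII punctuation and an ISO-style date. The mood is treated as optional, because the example entry above carries none.

```python
from datetime import datetime
from typing import Iterable, List, Optional, Tuple

# One utterance: (time of speech, participant label, recognized text, mood or None).
Entry = Tuple[datetime, str, str, Optional[str]]


def format_date_header(moment: datetime) -> str:
    """Return the date line that opens a group of entries (assumed ISO style)."""
    return f"{moment.year:04d}-{moment.month:02d}-{moment.day:02d}"


def format_entry(moment: datetime, participant: str, text: str,
                 mood: Optional[str] = None) -> str:
    """Format one utterance as: HH:MM:SS Participant (Mood): "Text"."""
    mood_part = f" ({mood})" if mood else ""
    return f'{moment:%H:%M:%S} {participant}{mood_part}: "{text}"'


def render_transcript(entries: Iterable[Entry]) -> str:
    """Join entries, emitting a date header whenever the date changes."""
    lines: List[str] = []
    last_date = None
    for moment, participant, text, mood in entries:
        if moment.date() != last_date:
            lines.append(format_date_header(moment))
            last_date = moment.date()
        lines.append(format_entry(moment, participant, text, mood))
    return "\n".join(lines)
```

Grouping entries under a single date header keeps each line short, which matches the dialogue layout described above.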

Step S40：Return to step S31 and continue, saving the content of each participant's speech, until the whole meeting ends.
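
The match-or-add logic of step S34 can be sketched as below. This is a minimal illustration under stated assumptions, not the method of the embodiment: the source does not say how features are compared, so cosine similarity with a threshold of 0.9 is assumed here, features are assumed to be plain lists of floats, and the names `Participant`, `cosine_similarity` and `match_or_add` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

Feature = Sequence[float]  # assumed representation of a voiceprint or face feature


@dataclass
class Participant:
    label: str
    voiceprint: Optional[Feature] = None
    face: Optional[Feature] = None


def cosine_similarity(a: Feature, b: Feature) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def _matches(stored: Optional[Feature], observed: Optional[Feature],
             threshold: float) -> bool:
    """True only when both features exist and are similar enough."""
    if stored is None or observed is None:
        return False
    return cosine_similarity(stored, observed) >= threshold


def match_or_add(participants: List[Participant],
                 voiceprint: Optional[Feature],
                 face: Optional[Feature],
                 threshold: float = 0.9) -> Participant:
    """Return the matching participant, or append and return a new one."""
    if voiceprint is None and face is None:
        # Assumption: this step is only reached when a feature was recognized.
        raise ValueError("at least one feature is required")
    for p in participants:
        if _matches(p.voiceprint, voiceprint, threshold) or _matches(p.face, face, threshold):
            # Supplement whichever feature the stored entry is still missing.
            if p.voiceprint is None and voiceprint is not None:
                p.voiceprint = voiceprint
            if p.face is None and face is not None:
                p.face = face
            return p
    # Assumed labeling scheme: number participants in order of first appearance.
    new = Participant(f"Participant {len(participants) + 1}", voiceprint, face)
    participants.append(new)
    return new
```

Matching on either feature lets an entry first created from voice alone be completed later, once the same participant's face is recognized.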

The embodiments of the present invention have been described above. However, the present invention is not limited to the above embodiments. Any modification, equivalent substitution, improvement and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A system for converting audio and video data into a written record, characterized in that it comprises a data collection part, a data recognition part and a data organization part；

The voice and voiceprint identification module processes the audio data captured by the audio collection module and converts it into text information in text format, and processes the audio data captured by the audio collection module to identify the current speaker and assign an identity label to each speaker；

The face and expression recognition module processes the image captured by the video acquisition module using expression recognition technology, recognizes the expression of the current speaker and obtains the mood of the speaker, and processes the image captured by the video acquisition module to identify the current speaker and assign an identity label to each speaker；

The data organization part generates the written record according to the text information, the start time of recognition, the identity label of the current speaker and the mood of the current speaker.

2. The system according to claim 1, characterized in that the written record is a written record in dialogue form.

3. The system according to claim 1 or 2, characterized in that the system further comprises a data supplement and correction part, which is used to supplement and correct the written record.

4. The system according to claim 1 or 2, characterized in that the audio collection module is a microphone, and/or the video acquisition module is a camera.

5. The system according to claim 1 or 2, characterized in that the system further comprises a storage module, which stores a data table recording the identity label, voiceprint feature data and facial feature data of each speaker.
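
By way of illustration only, the data table of the storage module could be sketched with Python's standard `sqlite3` module as follows. The table name `speakers`, its column names, and the JSON encoding of feature data are all assumptions made for this sketch; the source specifies only that the table records the identity label, voiceprint feature data and facial feature data.

```python
import json
import sqlite3


def create_speaker_table(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) speakers table if it does not exist yet."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS speakers (
            identity_label TEXT PRIMARY KEY,
            voiceprint     TEXT,  -- JSON-encoded feature data, may be NULL
            face           TEXT   -- JSON-encoded feature data, may be NULL
        )
        """
    )


def save_speaker(conn, label, voiceprint=None, face=None) -> None:
    """Insert a speaker, or overwrite the row that has the same label."""
    conn.execute(
        "INSERT OR REPLACE INTO speakers VALUES (?, ?, ?)",
        (
            label,
            json.dumps(voiceprint) if voiceprint is not None else None,
            json.dumps(face) if face is not None else None,
        ),
    )


def load_speaker(conn, label):
    """Return the stored row as a dict, or None if the label is unknown."""
    row = conn.execute(
        "SELECT voiceprint, face FROM speakers WHERE identity_label = ?",
        (label,),
    ).fetchone()
    if row is None:
        return None
    voiceprint, face = (json.loads(v) if v is not None else None for v in row)
    return {"identity_label": label, "voiceprint": voiceprint, "face": face}
```

Leaving either feature column nullable allows a row to exist with only one kind of feature data until the other becomes available.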

6. A method for converting audio and video data into a written record, characterized in that it comprises the following steps：

Step S21, data collection：Capture the audio data of the current speaker and the image of the current speaker, and record the time at which the speaker starts speaking；

Step S22, data recognition：Process the captured audio data and convert it into text information in text format；process the captured image, recognize the expression of the current speaker and obtain the mood of the speaker；process the captured audio data and/or the captured image to identify the current speaker and assign an identity label to each speaker；

Step S23, data organization：The data organization part generates the written record according to the text information, the start time of recognition, the identity label of the current speaker and the mood of the current speaker.

7. The method according to claim 6, characterized in that the written record is a written record in dialogue form.

8. The method according to claim 6 or 7, characterized in that the method further comprises step S24, data supplement and correction：the written record is supplemented and corrected.

9. The method according to claim 6 or 7, characterized in that the identity label is stored in a storage module in association with the voiceprint feature data and/or facial feature data of the speaker；before an identity label is assigned to a speaker, the storage module is first searched for a stored identity label matching that speaker, and if none is found, an identity label is assigned to the speaker.

10. A method for converting audio and video data into a written record, characterized in that it comprises the following steps：

Step S30, preparation：Start the microphone and the camera, create a speaker list, and create a file address for saving the text, wherein each entry of the speaker list comprises a unique identity label, the voiceprint feature data of the speaker and the facial feature data of the speaker；

Step S31, data capture：When a speaker starts speaking, the microphone captures the voice input of the current speaker to obtain that speaker's audio data, and the start time of the audio data is recorded；at the same time, the camera captures the image of the current speaker to obtain that speaker's video data, and the start time of the video data is recorded；

Step S32：The audio data of the current speaker obtained by the microphone is analyzed and processed using voiceprint recognition technology, and voiceprint feature recognition is performed；the video data of the current speaker obtained by the camera is analyzed and processed using face recognition technology, and facial feature recognition is performed；

Step S33：Determine whether voiceprint feature data and facial feature data have been successfully recognized. If voiceprint feature data and/or facial feature data have been successfully recognized, proceed to step S34；if neither voiceprint feature data nor facial feature data could be recognized, proceed to step S35；

Step S34：Determine whether the speaker list already contains a speaker matching the voiceprint feature data and/or the facial feature data. If so, proceed to step S35 and at the same time complete that speaker's information；if not, add a new entry to the speaker list, save the new speaker's identity label, voiceprint feature data and facial feature data, and proceed to step S35；

Step S35：The audio data is analyzed and processed using speech recognition technology, and the audio data is converted into text；

Step S36：Determine whether the audio data has been successfully converted into text；if so, proceed to step S37；if not, return to step S35 and continue the conversion；

Step S37：The video data is analyzed and processed using expression recognition technology to obtain the mood of the current speaker；

Step S38：Obtain current date and time；

Step S39：The date, the time, the identity label of the speaker whose voiceprint and/or facial feature data were matched, the text produced by speech recognition, and the mood produced by expression recognition are organized into a written record in dialogue form and stored at the file address created earlier；

Step S40：Return to step S31 and continue, saving the content of each speaker's speech, until the conversation ends.
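
The control flow of the steps above can be sketched in Python as below. This is a hedged illustration rather than the claimed method itself: the recognizers are passed in as callables (`identify`, `recognize`, `read_mood`, all hypothetical names) because the source names no concrete recognition library, and the retry cap `max_attempts` is an assumption added so that the return from step S36 to step S35 always terminates.

```python
from datetime import datetime
from typing import Callable, Iterable, List, Optional, Tuple


def transcribe_with_retry(audio: bytes,
                          recognize: Callable[[bytes], Optional[str]],
                          max_attempts: int = 3) -> Optional[str]:
    """Steps S35/S36: repeat speech-to-text until it yields text.

    The cap on attempts is an assumption; the steps state no limit.
    """
    for _ in range(max_attempts):
        text = recognize(audio)
        if text:
            return text
    return None


def process_utterance(audio: bytes, frame: bytes,
                      identify: Callable[[bytes, bytes], str],
                      recognize: Callable[[bytes], Optional[str]],
                      read_mood: Callable[[bytes], Optional[str]],
                      clock: Callable[[], datetime] = datetime.now) -> Optional[str]:
    """Turn one captured utterance into one line of the record."""
    label = identify(audio, frame)                   # S32 to S34: who is speaking
    text = transcribe_with_retry(audio, recognize)   # S35 and S36: speech to text
    if text is None:
        return None                                  # give up on this utterance
    mood = read_mood(frame)                          # S37: expression to mood
    now = clock()                                    # S38: current date and time
    mood_part = f" ({mood})" if mood else ""
    return f'{now:%Y-%m-%d %H:%M:%S} {label}{mood_part}: "{text}"'  # S39


def run_session(utterances: Iterable[Tuple[bytes, bytes]],
                identify, recognize, read_mood,
                clock: Callable[[], datetime] = datetime.now) -> List[str]:
    """S31 and S40: handle each captured utterance until the input ends."""
    record: List[str] = []
    for audio, frame in utterances:
        line = process_utterance(audio, frame, identify, recognize, read_mood, clock)
        if line is not None:
            record.append(line)
    return record
```

Injecting the recognizers and the clock keeps the flow testable without a microphone or camera; a real system would supply its own implementations.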