CN102932673B

CN102932673B - Method, system and device for transmitting and merging a video signal and an audio signal

Info

Publication number: CN102932673B
Application number: CN201110229698.0A
Authority: CN
Inventors: 杜武平; 张启东; 欧阳彬; 向宜; 熊益斌
Original assignee: Alibaba Group Holding Ltd
Current assignee: Advanced New Technologies Co Ltd
Priority date: 2011-08-11
Filing date: 2011-08-11
Publication date: 2015-08-19
Anticipated expiration: 2031-08-11
Also published as: CN102932673A; HK1176493A1

Abstract

The invention discloses a method, system and device for transmitting and merging a video signal and an audio signal, in order to solve the problem that existing video and audio signals are out of sync, which degrades the user experience. The method adds a synchronization signal to the video signal and the audio signal that a remote user sends through different terminals, and merges the audio signal and the video signal according to this synchronization signal. Because a video signal and an audio signal received at the same moment carry the same synchronization signal, merging according to that signal keeps the video and audio synchronized in the resulting audio-video file, which meets the user's needs.

Description

Method, system and device for transmitting and merging a video signal and an audio signal

Technical field

The present invention relates to the field of Internet technology, and in particular to a method, system and device for transmitting and merging a video signal and an audio signal.

Background

When users transmit audio signals and video signals between their terminals, for example when a video signal is obtained in the form of a video call, they can hold a video chat through an instant messaging client. However, because network bandwidth is limited and a large number of data packets must be transmitted, congestion can occur. The voice signal then becomes choppy during the call, and the requirement for high-quality audio cannot be met.

To avoid this, in practice the audio and video exchanges are often carried out through different terminals and transmission paths: for example, video is exchanged through AliWangWang while audio is exchanged over an IP phone. However, because the audio and video come from different terminals and paths, saving the video and audio signals received by the user into the same file becomes a difficult technical problem.

Summary of the invention

In view of this, embodiments of the present invention provide a method, system and device for transmitting and merging a video signal and an audio signal, in order to solve the problem that video and audio signals from different terminals are difficult to save in the same file.

A method for transmitting and merging a video signal and an audio signal provided by an embodiment of the present invention comprises:

receiving a first audio signal sent by a remote user through a first terminal, and a first video signal sent through a second terminal;

adding a generated synchronization signal to the received first audio signal and first video signal;

merging the first audio signal and the first video signal according to the synchronization signal added to them.

A device for transmitting and merging a video signal and an audio signal provided by an embodiment of the present invention comprises:

a receiving module, configured to receive a first audio signal sent by a remote user through a first terminal, and a first video signal sent through a second terminal;

an adding module, configured to add a generated synchronization signal to the received first audio signal and first video signal;

a merging module, configured to merge the first audio signal and the first video signal according to the synchronization signal added to them.

A system for transmitting and merging a video signal and an audio signal provided by an embodiment of the present invention comprises:

the above device for transmitting and merging a video signal and an audio signal, a first terminal that sends the first audio signal of the remote user to the device, and a second terminal that sends the first video signal to the device.

Embodiments of the present invention provide a method, system and device for transmitting and merging a video signal and an audio signal. The method adds a synchronization signal to the video signal and the audio signal that a remote user sends through different terminals, and merges the audio signal and the video signal according to this synchronization signal. Because a video signal and an audio signal received at the same moment carry the same synchronization signal, merging according to that signal keeps the video and audio synchronized in the resulting audio-video file, which meets the user's needs.

Brief description of the drawings

Fig. 1 shows the process of transmitting and merging a video signal and an audio signal provided by an embodiment of the present application;

Fig. 2A shows a process, provided by an embodiment of the present application, of merging the audio and video signals of the remote user and the local user to generate an audio-video file;

Fig. 2B shows a process, provided by an embodiment of the present application, of merging the audio and video signals of the remote user and the local user to generate an audio-video file;

Fig. 3 is a schematic structural diagram of a device for transmitting and merging a video signal and an audio signal provided by an embodiment of the present application;

Fig. 4 is a schematic structural diagram of a system for transmitting and merging a video signal and an audio signal provided by an embodiment of the present application.

Detailed description of the embodiments

To keep merged video and audio signals synchronized, the embodiments of the present application provide a method, system and device for transmitting and merging a video signal and an audio signal. The method adds a synchronization signal to the video signal and the audio signal that a remote user sends through different terminals; that is, the corresponding synchronization signal is added to the received video signal and audio signal at the same time. When the video signal and the audio signal are merged, they are merged according to this synchronization signal, which keeps the merged video and audio synchronized, so that the merged audio-video file meets the user's needs when it is used later.

The embodiments of the present application are described in detail below with reference to the accompanying drawings.

Fig. 1 shows the process of transmitting and merging a video signal and an audio signal provided by an embodiment of the present application. The process comprises the following steps:

S101: Receive a first audio signal sent by a remote user through a first terminal, and a first video signal sent through a second terminal.

The first terminal may be a mobile terminal or a landline telephone, and the second terminal may be a terminal capable of capturing and sending video signals. Alternatively, the video signal is captured by a camera, and the second terminal sends the video information of the remote user captured by the camera to the terminal where the local user is located.

S102: Add a generated synchronization signal to the received first audio signal and first video signal.

In the embodiments of the present application, the synchronization signal is generated at a set period. To make the subsequent merging of audio and video signals more convenient, a sequence number may be assigned to each synchronization signal according to the chronological order in which the signals are generated, with sequence numbers increasing over time. Each generated synchronization signal is added to the first audio signal and the first video signal received when that synchronization signal is generated.

Because the synchronization signal is generated at a set time interval, and the video and audio signals sent by the remote user are also roughly periodic, each time a synchronization signal is generated it is added to the video signal and the audio signal from the remote user that are received at that time, marking those video and audio signals as having been received together.
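As a minimal sketch of the periodic generation described above, the following Python fragment generates a new synchronization signal once per set period and attaches the current one to whatever audio or video data arrives. The class name `SyncGenerator`, the function `stamp`, the millisecond time base and the 20 ms period are illustrative assumptions, not details given in the application.

```python
import itertools


class SyncGenerator:
    """Hypothetical periodic source of numbered synchronization signals."""

    def __init__(self, period_ms):
        self.period_ms = period_ms          # the set generation period
        self._counter = itertools.count(1)  # sequence numbers increase over time
        self._last_tick = None
        self.current_seq = 0

    def tick(self, now_ms):
        """Generate a new signal if a full period has elapsed; return the current one."""
        if self._last_tick is None or now_ms - self._last_tick >= self.period_ms:
            self.current_seq = next(self._counter)
            self._last_tick = now_ms
        return self.current_seq


def stamp(gen, now_ms, payload):
    """Attach the synchronization signal current at receive time to a payload."""
    return (gen.tick(now_ms), payload)


gen = SyncGenerator(period_ms=20)
audio = stamp(gen, 0, b"audio-0")   # first signal is generated here
video = stamp(gen, 5, b"video-0")   # same period, so same sequence number
later = stamp(gen, 20, b"audio-1")  # a full period later, next sequence number
```

Audio and video data received within the same period carry the same sequence number, which is what later allows them to be matched.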

Specifically, in this application the clock signal of the terminal may be used as the synchronization signal, generated periodically in units of the clock cycle of the terminal's central processing unit (CPU). Starting from a chosen clock cycle, CPU clock cycles are accumulated and the accumulated count is converted into nanoseconds to serve as the synchronization signal. Expressed in nanoseconds, the synchronization signal can serve as the timestamp of the video signal and the audio signal.
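The conversion from accumulated clock cycles to a nanosecond timestamp can be sketched as follows. The class name `SyncClock`, the 2 GHz clock frequency and the cycle counts are assumptions chosen for illustration; the application does not specify how the cycle count is read.

```python
from dataclasses import dataclass


@dataclass
class SyncClock:
    """Hypothetical sync-signal source that counts CPU clock cycles from a start point."""

    cpu_hz: int           # assumed CPU clock frequency in Hz
    start_cycle: int = 0  # the clock cycle chosen as the starting point

    def to_timestamp_ns(self, current_cycle: int) -> int:
        """Convert the cycles accumulated since start_cycle into nanoseconds."""
        elapsed = current_cycle - self.start_cycle
        return elapsed * 1_000_000_000 // self.cpu_hz


clock = SyncClock(cpu_hz=2_000_000_000, start_cycle=1_000)
timestamp_ns = clock.to_timestamp_ns(5_000)  # 4000 cycles at 2 GHz is 2000 ns
```

Integer arithmetic is used so that timestamps computed for the audio and video paths from the same cycle count are always identical.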

S103: Merge the first audio signal and the first video signal according to the synchronization signal added to them.

In the embodiments of the present application, synchronization signals are added to the video and audio signals from the remote user in the order in which those signals are received, so the video and audio signals can be merged according to the synchronization signals they carry. When the synchronization signals have not been assigned sequence numbers, the video and audio signals are merged one after another in the order of their synchronization signals. When sequence numbers have been assigned, the sequence number of each synchronization signal is identified, and the video signal and the audio signal with the same sequence number are merged.
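The sequence-number case above can be sketched in Python as a lookup keyed on the sequence number. The `Frame` type and the function `pair_by_sequence` are hypothetical names introduced for illustration, and the sketch only pairs the data; what "merging" a pair means (for example, writing both into one container) is left open.

```python
from typing import NamedTuple


class Frame(NamedTuple):
    seq: int        # sequence number of the synchronization signal added on receipt
    payload: bytes  # audio or video data


def pair_by_sequence(audio, video):
    """Pair audio and video frames that carry the same sequence number."""
    video_by_seq = {frame.seq: frame for frame in video}
    return [(a, video_by_seq[a.seq]) for a in audio if a.seq in video_by_seq]


audio = [Frame(1, b"a1"), Frame(2, b"a2"), Frame(3, b"a3")]
video = [Frame(2, b"v2"), Frame(1, b"v1")]  # arrival order may differ
pairs = pair_by_sequence(audio, video)      # frame 3 has no video partner yet
```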

Specifically, because the video signal and the audio signal are both carried in data frames, the synchronization signal is added, as the signals are received, at the frame header or the frame tail of each data frame of the video signal and the audio signal. The position must be the same for both signals: either the frame header in both or the frame tail in both.
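A minimal sketch of adding the synchronization signal at the frame header or frame tail is shown below. The 8-byte big-endian layout and the function names `tag_frame` and `read_tag` are assumptions made for this example; the application does not define a byte layout.

```python
import struct

SYNC_FMT = ">Q"  # assumed layout: 8-byte big-endian unsigned timestamp
SYNC_LEN = struct.calcsize(SYNC_FMT)


def tag_frame(payload: bytes, sync_ns: int, at_head: bool = True) -> bytes:
    """Add the synchronization signal at the frame header or the frame tail."""
    tag = struct.pack(SYNC_FMT, sync_ns)
    return tag + payload if at_head else payload + tag


def read_tag(frame: bytes, at_head: bool = True):
    """Recover (sync_ns, payload); at_head must match the value used when tagging."""
    if at_head:
        raw, payload = frame[:SYNC_LEN], frame[SYNC_LEN:]
    else:
        raw, payload = frame[-SYNC_LEN:], frame[:-SYNC_LEN]
    return struct.unpack(SYNC_FMT, raw)[0], payload


# Audio and video frames use the same position (here: the frame header).
audio_frame = tag_frame(b"pcm-bytes", 2000)
video_frame = tag_frame(b"pixel-bytes", 2000)
```

Using one position for both signals lets a single reader routine handle audio and video frames alike.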

In the embodiments of the present application, when the local user and the remote user carry out video and voice communication, the remote user uses two terminals to send the audio signal and the video signal to the local user. Specifically, the remote user uses the first terminal to send the audio signal to the local user's terminal, and uses the second terminal to send the video signal to the local user's terminal. The first terminal may be a mobile terminal or a landline telephone. The second terminal may be a terminal capable of capturing and sending video signals, for example a terminal on which the AliWangWang client is installed and which can capture video information through a camera. In the embodiments of the present application, only the client's ability to process video signals is used.

The local user's terminal can receive the audio signal sent by the remote user through the first terminal and the video signal sent through the second terminal. To receive the audio signal, the local user's terminal may have an audio communication client installed that receives and sends audio signals, for example a client with Voice over Internet Protocol (VoIP) capability. To receive the video signal, the local user's terminal may have a video communication client installed, such as the AliWangWang client. Again, only the client's ability to process video signals is used in the embodiments of the present application.

Specifically, when the remote user conducts voice communication with the local user's terminal through the first terminal, for example a landline telephone, the remote user sends the audio signal through the first terminal. After receiving the audio signal, the first terminal sends it to the public switched telephone network (PSTN), which forwards it to a voice gateway. Because the local user's terminal has VoIP capability, the voice gateway sends the remote user's audio signal to the IP network, which delivers it to the local user's terminal.

An unstable IP network can cause jitter in the audio data packets delivered to the local user's terminal. To counter this, in the embodiments of the present application, audio data packets that arrive at the local user's terminal are buffered, and after they have been buffered for a certain length of time the buffered audio packets are extracted.
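The buffering step can be sketched as a simple jitter buffer that holds each packet for a fixed time before releasing it. The class `JitterBuffer`, the 60 ms hold time, and the choice to release packets in packet-sequence order are assumptions for illustration; the application only states that packets are buffered for a certain length of time and then extracted.

```python
import heapq


class JitterBuffer:
    """Hypothetical fixed-delay jitter buffer for audio data packets."""

    def __init__(self, hold_ms: int):
        self.hold_ms = hold_ms
        self._heap = []  # entries are (packet_seq, arrival_ms, payload)

    def push(self, packet_seq: int, arrival_ms: int, payload: bytes) -> None:
        """Buffer a packet as it arrives from the IP network."""
        heapq.heappush(self._heap, (packet_seq, arrival_ms, payload))

    def pop_ready(self, now_ms: int):
        """Release, in sequence order, packets that have been held long enough."""
        ready = []
        while self._heap and now_ms - self._heap[0][1] >= self.hold_ms:
            seq, _, payload = heapq.heappop(self._heap)
            ready.append((seq, payload))
        return ready


buf = JitterBuffer(hold_ms=60)
buf.push(2, 0, b"b")   # packet 2 arrives before packet 1
buf.push(1, 10, b"a")
buf.push(3, 50, b"c")
first_batch = buf.pop_ready(70)    # packets 1 and 2, reordered
second_batch = buf.pop_ready(110)  # packet 3, once its hold time has passed
```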

To prevent packet loss from making the audio stream discontinuous, in the embodiments of the present application, before the first audio signal and the first video signal are merged, when the local user's terminal detects that a data packet of the received audio signal has been dropped, it replaces the dropped packet with a packet containing a silence signal.
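The replacement of dropped packets can be sketched as follows. The function name `conceal_losses`, the use of packet sequence numbers to detect gaps, and the assumption that the audio is signed 16-bit PCM (for which all-zero bytes are silence) are illustrative choices not stated in the application.

```python
def conceal_losses(packets: dict, first_seq: int, last_seq: int, frame_bytes: int):
    """Return the packet payloads in order, substituting silence for missing ones.

    packets maps packet sequence number -> payload bytes.
    """
    silence = bytes(frame_bytes)  # all-zero bytes: silence for signed 16-bit PCM
    return [packets.get(seq, silence) for seq in range(first_seq, last_seq + 1)]


received = {1: b"\x01\x02", 3: b"\x03\x04"}  # packet 2 was dropped
stream = conceal_losses(received, first_seq=1, last_seq=3, frame_bytes=2)
```

Because every gap is filled with a packet of the same length, the audio stream keeps its original duration and stays aligned with the video.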

The audio signals received by the local user's terminal experience varying delay and network jitter, so the audio stream currently played to the local user may be discontinuous. Even so, because the audio signal is buffered and dropped packets are replaced with packets containing a silence signal before the processed audio signal is merged with the video signal, the effects of network jitter and discontinuity of the audio stream are effectively avoided in the merged result.

After receiving the video signal sent by the remote user through the second terminal, the local user's terminal adds the generated synchronization signal to it. Preferably, to ensure that the video and audio signals are subsequently merged accurately, the video signal is sent frame by frame, and a corresponding synchronization signal is added to every video frame.

In addition, the embodiments of the present application can record the complete video and audio interaction between the local user and the remote user. Here the remote user's video signal is the first video signal and the remote user's audio signal is the first audio signal, while the local user's video signal is the second video signal and the local user's audio signal is the second audio signal. The process of generating the complete merged audio-video file comprises: receiving the second audio signal and the second video signal of the local user; adding a generated synchronization signal to the received second audio signal and second video signal; merging the received first audio signal and second audio signal according to the synchronization signals added to them, to obtain a merged audio signal that retains the synchronization signal; merging the received first video signal and second video signal according to the synchronization signals added to them, to obtain a merged video signal that also retains the synchronization signal; and merging the merged audio signal and the merged video signal according to the synchronization signals retained in them.

Fig. 2A shows the detailed merging process of the video signal and the audio signal provided by an embodiment of the present application. The process comprises the following steps:

S201: Receive the first audio signal sent by the remote user through the first terminal, and the first video signal sent through the second terminal.

S202: Add a generated synchronization signal to the first audio signal and the first video signal.

S203: Receive the second audio signal and the second video signal of the local user.

S204: Add a generated synchronization signal to the second audio signal and the second video signal.

S205: Merge the received first audio signal and second audio signal according to the synchronization signals added to them, to obtain a merged audio signal that retains the synchronization signal; and merge the received first video signal and second video signal according to the synchronization signals added to them, to obtain a merged video signal that also retains the synchronization signal.

S206: Merge the merged audio signal and the merged video signal according to the synchronization signals retained in them.

In the above embodiment, steps S201 to S202 and steps S203 to S204 may be performed in either order.

In addition, when merging the audio and video signals in the embodiments of the present application, the audio signals may be merged first, using the received first audio signal of the remote user and the second audio signal of the local user. Specifically, after the first audio signal sent by the first terminal is received, the currently generated synchronization signal is added to it, and it is determined whether any data packet in the first audio signal has been dropped. If so, the dropped packet is replaced with a packet containing a silence signal. When the second audio signal input by the local user is received, the currently generated synchronization signal is added to it. The first audio signal and the second audio signal are then merged according to the synchronization signals added to them, to obtain the merged audio signal. During merging, the synchronization signal is retained in the merged audio signal so that it can be used later when merging with the merged video signal.
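One way to realize the audio merge described above is to mix the samples of the two audio signals that carry the same synchronization signal. The following sketch assumes signed 16-bit PCM samples and additive mixing with clipping; both are assumptions, since the application does not say how the two audio signals are combined. The names `mix_pcm16` and `merge_audio` are hypothetical.

```python
def mix_pcm16(first, second):
    """Mix two equal-length lists of signed 16-bit samples, clipping on overflow."""
    return [max(-32768, min(32767, a + b)) for a, b in zip(first, second)]


def merge_audio(remote, local):
    """Merge two audio signals given as dicts of sync value -> sample list.

    The sync value is retained as the key of the merged result so that it
    can be used later when merging with the video signal.
    """
    merged = {}
    for sync in sorted(remote.keys() | local.keys()):
        a, b = remote.get(sync), local.get(sync)
        if a is not None and b is not None:
            merged[sync] = mix_pcm16(a, b)
        else:
            merged[sync] = list(a if a is not None else b)
    return merged


remote = {0: [100, 30000], 1: [5, 5]}  # first audio signal (remote user)
local = {0: [-50, 10000], 2: [7]}      # second audio signal (local user)
merged = merge_audio(remote, local)
```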

For the merging of video signals, when the first video signal sent by the remote user through the second terminal is received, the currently generated synchronization signal is added to it. Likewise, when the second video signal sent by the local user is received, the currently generated synchronization signal is added to it. The first video signal and the second video signal are then merged according to the synchronization signals added to them, to obtain the merged video signal, and the synchronization signal is retained in the merged video signal.

When the video signal and the audio signal need to be merged, the merged video signal and the merged audio signal are merged according to the synchronization signals they carry, to obtain the merged audio-video signal.

In addition, in the embodiments of the present application, because corresponding synchronization signals are also added to the second video signal and the second audio signal of the local user, the second video signal and the second audio signal may be merged according to those synchronization signals to obtain a second audio-video signal, in which the synchronization signal is retained. Similarly, the first video signal and the first audio signal may be merged according to the synchronization signals added to them to obtain a first audio-video signal, in which the synchronization signal is also retained. The first audio-video signal and the second audio-video signal may then be merged according to the synchronization signals retained in them to obtain the merged audio-video signal. This approach is also feasible. When obtaining the merged audio-video signal, the implementation may be chosen according to actual needs and the synchronization signals added to each signal.

In the embodiments of the present application, the synchronization signal is generated periodically. Preferably, when generating the synchronization signal, the local user's terminal uses its own clock signal as the synchronization signal, generated periodically in units of the clock cycle of the terminal's central processing unit (CPU). Specifically, starting from a chosen clock cycle, CPU clock cycles are accumulated and the accumulated count is converted into nanoseconds to serve as the synchronization signal, which can be used as the timestamp of the video signal and the audio signal.

Also in the embodiments of the present application, before the merged video signal and the merged audio signal are merged, they may first be recorded separately and then merged according to the synchronization signals retained in the recorded video and audio signals. Alternatively, the video and audio signals that retain the synchronization signal may be merged in real time. The specific implementation can be chosen flexibly as required.

Fig. 2B shows a process, provided by an embodiment of the present application, of merging the audio and video signals of the remote user and the local user to generate an audio-video signal. When the remote user sends the first audio signal to the local user's terminal through a landline telephone, the PSTN equipment sends the first audio signal to the voice gateway. The voice gateway converts the first audio signal into IP audio data packets and sends these IP packets to the local user's terminal. After receiving the first audio signal, the local user's terminal adds to it a synchronization signal generated from its own clock cycles.

The local user sends the second audio signal to the local terminal through a microphone. After receiving the second audio signal, the terminal adds to it a synchronization signal generated from its own clock cycles. For the first audio signal with the synchronization signal added, the local user's terminal determines whether any data packet has been dropped. If so, it replaces the dropped packet with a packet containing a silence signal, thereby filling in the missing packets of the first audio signal.

The local user's terminal merges the packet-filled first audio signal and the second audio signal according to the synchronization signals added to them, to obtain the merged audio signal. The synchronization signal is retained in the merged audio signal, and the merged audio signal is recorded.

Specifically, when the audio signal is recorded, the merged audio signal may be recorded in a playback format supported by media players, such as WAV.
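Recording merged audio as WAV can be sketched with Python's standard `wave` module. The mono channel layout, 16-bit sample width and 8000 Hz sample rate are assumptions typical of telephone audio, not values given in the application, and `record_wav` is a hypothetical helper name. The sketch does not show how the retained synchronization signal is stored alongside the recording.

```python
import io
import struct
import wave


def record_wav(samples, sample_rate=8000):
    """Encode signed 16-bit mono samples as WAV data and return the bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as writer:
        writer.setnchannels(1)           # mono (assumed)
        writer.setsampwidth(2)           # 16-bit samples (assumed)
        writer.setframerate(sample_rate)
        writer.writeframes(struct.pack("<%dh" % len(samples), *samples))
    return buf.getvalue()


wav_bytes = record_wav([0, 1000, -1000])
```

Writing to an in-memory buffer keeps the example self-contained; passing a file path to `wave.open` instead would write the recording to disk.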

A first camera sends the captured first video signal of the remote user to the second terminal at the remote user's location, on which an instant messaging client is installed. The second terminal sends the first video signal to the IP network, which delivers it to the local user's terminal. The local user's terminal receives the first video signal through the instant messaging client installed on it. After receiving the first video signal, the local user's terminal adds the currently generated synchronization signal to it. When the second video signal of the local user is obtained through a second camera, the currently generated synchronization signal is added to the second video signal.

When the video signals are merged, the first video signal and the second video signal are merged according to the synchronization signals added to them, to obtain the merged video signal, in which the synchronization signal is retained. The merged video signal is then recorded.

Specifically, when the first video signal and the second video signal are merged, because video signals are received frame by frame and the width and height of every frame are known, the height or the width of each frame of the merged video signal is set to twice the largest height or width found in the first and second video signals. For example, when the width of each frame of the merged video signal is twice the width of each frame of the first or second video signal, the left and right halves of each merged frame hold the corresponding image frame of the first video signal and of the second video signal, respectively.
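The side-by-side layout in the example above can be sketched as follows, with each frame represented as a list of pixel rows. The function name `merge_side_by_side`, the padding of the smaller frame and the fill value are assumptions made for illustration; the application only states that the merged width is twice the larger width.

```python
def merge_side_by_side(left, right, fill=0):
    """Place two frames side by side in one frame of twice the larger width.

    Each frame is a list of rows; each row is a list of pixel values.
    The smaller frame is padded with `fill` to the larger width and height.
    """
    height = max(len(left), len(right))
    width = max(len(left[0]) if left else 0, len(right[0]) if right else 0)

    def pad(frame):
        rows = [row + [fill] * (width - len(row)) for row in frame]
        rows += [[fill] * width for _ in range(height - len(rows))]
        return rows

    return [a + b for a, b in zip(pad(left), pad(right))]


remote_frame = [[1, 2], [3, 4]]  # frame of the first video signal
local_frame = [[9]]              # smaller frame of the second video signal
merged_frame = merge_side_by_side(remote_frame, local_frame)
```

The two input frames are assumed to be the ones that carry the same synchronization signal, so each merged frame shows both users at the same moment.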

According to the synchronization signals retained in the recorded merged audio signal and the merged video signal, the local user's terminal aligns the merged video signal and the merged audio signal that carry the same synchronization signal, and merges the aligned video and audio signals to generate the audio-video signal.
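The alignment step can be sketched as interleaving the merged audio and merged video by their retained synchronization values, so that data carrying the same value ends up adjacent in the output. The function name `mux`, the tuple layout and the choice to place audio before video on ties are assumptions for illustration; producing an actual container file is outside the scope of this sketch.

```python
import heapq


def mux(audio, video):
    """Interleave audio and video by retained sync value.

    audio and video are lists of (sync_ns, payload), each sorted by sync_ns.
    Returns a list of (sync_ns, kind, payload) in playback order.
    """
    tagged_audio = ((ts, 0, "audio", p) for ts, p in audio)
    tagged_video = ((ts, 1, "video", p) for ts, p in video)
    merged = heapq.merge(tagged_audio, tagged_video)
    return [(ts, kind, p) for ts, _, kind, p in merged]


merged_audio = [(0, b"a0"), (20, b"a1")]
merged_video = [(0, b"v0"), (40, b"v1")]
av_stream = mux(merged_audio, merged_video)
```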

Specifically, in the embodiments of the present application, when the audio-video file corresponding to the audio-video signal is generated, the file may be generated as an Advanced Streaming Format (ASF) file.

Fig. 3 is a schematic structural diagram of a device for transmitting and merging a video signal and an audio signal provided by an embodiment of the present application. The device comprises:

a receiving module 31, configured to receive a first audio signal sent by a remote user through a first terminal, and a first video signal sent through a second terminal;

an adding module 32, configured to add a generated synchronization signal to the received first audio signal and first video signal;

a merging module 33, configured to merge the first audio signal and the first video signal according to the synchronization signal added to them.

The adding module 32 is specifically configured to add each generated synchronization signal to the first audio signal and the first video signal received when that synchronization signal is generated.

The synchronization signal is generated according to a set clock cycle.

The adding module 32 is specifically configured to add the synchronization signal, when the first video signal and the first audio signal are received, at the frame header or the frame tail of each data frame of the first video signal and the first audio signal. The position at which the synchronization signal is added is the same for the first video signal and the first audio signal: either the frame header in both or the frame tail in both.

A sequence number is assigned to each synchronization signal according to the chronological order in which the synchronization signals are generated.

In the device:

the receiving module 31 is further configured to receive a second audio signal and a second video signal of the local user;

the adding module 32 is further configured to add a generated synchronization signal to the received second audio signal and second video signal;

the merging module 33 is further configured to merge the received first audio signal and second audio signal according to the synchronization signals added to them, to obtain a merged audio signal; to merge the received first video signal and second video signal according to the synchronization signals added to them, to obtain a merged video signal; and to merge the merged audio signal and the merged video signal according to the synchronization signals retained in them.

The device further comprises:

a judging module 34, configured to determine whether any data packet in the first audio signal has been dropped, and, when a dropped packet is found in the first audio signal, to replace the dropped packet with a packet containing a silence signal.

The transmission combination system structural representation of a kind of vision signal that Fig. 4 provides for the embodiment of the present application and audio signal, described system comprises: the transmission of vision signal described above and audio signal merges device 41, and send the first terminal 42 of the first audio signal of remote subscriber to described device, and send the second terminal 43 of the first vision signal to described device.

Described system also comprises:

PSTN equipment 44, for receiving the first audio signal of the described remote subscriber that described first terminal sends, is sent to voice gateways by described first audio signal;

Voice gateways 45, for receiving the first audio signal of the described remote subscriber that described PSTN equipment sends, are converted to VOIP Packet Generation by described audio signal.

Embodiments provide a kind of transmission merging method of vision signal and audio signal, system and device, the method is by adding synchronizing signal in the vision signal that sends the remote subscriber that receives and audio signal, and audio signal and vision signal are merged according to this synchronizing signal when merging, thus can ensure that the vision signal that receives at synchronization and audio signal have identical synchronizing signal, when merging, can merge vision signal and audio signal according to this synchronizing signal, ensure the synchronism of vision signal and audio signal in the video and audio file after follow-up merging, thus meet the use of user.

Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to cover them.

Claims

1. A transmission and merging method for a video signal and an audio signal, characterized in that the method comprises:

Merging a first audio signal and a first video signal according to the synchronizing signal added to the first audio signal and the first video signal;

The method further comprises:

Receiving a second audio signal and a second video signal of a local user;

Adding the generated synchronizing signal to the received second audio signal and second video signal;

Merging the received first audio signal and second audio signal according to the synchronizing signal added to the first audio signal and the second audio signal to obtain a merged audio signal, and retaining the synchronizing signal; merging the received first video signal and second video signal according to the synchronizing signal added to the first video signal and the second video signal to obtain a merged video signal, and likewise retaining the synchronizing signal;

Merging the merged audio signal and the merged video signal according to the synchronizing signal retained in them.

2. the method for claim 1, is characterized in that, described synchronizing signal generated according to the clock cycle of setting.

3. The method of claim 2, characterized in that each generated synchronizing signal is added to the first audio signal and the first video signal that are received at the time that synchronizing signal is generated.

4. The method of claim 3, characterized in that, when the first video signal and the first audio signal are received, the synchronizing signal is added at the frame header or the frame trailer of each data frame of the first video signal and of the first audio signal, and the synchronizing signal is added at the same position in the first video signal and in the first audio signal: both at the frame header or both at the frame trailer.

5. The method of claim 2, characterized in that a sequence number is determined for each synchronizing signal according to the chronological order in which the synchronizing signals are generated, each synchronizing signal being assigned one sequence number.

6. the method for claim 1, is characterized in that, described by before the first audio signal and the merging of the first vision signal, described method also comprises:

Judge whether described first audio signal exists the packet be dropped;

When judging to there is the packet be dropped in described first audio signal, adopt the packet that the packet comprising mute signal replaces this to be dropped.

7. the method for claim 1, is characterized in that, described first terminal is mobile terminal or landline telephone, and the second terminal is the terminal with video signal collective and sending function.

8. the method for claim 1, is characterized in that, described remote subscriber sends the first audio signal by first terminal and comprises:

Described first terminal receives the first audio signal that described remote subscriber sends, described first audio signal is sent to common exchanging telephone network PSTN, after described first audio signal being sent to voice gateways by described PSTN, be converted to IP-based voice communication VOIP Packet Generation.

9. A transmission and merging device for a video signal and an audio signal, characterized in that the device comprises:

A merging module, configured to merge a first audio signal and a first video signal according to the synchronizing signal added to the first audio signal and the first video signal;

The receiving module, further configured to receive a second audio signal and a second video signal of a local user;

The adding module, further configured to add the generated synchronizing signal to the received second audio signal and second video signal;

The merging module, further configured to merge the received first audio signal and second audio signal according to the synchronizing signal added to the first audio signal and the second audio signal, obtaining a merged audio signal; to merge the received first video signal and second video signal according to the synchronizing signal added to the first video signal and the second video signal, obtaining a merged video signal; and to merge the merged audio signal and the merged video signal according to the synchronizing signal retained in them.

10. The device of claim 9, characterized in that the adding module is specifically configured to add each generated synchronizing signal to the first audio signal and the first video signal that are received at the time that synchronizing signal is generated.

11. The device of claim 10, characterized in that the adding module is specifically configured to add, when the first video signal and the first audio signal are received, the synchronizing signal at the frame header or the frame trailer of each data frame of the first video signal and of the first audio signal, the synchronizing signal being added at the same position in the first video signal and in the first audio signal: both at the frame header or both at the frame trailer.

12. The device of claim 9, characterized in that the device further comprises:

A judging module, configured to determine whether any data packet of the first audio signal has been dropped and, when a dropped data packet is found in the first audio signal, to replace the dropped data packet with a data packet containing a silence signal.

13. A transmission and merging system for a video signal and an audio signal, characterized in that the system comprises: the device of any one of claims 9 to 12, a first terminal that sends the first audio signal of a remote user to the device, and a second terminal that sends the first video signal to the device.

14. The system of claim 13, characterized in that the system further comprises:

Public switched telephone network (PSTN) equipment, configured to receive the first audio signal of the remote user sent by the first terminal and to send the first audio signal to a voice gateway;

A voice gateway, configured to receive the first audio signal of the remote user sent by the PSTN equipment, convert the audio signal into voice over IP (VoIP) data packets, and send them.