CN111276152A - Audio processing method, terminal and server - Google Patents

Audio processing method, terminal and server

Info

Publication number
CN111276152A
Authority
CN
China
Prior art keywords
audio
voice
audio data
server
data packet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010360286.XA
Other languages
Chinese (zh)
Inventor
高毅
陈冰
黄淳林
陈静聪
奚驰
游利为
罗程
李斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010360286.XA priority Critical patent/CN111276152A/en
Publication of CN111276152A publication Critical patent/CN111276152A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00: Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/008: Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G10L 19/04: using predictive techniques
    • G10L 19/16: Vocoder architecture
    • G10L 19/167: Audio streaming, i.e. formatting and decoding of an encoded audio signal representation into a data stream for transmission or storage purposes
    • G10L 21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003: Changing voice quality, e.g. pitch or formants
    • G10L 21/007: Changing voice quality, characterised by the process used
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208: Noise filtering
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03: characterised by the type of extracted parameters
    • G10L 25/78: Detection of presence or absence of voice signals
    • G10L 2021/02082: Noise filtering, the noise being echo or reverberation of the speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Mathematical Physics (AREA)
  • Telephonic Communication Services (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The embodiment of the application provides an audio processing method, a terminal and a server, relating to the technical field of communication. The method comprises the following steps: the terminal performs voice characteristic extraction on the collected audio signal to obtain voice characteristic information, and then sends an audio data packet containing the voice characteristic information to the server. The server obtains the voice characteristic information from the audio data packets, performs route selection based on the voice characteristic information, selects target audio signals from the encoded audio signals of the audio data packets, and performs mixing processing based on the selected target audio signals. Compared with extracting voice characteristics after the server decodes the audio data packets, the voice characteristic information obtained by the terminal through voice characteristic extraction more truly reflects the speech segments and non-speech segments of the audio, thereby improving the accuracy of the server's route selection. The server neither needs to decode all audio data packets nor to extract voice characteristics, so server resource consumption is reduced.

Description

Audio processing method, terminal and server
Technical Field
The embodiment of the application relates to the technical field of communication, in particular to an audio processing method, a terminal and a server.
Background
In a voice/video communication system where multiple persons interact, audio mixing is an indispensable step. At present, a server receives and decodes multiple audio data streams, calculates audio characteristics from the decoded audio data streams, and selects the audio data streams in which a person is speaking based on those characteristics. The selected streams are then mixed into one stream, which is re-encoded, packaged, and sent to the receiving client as a single audio data stream. Since the audio decoded by the server has generally already been processed by the terminal's audio processing chain, the audio characteristics it contains do not necessarily reflect the difference between speech segments and non-speech segments, thereby degrading the routing effect of the server.
Disclosure of Invention
The embodiment of the application provides an audio processing method, a terminal and a server, which are used for improving the routing effect of the server and reducing the consumption of the server.
In one aspect, an embodiment of the present application provides an audio processing method, including:
receiving audio data packets sent by N terminals, wherein each audio data packet comprises voice characteristic information and a coded audio signal, and the voice characteristic information is obtained after voice characteristic extraction is carried out on the collected audio signals;
selecting target audio signals sent by M terminals from the coded audio signals in each audio data packet according to the voice characteristic information corresponding to the coded audio signals in each audio data packet, wherein M is a positive integer smaller than N;
and performing sound mixing processing based on the target audio signals sent by the M terminals.
In one aspect, an embodiment of the present application provides an audio processing method, including:
carrying out voice feature extraction on the collected audio signals to obtain voice feature information;
performing automatic gain control and encoding on the acquired audio signal, and then packaging the encoded audio signal together with the voice characteristic information to obtain an audio data packet;
and sending the audio data packets to a server so that the server selects target audio signals sent by M terminals from the encoded audio signals in each audio data packet according to the voice characteristic information corresponding to the encoded audio signals in the audio data packets sent by the N terminals, wherein M is a positive integer smaller than N, and the audio mixing processing is carried out based on the target audio signals sent by the M terminals.
In one aspect, an embodiment of the present application provides a server, including:
the receiving module is used for receiving audio data packets sent by the N terminals, each audio data packet comprises voice characteristic information and a coded audio signal, and the voice characteristic information is obtained after voice characteristic extraction is carried out on the collected audio signals;
the screening module is used for selecting target audio signals sent by M terminals from the coded audio signals in each audio data packet according to the voice characteristic information corresponding to the coded audio signals in each audio data packet, wherein M is a positive integer smaller than N;
and the first processing module is used for carrying out sound mixing processing on the basis of the target audio signals sent by the M terminals.
Optionally, the voice feature information at least includes audio energy, and the audio energy is energy of voice in the audio signal.
Optionally, the voice feature information further includes a voice flag, where the voice flag is obtained by performing voice detection on the acquired audio signal by the terminal and sending the audio signal to the server, or is obtained by performing voice detection on the received encoded audio signal by the server.
Optionally, the first processing module is specifically configured to:
sequentially decoding, mixing and encoding the target audio signals sent by the M terminals to obtain a first mixed audio data packet, and respectively sending the first mixed audio data packet to each other terminal except the M terminals;
and respectively aiming at each target terminal in the M terminals, sequentially decoding, mixing and encoding the target audio signals sent by the M-1 terminals except the target audio signal sent by the target terminal to obtain a second mixed audio data packet, and sending the second mixed audio data packet to the target terminal.
Optionally, the first mixed audio data packet further includes voice tags corresponding to M target audio signals, and the second mixed audio data packet further includes voice tags corresponding to M-1 target audio signals.
Optionally, the first processing module is specifically configured to:
and sequentially decoding, mixing and encoding the target audio signals sent by the M terminals to obtain mixed audio data packets, and respectively sending the mixed audio data packets to each terminal connected with the server in a call mode, wherein the mixed audio data packets further comprise voice signs corresponding to the M target audio signals.
In one aspect, an embodiment of the present application provides a terminal, including:
the characteristic extraction module is used for extracting voice characteristics of the collected audio signals to obtain voice characteristic information;
the second processing module is used for performing automatic gain control and encoding on the collected audio signal, and then packaging the encoded audio signal together with the voice characteristic information to obtain an audio data packet;
the sending module is used for sending the audio data packets to a server so that the server selects target audio signals sent by M terminals from the coded audio signals in each audio data packet according to voice characteristic information corresponding to the coded audio signals in the audio data packets sent by the N terminals, wherein M is a positive integer smaller than N; and performing sound mixing processing based on the target audio signals sent by the M terminals.
In one aspect, an embodiment of the present application provides an audio processing system, including:
a server and N terminals;
each terminal of the N terminals is used for extracting voice characteristics of the collected audio signals to obtain voice characteristic information; performing automatic gain control and encoding on the collected audio signal, and then packaging the encoded audio signal together with the voice characteristic information to obtain an audio data packet, and sending the audio data packet to the server;
the server is used for selecting target audio signals sent by M terminals from the coded audio signals in each audio data packet according to the voice characteristic information corresponding to the coded audio signals in the audio data packets sent by the N terminals, wherein M is a positive integer smaller than N, and the audio mixing processing is carried out based on the target audio signals sent by the M terminals.
In one aspect, an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the audio processing method when executing the program.
In one aspect, embodiments of the present application provide a computer-readable storage medium, which stores a computer program executable by a computer device, and when the program runs on the computer device, the computer device is caused to execute the steps of the audio processing method.
In the embodiment of the application, the terminal extracts the voice characteristics of the collected audio signals to obtain the voice characteristic information, then sends the audio data packet containing the voice characteristic information to the server, and the server directly obtains the voice characteristic information from the received audio data packet and selects a route based on the voice characteristic information.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
FIG. 1 is a schematic diagram of an audio processing method in the prior art;
FIG. 2 is a schematic diagram of a system architecture diagram according to an embodiment of the present application;
fig. 3a is a schematic diagram of an interface of a social application provided in an embodiment of the present application;
FIG. 3b is a schematic diagram of an interface of a social application provided in an embodiment of the present application;
fig. 4a is a schematic diagram of an interface of an office application provided in an embodiment of the present application;
fig. 4b is a schematic diagram of an interface of an office application provided in an embodiment of the present application;
fig. 5 is a schematic diagram illustrating a flow of an audio processing method according to an embodiment of the present application;
fig. 6 is a schematic diagram of an audio mixing method according to an embodiment of the present application;
fig. 7 is a schematic diagram of an audio mixing method according to an embodiment of the present application;
fig. 8 is a schematic diagram of an audio mixing method according to an embodiment of the present application;
fig. 9 is a schematic diagram of an audio buffering method according to an embodiment of the present application;
fig. 10 is a schematic diagram of an audio processing method according to an embodiment of the present application;
fig. 11a is a schematic diagram of an interface of an office application provided in an embodiment of the present application;
fig. 11b is a schematic diagram of an interface of an office application provided in an embodiment of the present application;
fig. 12 is a schematic diagram of a process for generating an audio data packet according to an embodiment of the present application;
fig. 13 is a schematic diagram of an audio processing method according to an embodiment of the present application;
fig. 14 is a schematic diagram of an audio processing method according to an embodiment of the present application;
fig. 15 is a schematic structural diagram of a server according to an embodiment of the present application;
fig. 16 is a schematic structural diagram of a terminal according to an embodiment of the present application;
fig. 17 is a schematic structural diagram of an audio processing system according to an embodiment of the present application;
fig. 18 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the purpose, technical solution and beneficial effects of the present application more clear and more obvious, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
For convenience of understanding, terms referred to in the embodiments of the present application are explained below.
Cloud technology (Cloud technology): a general term for the network technology, information technology, integration technology, management platform technology, application technology and the like applied under the cloud computing business model; it allows resources to be pooled and used on demand, flexibly and conveniently. Cloud computing technology will become an important support. Background services of technical network systems, such as video websites, picture websites and other web portals, require a large amount of computing and storage resources. With the rapid development and application of the internet industry, each article may come to have its own identification mark that must be transmitted to a background system for logical processing; data at different levels are processed separately, and all kinds of industry data need strong system background support, which can only be realized through cloud computing.
Cloud conference: the cloud conference is an efficient, convenient and low-cost conference form based on cloud computing technology. A user only needs to perform simple, easy-to-use operations through an internet interface to quickly and efficiently share voice, data files and video with teams and clients all over the world, while the cloud conference service provider handles complex technologies such as the transmission and processing of conference data. At present, domestic cloud conferences mainly focus on service content in the Software as a Service (SaaS) mode, including service forms such as telephone, network and video; a video conference based on cloud computing is called a cloud conference. For example, in the embodiment of the present application, a multi-person voice or video conference may be conducted based on cloud computing.
In the cloud conference era, data transmission, processing and storage are all processed by computer resources of video conference manufacturers, users do not need to purchase expensive hardware and install complicated software, and efficient teleconferencing can be performed only by opening a browser and logging in a corresponding interface.
The cloud conference system supports multi-server dynamic cluster deployment and provides multiple high-performance servers, which greatly improves conference stability, security and usability. In recent years, video conferencing has been welcomed by many users because it greatly improves communication efficiency, continuously reduces communication cost and upgrades internal management, and it is widely used in various fields such as government, military, transportation, finance, operators, education and enterprises. There is no doubt that with the use of cloud computing, video conferencing becomes still more attractive in its convenience, speed and ease of use, which is bound to stimulate a new wave of video conference applications.
Automatic gain control: automatic Gain Control (AGC) refers to an Automatic Control method for automatically adjusting the Gain of an amplifier circuit according to signal intensity. The circuit that implements this function is referred to as an AGC loop. The AGC loop is a closed-loop electronic circuit, which is a negative feedback system, and can be divided into two parts, a gain-controlled amplifying circuit and a control voltage forming circuit. The gain controlled amplifying circuit is located in the forward amplifying path, and the gain of the gain controlled amplifying circuit is changed along with the control voltage. The basic components of the control voltage forming circuit are an AGC detector and a low-pass smoothing filter, and may include components such as a gate circuit and a dc amplifier.
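By way of illustration only, the negative-feedback behaviour described above can also be sketched digitally. The following is a minimal Python sketch of one AGC iteration, in which a smoothed level estimate plays the role of the control voltage and a multiplicative gain plays the role of the gain-controlled amplifier; the target level, smoothing factor and frame size are assumed values, not parameters from the application.

```python
import numpy as np

def agc_step(frame, state, target_rms=0.1, attack=0.1):
    """One AGC iteration over a frame of float samples in [-1, 1].

    The smoothed level estimate in `state` plays the role of the
    control voltage; the multiplicative gain plays the role of the
    gain-controlled amplifier in the forward path.
    """
    # Detector: measure the level of the current frame (RMS).
    level = float(np.sqrt(np.mean(frame ** 2))) + 1e-9
    # Low-pass smoothing of the detected level (the feedback path).
    state["level"] = (1.0 - attack) * state["level"] + attack * level
    # Gain that steers the smoothed level toward the target level.
    gain = target_rms / state["level"]
    return np.clip(frame * gain, -1.0, 1.0), state

state = {"level": 0.1}
quiet_frame = np.random.uniform(-0.01, 0.01, 160)  # 10 ms at 16 kHz
louder, state = agc_step(quiet_frame, state)       # gain is raised
```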
At present, in a voice/video communication system with multi-person interaction, since several persons may speak at the same time, mixing is an essential step if a person participating in the conference is to clearly hear the voices of the other speakers. Currently, a server receives audio data packets sent by a plurality of terminals and decodes them respectively, then calculates voice characteristics from the decoded audio signals, and selects the audio data packets that include speech segments based on those characteristics. The selected audio data packets are then mixed into a mixed audio data packet and sent to the receiving terminal. Exemplarily, as shown in fig. 1, suppose 5 terminals are in call connection with the server: the server receives audio data packets sent by terminals 1 to 4, and a decoding module in the server decodes the received audio data packets to obtain the audio signals corresponding to terminals 1 to 4. An analysis module in the server calculates voice characteristics from the decoded audio signals, obtains voice characteristics 1 to 4, and sends them to a routing module in the server. The routing module in the server performs route selection over the audio signals respectively sent by terminals 1 to 4 according to the voice characteristics. Assuming the routing result is the audio signals sent by terminal 1, terminal 2 and terminal 4, the routing result is sent to the mixing module of the server. The mixing module in the server mixes the audio signals sent by terminal 1, terminal 2 and terminal 4 according to the routing result to obtain a mixed audio signal. The encoding module in the server encodes the mixed audio signal and then transmits the encoded mixed audio signal to terminal 5.
Since the server decodes all received audio data streams and calculates voice features before performing routing, mixing and encoding, the decoding and voice feature calculation both consume a large amount of Central Processing Unit (CPU) resources, which results in excessive load on the server. Secondly, the audio decoded by the server has generally already been processed by the audio processing chain of the terminal; for example, after the terminal performs Automatic Gain Control (AGC) on the audio, the non-speech segments are relatively amplified, and when the server performs voice feature extraction on such audio, the obtained voice features cannot truly reflect the characteristics of the non-speech segments in the audio, thereby degrading the routing effect of the server. In view of this, in the embodiment of the present application, before the terminal processes the acquired audio signal, the terminal performs voice feature extraction on the acquired audio signal to obtain voice feature information; the terminal then performs automatic gain control and encoding on the acquired audio signal, packages the encoded audio signal together with the voice feature information to obtain an audio data packet, and sends the audio data packet to the server. After receiving the audio data packets sent by the N terminals, the server obtains the voice feature information from each audio data packet, and then selects the target audio signals sent by M terminals from the encoded audio signals in each audio data packet according to the voice feature information corresponding to the encoded audio signals in each audio data packet, where M is a positive integer smaller than N. Mixing is then performed based on the target audio signals sent by the M terminals.
The terminal extracts voice characteristics of the collected audio signals to obtain voice characteristic information, and then sends audio data packets containing the voice characteristic information to the server, and the server directly obtains the voice characteristic information from the received audio data packets and selects a route based on the voice characteristic information.
Referring to fig. 2, it is a diagram of a system architecture applicable to the embodiment of the present application, the system architecture at least includes N terminals 101 and a server 102. The N terminals 101 are terminals 101-1 to 101-N shown in FIG. 2, where N is a positive integer, and the value of N is not limited in this embodiment.
An application program for multi-person conversation may be installed in the terminal 101, and the application program may be a social application program, an office application program, or the like, and a user may use the application program to realize multi-person conversation.
Illustratively, suppose the terminal has a social application program installed in advance. The user first establishes the chat group XXXX in the social application program and then clicks the "voice call" icon in the chat group interface, as shown in fig. 3 a. The user then selects a plurality of members from the chat group to initiate a voice chat. When a selected member receives the voice chat request, the member clicks the answer icon displayed on the terminal to join the voice chat; the multi-user voice chat interface is specifically shown in fig. 3 b. A member who has joined the voice chat can click the video icon to open the camera and switch the voice chat to a video chat.
Illustratively, suppose the terminal has an office application installed in advance, and the interface of the office application includes a "join meeting" icon, a "quick meeting" icon and a "schedule meeting" icon, as specifically shown in fig. 4 a. User A may, as the meeting initiator, click the "quick meeting" icon in the interface of the office application to enter the conference interface. The conference interface includes a conference number, a "mute" icon, an "open video" icon, an "administrator" icon, an "end" icon, and the like. User A may invite people to participate in the conference; suppose user A invites user B and user C to the conference. When an invited person receives the conference request, the person clicks the answer icon displayed on the terminal to agree to join the voice conference; an interface of the multi-person voice conference is specifically shown in fig. 4 b. The members participating in the conference can click the "open video" icon to turn on the camera and switch the voice conference to a video conference. The conference initiator may click the "end" icon to end the voice or video conference.
In addition, the terminal 101 may also be equipped with a browser, and the user enters a web page for multi-person conversation using the browser, and then uses the web page to realize multi-person conversation. The multi-person call may be a video call or a voice call, which is not specifically limited in the embodiment of the present application. The terminal 101 may include one or more processors 1011, memory 1012, an I/O interface 1013 for interacting with the server 102, and a display panel 1014, among other things. The memory 1012 of the terminal 101 may store program instructions for audio processing, which when executed by the processor 1011 can be used to process audio and display an interface for multi-person conversation on the display panel 1014. The terminal 101 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like.
In the specific implementation, each terminal 101 exchanges information with the server 102 through signaling, the server 102 establishes connection with the terminal 101 after authentication of the terminal 101 is passed, and then the terminal 101 sends an audio data packet to the server 102. The server 102 is configured to perform unpacking, decoding, routing, mixing, encoding, and packing on audio data packets sent by multiple terminals to obtain a mixed audio data packet, and then send the mixed audio data packet to the terminal 101 in call connection with the server 102. In addition, the server 102 monitors the call process during the call, for example, monitors whether a new terminal is accessed or exited, and releases resources by performing data connection and signaling connection with the terminal 101 when the call is ended. The server 102 may include one or more processors 1021, memory 1022, and an I/O interface 1023 to interact with the terminal 101, among other things. Signaling interaction functions, authentication functions, call control functions, unpacking functions, decoding functions, routing functions, mixing functions, encoding functions, and packing functions of the server 102 may be implemented on the one or more processors 1021. In addition, server 102 may also configure database 1024. The server 102 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like. The terminal 101 and the server 102 may be directly or indirectly connected through wired or wireless communication, and the present application is not limited thereto.
Based on the system architecture diagram shown in fig. 2, an embodiment of the present application provides a flow of an audio processing method, as shown in fig. 5, where the flow of the method is executed by a terminal and a server interactively, and the method includes the following steps:
step S501, the terminal extracts voice characteristics of the collected audio signals to obtain voice characteristic information.
Specifically, the terminal collects an audio signal through a microphone, then performs echo cancellation and denoising on the audio signal, and then performs voice feature extraction on the audio signal from which the echo and the noise are removed to obtain voice feature information.
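A minimal Python sketch of this terminal-side step follows, assuming the voice characteristic information consists of a frame energy and an energy-threshold voice flag; the concrete feature set and the threshold value are illustrative assumptions, since the application does not fix them here.

```python
import numpy as np

def extract_voice_features(frame, energy_threshold=1e-4):
    """Voice feature extraction on one cleaned frame.

    The frame is assumed to have already passed echo cancellation and
    denoising but NOT automatic gain control, so its energy still
    separates speech segments from non-speech segments.  The energy
    measure and the threshold-based voice flag are illustrative
    assumptions.
    """
    audio_energy = float(np.mean(frame ** 2))                 # audio energy
    voice_flag = 1 if audio_energy > energy_threshold else 0  # VAD flag
    return {"energy": audio_energy, "vad": voice_flag}
```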
Step S502, the terminal performs automatic gain control and encoding on the collected audio signal, and then packs the encoded audio signal together with the voice characteristic information to obtain an audio data packet.
In step S503, the terminal sends the audio data packet to the server.
Specifically, the terminal performs automatic gain control on the acquired audio signal to adjust it to a suitable volume, then performs compression coding on the audio signal, for example using vocoders such as G.729, Silk or Opus, to obtain an encoded audio signal, and then packages the encoded audio signal together with the voice characteristic information to obtain an audio data packet. In one possible implementation, the voice characteristic information is embedded in a specified field in the Real-time Transport Protocol (RTP) packet header, and the packaged audio data packet is then sent to the server over RTP.
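The following Python sketch illustrates one way such a packet could be assembled, assuming a simplified RTP-style layout in which the voice characteristic information rides in a one-word header extension; the field sizes, payload type and energy quantisation are assumptions for illustration, not the exact format used by the application.

```python
import struct

def build_audio_packet(seq, timestamp, ssrc, payload, energy, voice_flag):
    """Pack one encoded frame plus its voice characteristic information
    into a simplified RTP-style packet.  The layout (a one-word RTP
    header extension holding a quantised energy and the voice flag,
    payload type 111) is an illustrative assumption."""
    # Fixed 12-byte RTP header: V=2, X=1 (extension present).
    first_byte = (2 << 6) | (1 << 4)
    header = struct.pack("!BBHII", first_byte, 111, seq, timestamp, ssrc)
    # Header extension: profile id, length in 32-bit words, then one
    # word carrying a 16-bit quantised energy and a 16-bit voice flag.
    q_energy = max(0, min(int(energy * 65535), 65535))
    extension = struct.pack("!HHHH", 0xBEDE, 1, q_energy, voice_flag)
    return header + extension + payload
```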
When the user turns off the microphone of the terminal or the energy of the audio signal detected by the terminal is less than the preset threshold, the terminal may not send the audio data packet to the server. The terminal may also send the audio data packet to the server all the time, and this is not limited in this embodiment of the present application.
Step S504, the server receives the audio data packets sent by the N terminals.
Specifically, each audio data packet includes voice feature information and a coded audio signal, the voice feature information is obtained by a terminal after voice feature extraction is performed on the acquired audio signal, and the N terminals are all or part of terminals in call connection with the server. If the user closes the microphone of the terminal or the energy of the audio signal detected by the terminal is less than the preset threshold value, the terminal does not send the audio data packet to the server, and the server may receive the audio data packet sent by a part of terminals in call connection with the server. And if the terminal always sends the audio data packet to the server, the server receives the audio data packets sent by all terminals in call connection with the server.
Step S505, the server selects, from the encoded audio signals in each audio data packet, target audio signals sent by M terminals according to the speech feature information corresponding to the encoded audio signals in each audio data packet.
Specifically, M is a positive integer smaller than N, the voice feature information is used for indicating whether voice exists in the audio signals, and the server selects target audio signals with voice sent by M terminals from the coded audio signals in each audio data packet according to the voice feature information.
In step S506, the server performs mixing processing based on the target audio signals sent by the M terminals.
In one possible embodiment, when M is greater than or equal to the number of terminals that transmit audio packets to the server, the server performs mixing processing based on the received audio signals encoded in the audio packets transmitted by all the terminals.
Illustratively, setting M to 5, and 10 terminals in call connection with the server, where 7 terminals have their microphones turned off, and 3 terminals transmit audio data packets to the server, the server generates and transmits a mixed audio data packet for each of the 10 terminals in call connection with the server based on the received encoded audio signals in the audio data packets transmitted by the 3 terminals.
The terminal extracts voice characteristics of the collected audio signals to obtain voice characteristic information, and then sends audio data packets containing the voice characteristic information to the server, and the server directly obtains the voice characteristic information from the received audio data packets and selects a route based on the voice characteristic information.
Optionally, in step S505, the voice feature information at least includes audio energy, where the audio energy may be energy of voice in the audio signal, or energy of the audio signal, and generally, the audio energy of a voice segment is greater than that of a non-voice segment.
Optionally, the voice feature information further includes a voice flag, where the voice flag is obtained by performing voice detection on the collected audio signal by the terminal and sending the audio signal to the server, or is obtained by performing voice detection on the received encoded audio signal by the server.
Specifically, the voice flag is a Voice Activity Detection (VAD) flag, and the VAD flag is obtained by analyzing and calculating an audio coding parameter and a feature value thereof, and then determining whether a voice signal exists in a current audio signal by using a preset logic judgment criterion. When the terminal generates the voice mark, the terminal can automatically perform voice detection on the acquired audio signal to obtain the voice mark, or perform voice detection on the acquired audio signal to obtain the voice mark after receiving an instruction input by a user, and then send the audio energy and the voice mark as voice characteristic information to the server. After the server receives the audio data packet, an unpacking module in the server unpacks the audio data packet to obtain a coded audio signal, audio energy and a voice mark. When the server generates the voice mark, after the server receives the audio data packet, an unpacking module in the server unpacks the audio data packet to obtain a coded audio signal and audio energy, and then voice detection is carried out on the coded audio signal to obtain the voice mark.
When a server selects target audio data packets sent by M terminals from each audio data packet according to speech feature information corresponding to a coded audio signal in each audio data packet, embodiments of the present application provide at least the following several implementation manners:
in a possible implementation manner, the voice characteristic information corresponding to the encoded audio signal includes audio energy and a voice flag. The server screens out, from the encoded audio signals in each audio data packet, the audio signals whose voice flags indicate voice, then sorts the screened audio signals by audio energy from large to small, and determines the audio signals ranked in the top M as the target audio signals.
Specifically, the voice flag takes two values, voice and no voice; in a specific implementation, 1 may be used to indicate voice and 0 to indicate no voice. The server first screens, from the encoded audio signals in each audio data packet, the audio signals whose voice flags indicate voice. When the number of screened audio signals is greater than M, the screened audio signals are sorted by audio energy from large to small, and the audio signals ranked in the top M are determined as the target audio signals. When the number of screened audio signals is not more than M, all screened audio signals are determined as target audio signals; in addition, from the encoded audio signals whose voice flags indicate no voice, the server may select as further target audio signals those whose corresponding terminals have recently sent voice, so that the number of target audio signals reaches M.
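A minimal Python sketch of this selection logic follows; `recently_active` stands in as an assumed record of terminals that have recently sent voice.

```python
def select_targets(packets, M, recently_active=frozenset()):
    """Route selection sketch.  Each packet is a dict such as
    {"terminal": 1, "energy": 0.02, "vad": 1}."""
    # Screen out the signals whose voice flag indicates voice.
    voiced = [p for p in packets if p["vad"] == 1]
    # Sort by audio energy from large to small and keep the top M.
    voiced.sort(key=lambda p: p["energy"], reverse=True)
    targets = voiced[:M]
    # Top up from recently active terminals if fewer than M are voiced.
    if len(targets) < M:
        backup = [p for p in packets
                  if p["vad"] == 0 and p["terminal"] in recently_active]
        targets += backup[:M - len(targets)]
    return targets
```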
Illustratively, setting M to 5, the server receives audio data packets respectively sent by the terminal 1 to the terminal 10, which are respectively the audio data packet 1 to the audio data packet 10, where a voice flag in 7 audio data packets is 1, which are respectively the audio data packet 1 to the audio data packet 7, and a voice flag in 3 audio data packets is 0, which are respectively the audio data packet 8 to the audio data packet 10. The server screens 7 encoded audio signals with corresponding voice marks as having voices from the encoded audio signals in the 10 audio data packets, and the signals are respectively encoded audio signals 1 to 7. Then, according to the sequence of the audio energy from large to small, the coded audio signal 1 to the coded audio signal 7 are sequenced, and the sequencing result is as follows: encoded audio signal 1, encoded audio signal 2, encoded audio signal 7, encoded audio signal 5, encoded audio signal 3, encoded audio signal 6, encoded audio signal 4. Taking the coded audio signal in the top 5 as the target audio signal, respectively: encoded audio signal 1, encoded audio signal 2, encoded audio signal 7, encoded audio signal 5, encoded audio signal 3.
Illustratively, setting M to 5, the server receives the audio data packets respectively sent by terminal 1 to terminal 10, namely audio data packet 1 to audio data packet 10, where the voice flag in 3 audio data packets is 1, namely audio data packet 1 to audio data packet 3, and the voice flag in 7 audio data packets is 0, namely audio data packet 4 to audio data packet 10. The server screens, from the encoded audio signals in the 10 audio data packets, the 3 encoded audio signals whose voice flags indicate voice, namely encoded audio signal 1, encoded audio signal 2 and encoded audio signal 3. Since the number of screened encoded audio signals is less than 5, all 3 screened encoded audio signals are taken as target audio signals. If the encoded audio signals transmitted by terminal 4 and terminal 6 and received by the server in the past 1 minute included speech, the encoded audio signal 4 and encoded audio signal 6 transmitted this time by terminal 4 and terminal 6 are also taken as target audio signals. Combining the voice flag with the audio energy for route selection takes more comprehensive voice characteristics into account, thereby improving the accuracy of route selection.
In a possible implementation manner, the voice characteristic information corresponding to the encoded audio signal is the audio energy. The server sorts the encoded audio signals in the audio data packets by audio energy from large to small and determines the encoded audio signals in the top M audio data packets as the target audio signals.
Illustratively, setting M to 5, the server receives audio data packets respectively sent by the terminals 1 to 10, namely, the audio data packets 1 to 10. The server sorts the encoded audio signals in the audio data packets 1 to 10 according to the sequence of the audio energy from large to small, and the result of the sorting is as follows: encoded audio signal 1, encoded audio signal 2, encoded audio signal 10, encoded audio signal 9, encoded audio signal 7, encoded audio signal 8, encoded audio signal 5, encoded audio signal 3, encoded audio signal 6, encoded audio signal 4. Determining the coded audio signal ranked at the top 5 as a target audio signal, respectively: encoded audio signal 1, encoded audio signal 2, encoded audio signal 10, encoded audio signal 9, encoded audio signal 7. Because the audio energy of the voice section is larger than that of the non-voice section, the audio signals are sequenced based on the audio energy, and the audio signals with voice can be effectively screened out.
Optionally, in step S506, the server performs mixing processing based on the target audio signals sent by the M terminals, which specifically includes the following several embodiments:
in one possible implementation, the target audio signals sent by M terminals are sequentially decoded, mixed and encoded to obtain a first mixed audio data packet, and the first mixed audio data packet is sent to each other terminal except the M terminals. And respectively aiming at each target terminal in the M terminals, sequentially decoding, mixing and encoding the target audio signals sent by the M-1 terminals except the target audio signal sent by the target terminal to obtain a second mixed audio data packet, and sending the second mixed audio data packet to the target terminal.
Specifically, the target audio signals sent by the M terminals are decoded to obtain M audio digital signals, and optionally, the audio digital signals may be Pulse Code Modulation (PCM) digital signals. Then mixing the M audio digital signals to obtain a first mixed audio digital signal, and then coding the first mixed audio digital signal to obtain a first mixed audio data packet. And respectively decoding target audio signals sent by other M-1 terminals except the target audio signal sent by the target terminal aiming at each target terminal in the M terminals to obtain M-1 audio digital signals, mixing the M-1 audio digital signals to obtain a second mixed audio digital signal, and then coding the second mixed audio digital signal to obtain a second mixed audio data packet.
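A minimal Python sketch of this decode-mix-encode step, assuming float PCM arrays of equal length, at least two target signals, and an external `encode` callback standing in for the vocoder; naive summation with clipping is used as the mixing rule for illustration only.

```python
import numpy as np

def mix(pcm_streams):
    """Naive mixer: sum equal-length float PCM arrays and clip."""
    return np.clip(np.sum(pcm_streams, axis=0), -1.0, 1.0)

def build_mixes(decoded, encode):
    """decoded maps terminal id -> PCM array for the M target signals
    (M >= 2 assumed); `encode` stands in for the vocoder.  Returns the
    first mixed packet (all M streams, for the non-selected terminals)
    and one second mixed packet per target terminal that excludes that
    terminal's own signal, so a speaker never hears himself."""
    first_mix = encode(mix(list(decoded.values())))
    second_mixes = {
        tid: encode(mix([pcm for t, pcm in decoded.items() if t != tid]))
        for tid in decoded
    }
    return first_mix, second_mixes
```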
Illustratively, M is set to be 3, 5 terminals in call connection with the server are respectively terminal 1 to terminal 5, where terminal 1 to terminal 4 respectively send audio data packets to the server, and the audio data packets include speech feature information and encoded audio signals. The terminal 5 turns off the microphone and does not send audio data packets to the server. The server receives audio data packets sent by the terminal 1 to the terminal 4, respectively, and the routing module in the server selects 3 target audio signals with voices from the coded audio signals in the 4 audio data packets according to the voice feature information in the 4 audio data packets, wherein the target audio signals are the coded audio signals sent by the terminal 1, the terminal 2 and the terminal 4, respectively. The decoding module in the server decodes the selected 3 target audio signals respectively to obtain 3 PCM digital signals, which are respectively a PCM digital signal 1, a PCM digital signal 2 and a PCM digital signal 4.
The audio mixing module in the server mixes the decoded 3 PCM digital signals to obtain a first mixed PCM digital signal, the encoding module in the server encodes the first mixed PCM digital signal to obtain a first mixed audio data packet, and the first mixed audio data packet is sent to the terminal 3 and the terminal 5, as shown in fig. 6 specifically.
A mixing module in the server mixes the PCM digital signal 2 and the PCM digital signal 4 to obtain a second mixed PCM digital signal corresponding to the terminal 1, a coding module in the server codes the second mixed PCM digital signal corresponding to the terminal 1 to obtain a second mixed audio data packet corresponding to the terminal 1, and the second mixed audio data packet corresponding to the terminal 1 is sent to the terminal 1, as shown in fig. 7.
A mixing module in the server mixes the PCM digital signal 1 and the PCM digital signal 4 to obtain a second mixed PCM digital signal corresponding to the terminal 2, a coding module in the server codes the second mixed PCM digital signal corresponding to the terminal 2 to obtain a second mixed audio data packet corresponding to the terminal 2, and sends the second mixed audio data packet corresponding to the terminal 2 to the terminal 2, as shown in fig. 7.
A mixing module in the server mixes the PCM digital signal 1 and the PCM digital signal 2 to obtain a second mixed PCM digital signal corresponding to the terminal 4, a coding module in the server codes the second mixed PCM digital signal corresponding to the terminal 4 to obtain a second mixed audio data packet corresponding to the terminal 4, and sends the second mixed audio data packet corresponding to the terminal 4 to the terminal 4, as shown in fig. 7.
Because different mixed audio data packets are generated and sent for different terminals, the terminals which do not participate in audio mixing can receive audio signals sent by all the terminals which participate in audio mixing, and the terminals which participate in audio mixing receive audio signals except the audio signals sent by the terminals, so that the users which do not speak can hear the voices of other speaking users, and the phenomenon of echo caused by the speaking users hearing own voices is avoided.
In one possible implementation, target audio signals sent by M terminals are sequentially decoded, mixed and encoded to obtain mixed audio data packets, and the mixed audio data packets are sent to each terminal in call connection with the server.
Illustratively, as shown in fig. 8, M is set to be 3, and 5 terminals in call connection with the server are respectively terminal 1 to terminal 5, where terminal 1 to terminal 4 respectively send audio data packets to the server, and the audio data packets include speech feature information and encoded audio signals. The terminal 5 turns off the microphone and does not send audio data packets to the server. The server receives audio data packets sent by the terminal 1 to the terminal 4, respectively, and the routing module in the server selects 3 target audio signals with voices from the coded audio signals in the 4 audio data packets according to the voice feature information in the 4 audio data packets, wherein the target audio signals are the coded audio signals sent by the terminal 1, the terminal 2 and the terminal 4, respectively. The decoding module in the server decodes the selected 3 target audio signals respectively to obtain 3 PCM digital signals, which are respectively a PCM digital signal 1, a PCM digital signal 2 and a PCM digital signal 4. And a sound mixing module in the server mixes the decoded 3 PCM digital signals to obtain mixed PCM digital signals, and an encoding module in the server encodes the mixed PCM digital signals to obtain mixed audio data packets, and sends the mixed audio data packets to the terminals 1 to 5.
Optionally, the server further includes a packet cache module and an audio cache module, where the packet cache module in the server is configured to cache the target audio signal selected by the routing module in the server, and the maximum cache number may be a fixed number, or may be automatically adjusted according to a network load condition. The audio buffer module in the server is used for buffering the audio digital signals obtained after decoding by the decoding module in the server, and the maximum buffer amount can be a fixed amount or can be automatically adjusted according to the network load condition.
Illustratively, as shown in fig. 9, M is set to be 3, and 5 terminals in call connection with the server are respectively terminal 1 to terminal 5, where terminal 1 to terminal 4 respectively send audio data packets to the server, and the audio data packets include speech feature information and encoded audio signals. The terminal 5 turns off the microphone and does not send audio data packets to the server. The server receives audio data packets sent by the terminal 1 to the terminal 4, respectively, and the routing module in the server selects 3 target audio signals with voices from the coded audio signals in the 4 audio data packets according to the voice feature information in the 4 audio data packets, wherein the target audio signals are the coded audio signals sent by the terminal 1, the terminal 2 and the terminal 4, respectively. And a packet buffer module in the server buffers the coded audio signals sent by the terminal 1, the terminal 2 and the terminal 4. The decoding module in the server decodes the selected 3 target audio signals respectively to obtain 3 PCM digital signals, which are respectively a PCM digital signal 1, a PCM digital signal 2 and a PCM digital signal 4. And an audio caching module in the server caches the PCM digital signal 1, the PCM digital signal 2 and the PCM digital signal 4. And the audio mixing module in the server mixes the decoded 3 PCM digital signals to obtain mixed PCM digital signals, and the coding module in the server codes the mixed PCM digital signals to obtain mixed audio data packets. The audio signals before and after decoding are buffered by the packet buffer module and the audio buffer module, so that on one hand, the audio data packet is convenient to recover when packet loss occurs, on the other hand, network jitter and delay can be effectively resisted, and the audio sounds more continuous.
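By way of illustration, both caches can be sketched as bounded FIFO buffers; the maximum of 50 entries below is an assumed value, and an implementation that adapts the maximum cache number to network load would resize the buffer dynamically.

```python
from collections import deque

class BoundedAudioCache:
    """Bounded FIFO usable as either the packet cache (encoded target
    signals) or the audio cache (decoded PCM digital signals)."""
    def __init__(self, max_items=50):
        self.buf = deque(maxlen=max_items)

    def push(self, item):
        self.buf.append(item)  # the oldest entry drops off when full

    def recent(self, n=1):
        """Return up to the n most recent entries, e.g. to recover
        after packet loss or to smooth over jitter and delay."""
        return list(self.buf)[-n:]
```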
Since the server mixes each target audio signal into one mixed audio signal, a user at a terminal side receiving the mixed audio signal cannot distinguish which terminal the voice in the mixed audio signal comes from, and in order to facilitate the user to distinguish the source of the voice in the mixed audio signal, the mixed audio data packet in the embodiment of the application further includes a voice flag corresponding to the target audio signal.
Specifically, the server receives audio data packets sent by the N terminals, and an unpacking module in the server unpacks each received audio data packet to obtain encoded audio signals and voice feature information. When the received voice characteristic information comprises the audio energy and the voice mark, an unpacking module in the server sends the audio energy and the voice mark to a routing module in the server. When the received voice characteristic information comprises audio energy and does not comprise a voice mark, an unpacking module in the server sends the coded audio signal to an analysis module in the server, the analysis module in the server carries out voice detection on the coded audio signal to obtain the voice mark, and then the voice mark is sent to a routing module in the server. An unpacking module in the server sends the audio energy to a routing module in the server.
And the routing module in the server selects target audio signals sent by the M terminals from the coded audio signals in each audio data packet based on the voice characteristic information, and simultaneously sends the voice mark corresponding to each target audio signal to the packing module of the server. And a decoding module in the server decodes the M target audio signals to obtain M PCM digital signals. And a sound mixing module and an encoding module in the server carry out sound mixing and encoding on the M PCM digital signals to obtain an encoded first mixed audio signal. And a sound mixing module and an encoding module in the server respectively aim at each target terminal in the M terminals, and sequentially perform sound mixing and encoding on the PCM digital signals corresponding to the M-1 terminals except the PCM digital signal corresponding to the target terminal to obtain an encoded second mixed audio signal.
And a packaging module of the server packages the coded first mixed audio signal and the voice marks corresponding to the M target audio signals to obtain a first mixed audio data packet. And a packaging module of the server packages the coded second mixed audio signal and the voice marks corresponding to the M-1 target audio signals to obtain a second mixed audio data packet. The packetizing module of the server may specifically add the voice flag to a specified field in the RTP packet header. After the terminal receives the mixed audio data packet, the user at the speaking terminal side is determined according to the voice mark, and then the head portrait of the speaking user is displayed in a highlight or jumping mode.
Exemplarily, as shown in fig. 10, M is set to be 3, and 5 terminals in call connection with the server are respectively terminal 1 to terminal 5, where the terminals 1 to 4 respectively transmit audio data packets to the server, the audio data packets include speech feature information and encoded audio signals, and the terminal 5 turns off the microphone and does not transmit the audio data packets to the server. The server receives the audio data packets respectively sent by the terminals 1 to 4. And the unpacking module in the server unpacks the received 4 audio data packets to obtain the coded audio signals and the voice characteristic information. The routing module in the server selects 3 target audio signals from the coded audio signals in the 4 audio data packets based on the voice characteristic information, wherein the 3 target audio signals are respectively a coded audio signal 1, a coded audio signal 2 and a coded audio signal 4, and then sends the voice mark corresponding to each target audio signal to the packing module of the server. A decoding module in the server decodes the 3 target audio signals to obtain 3 PCM digital signals, which are respectively a PCM digital signal 1, a PCM digital signal 2 and a PCM digital signal 4. And a sound mixing module and an encoding module in the server sequentially carry out sound mixing and encoding on the 3 PCM digital signals to obtain an encoded first mixed audio signal. And a packaging module of the server packages the coded first mixed audio signal and the voice marks corresponding to the 3 target audio signals to obtain a first mixed audio data packet. And a sound mixing module and an encoding module in the server sequentially perform sound mixing and encoding on the PCM digital signal 1 and the PCM digital signal 2 to obtain an encoded second mixed audio signal. And a packaging module of the server packages the coded second mixed audio signal and the voice marks corresponding to the coded audio signal 1 and the coded audio signal 2 to obtain a second mixed audio data packet. Since one terminal needs to receive the audio signals of 4 other terminals at most, a 4-bit integer field in the RTP packet can be used to store the voice flag, where 0 indicates no voice and 1 indicates voice.
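A minimal Python sketch of packing the per-signal voice flags into such a 4-bit field follows; letting bit i stand for the i-th mixed-in terminal is an assumed ordering, not one fixed by the application.

```python
def pack_voice_flags(flags):
    """Pack up to 4 per-signal voice flags (1 = voice, 0 = no voice)
    into the 4-bit field described above."""
    value = 0
    for i, flag in enumerate(flags[:4]):
        value |= (flag & 1) << i
    return value

def unpack_voice_flags(value, count):
    return [(value >> i) & 1 for i in range(count)]

# e.g. the flags for encoded audio signals 1, 2 and 4 are all voiced:
assert pack_voice_flags([1, 1, 1]) == 0b0111
```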
The server sends the first mixed audio data packet to terminal 3 and terminal 5. On receiving it, terminal 3 and terminal 5 display the user avatars corresponding to terminal 1, terminal 2 and terminal 4 in highlighted form; as shown in fig. 11a, the users corresponding to terminals 1, 2 and 4 are user 1, user 2 and user 4 of the conference, respectively. The server sends the second mixed audio data packet to terminal 4. On receiving it, terminal 4 displays the user avatars corresponding to terminal 1 and terminal 2 in highlighted form; as shown in fig. 11b, the users corresponding to terminals 1 and 2 are user 1 and user 2 of the conference, respectively.
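On the receiving side, the handling might look like the following sketch, where MixedPacket and the ui object with its highlight_avatar/reset_avatar methods are hypothetical stand-ins for the terminal's packet and UI layers.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class MixedPacket:
    payload: bytes               # the encoded mixed audio signal
    voice_flags: Dict[int, int]  # terminal id -> 1 (has voice) / 0 (no voice)

def on_mixed_packet(packet: MixedPacket, ui) -> None:
    """Highlight the avatar of every user whose terminal contributed voice."""
    for terminal_id, has_voice in packet.voice_flags.items():
        if has_voice:
            ui.highlight_avatar(terminal_id)  # could also bounce the avatar
        else:
            ui.reset_avatar(terminal_id)
```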
The server adds the voice flags corresponding to the target audio signals to the mixed audio data packets and sends the packets to the terminals in call connection with it; a terminal that receives such a packet displays the avatars of the users whose terminals are sending voice in a highlighted, bouncing or similar form, so that every user knows at once who is speaking, which improves the interactive experience.
To better explain the embodiments of the present application, the audio processing method provided by the embodiments is described below with reference to a specific implementation scenario. The method is executed interactively by terminals and a server, where the server includes an unpacking module, a routing module, a packet caching module, a decoding module, an audio caching module, a mixing module, an encoding module and a packing module; each module may run on an independent server host or independent CPU hardware, or all modules may be integrated on one server or one piece of CPU hardware, which is not specifically limited in the embodiments of the present application.
Let M be 3 and let 5 terminals, terminal 1 to terminal 5, be in call connection with the server, where terminal 5 has muted its microphone and sends no audio data packets. Terminals 1 to 4 each send audio data packets to the server; each packet contains an encoded audio signal and voice feature information, the latter obtained by extracting voice features from the collected audio signal. The process by which terminal 1 generates an audio data packet is shown in fig. 12. Terminal 1 collects an audio signal through its microphone, performs echo cancellation and denoising on it, and then extracts voice features from the cleaned signal to obtain the voice feature information. Terminal 1 applies automatic gain control to bring the audio signal to a suitable volume, compresses and encodes it to obtain an encoded audio signal, packs the encoded audio signal together with the voice feature information, and sends the resulting audio data packet to the server over RTP, with the voice feature information embedded in a designated field of the RTP packet header.
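A minimal sketch of this terminal-side pipeline follows. The aec, denoise and agc helpers are trivial stand-ins for real audio components, encode stands in for a speech codec, and the energy threshold behind the voice flag is an assumed value rather than one specified by the disclosure.

```python
import struct

def aec(frame):      # stand-in for acoustic echo cancellation
    return frame

def denoise(frame):  # stand-in for noise suppression
    return frame

def agc(frame):      # stand-in for automatic gain control
    return frame

def encode(frame):   # stand-in for a speech codec such as Opus
    return struct.pack(f"!{len(frame)}h", *frame)

def voice_features(frame):
    """Audio energy plus a simple energy-threshold voice flag."""
    energy = sum(s * s for s in frame) / max(len(frame), 1)
    return energy, 1 if energy > 1e4 else 0   # threshold is an assumption

def build_audio_packet(mic_frame):
    clean = denoise(aec(mic_frame))        # remove echo and noise first
    energy, flag = voice_features(clean)   # features taken from the cleaned signal
    payload = encode(agc(clean))           # level the volume, then compress
    header = struct.pack("!fB", energy, flag)  # features ride ahead of the payload
    return header + payload
```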
The server receives the audio data packets sent by terminals 1 to 4, and its unpacking module unpacks the 4 received packets to obtain the encoded audio signals and the voice feature information, where the voice feature information comprises audio energy and a voice flag. The routing module first screens out the encoded audio signals whose voice flag indicates voice; here all 4 voice flags indicate voice, so the 4 encoded audio signals are sorted by audio energy in descending order. Let the sorted result be: encoded audio signal 1 sent by terminal 1, encoded audio signal 2 sent by terminal 2, encoded audio signal 4 sent by terminal 4, and encoded audio signal 3 sent by terminal 3. Encoded audio signals 1, 2 and 4 are then taken as the target audio signals. The packet caching module in the server caches the 3 target audio signals, and the routing module sends the voice flags corresponding to the 3 target audio signals to the packing module. The decoding module decodes the 3 target audio signals into 3 PCM digital signals: PCM digital signal 1, PCM digital signal 2 and PCM digital signal 4, which the audio caching module caches.
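The routing step thus reduces to a filter on the voice flag, a sort by audio energy and a truncation to M entries, as the following sketch illustrates; the class and field names are chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List

M = 3  # number of streams selected for mixing

@dataclass
class IncomingPacket:
    terminal_id: int
    energy: float        # audio energy from the voice feature information
    voice_flag: int      # 1 = has voice, 0 = no voice
    encoded_audio: bytes

def select_targets(packets: List[IncomingPacket]) -> List[IncomingPacket]:
    # Keep only packets whose voice flag says "has voice" ...
    voiced = [p for p in packets if p.voice_flag == 1]
    # ... sort them by audio energy in descending order ...
    voiced.sort(key=lambda p: p.energy, reverse=True)
    # ... and take the top M as the target audio signals.
    return voiced[:M]

# With the four packets above, terminals 1, 2 and 4 (the three highest
# energies among the voiced streams) are selected; terminal 3 is not.
```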
The mixing module in the server mixes the 3 decoded PCM digital signals to obtain a first mixed PCM digital signal, and the encoding module encodes it to obtain a first encoded mixed audio signal. The packing module packs the first encoded mixed audio signal together with the voice flags of the 3 target audio signals into a first mixed audio data packet and sends it to terminal 3 and terminal 5, as shown in fig. 13. On receiving the first mixed audio data packet, terminal 3 and terminal 5 display the user avatars corresponding to terminal 1, terminal 2 and terminal 4 in highlighted form.
The mixing module mixes PCM digital signal 2 and PCM digital signal 4 to obtain the second mixed PCM digital signal for terminal 1, and the encoding module encodes it to obtain the second encoded mixed audio signal for terminal 1. The packing module packs this signal together with the voice flags of encoded audio signal 2 (sent by terminal 2) and encoded audio signal 4 (sent by terminal 4) into the second mixed audio data packet for terminal 1 and sends it to terminal 1, as shown in fig. 14. On receiving it, terminal 1 displays the user avatars corresponding to terminal 2 and terminal 4 in highlighted form.
Likewise, the mixing module mixes PCM digital signal 1 and PCM digital signal 4 to obtain the second mixed PCM digital signal for terminal 2, the encoding module encodes it, and the packing module packs it together with the voice flags of encoded audio signal 1 (sent by terminal 1) and encoded audio signal 4 (sent by terminal 4) into the second mixed audio data packet for terminal 2 and sends it to terminal 2, as shown in fig. 14. On receiving it, terminal 2 displays the user avatars corresponding to terminal 1 and terminal 4 in highlighted form.
Finally, the mixing module mixes PCM digital signal 1 and PCM digital signal 2 to obtain the second mixed PCM digital signal for terminal 4, the encoding module encodes it, and the packing module packs it together with the voice flags of encoded audio signal 1 (sent by terminal 1) and encoded audio signal 2 (sent by terminal 2) into the second mixed audio data packet for terminal 4 and sends it to terminal 4, as shown in fig. 14. On receiving it, terminal 4 displays the user avatars corresponding to terminal 1 and terminal 2 in highlighted form.
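Taken together, these steps amount to one full mix for the non-selected listeners plus one mix-minus per selected terminal, so that no speaker hears their own voice. The sketch below illustrates this under the assumption of sample-wise summation with saturation clipping as the mixing policy; the function names are illustrative, not part of the disclosure.

```python
from typing import Dict, List

def mix(pcm_streams: List[List[int]]) -> List[int]:
    """Sample-wise sum of PCM streams, clipped to the 16-bit range."""
    length = min(len(s) for s in pcm_streams)
    return [max(-32768, min(32767, sum(s[i] for s in pcm_streams)))
            for i in range(length)]

def build_mixes(selected: Dict[int, List[int]]):
    """selected maps terminal id -> decoded PCM for each of the M target signals."""
    first_mix = mix(list(selected.values()))   # heard by the non-selected terminals
    second_mixes = {                           # each speaker gets the other M-1 streams
        tid: mix([pcm for other, pcm in selected.items() if other != tid])
        for tid in selected
    }
    return first_mix, second_mixes

# For selected terminals {1, 2, 4}: terminal 1 receives mix(2, 4),
# terminal 2 receives mix(1, 4), and terminal 4 receives mix(1, 2),
# matching fig. 14.
```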
The terminal performs voice feature extraction on the collected audio signal to obtain voice feature information and then sends an audio data packet containing that information to the server; the server obtains the voice feature information directly from the received audio data packets and performs route selection based on it.
Based on the same technical concept, an embodiment of the present application provides a server, as shown in fig. 15, where the server 1500 includes:
the receiving module 1501 is configured to receive audio data packets sent by N terminals, where each audio data packet includes voice feature information and a coded audio signal, and the voice feature information is obtained by performing voice feature extraction on an acquired audio signal;
a screening module 1502, configured to select, from the encoded audio signals in each audio data packet, target audio signals sent by M terminals according to the voice feature information corresponding to the encoded audio signals in each audio data packet, where M is a positive integer smaller than N;
a first processing module 1503, configured to perform mixing processing based on the target audio signals sent by the M terminals.
Optionally, the voice feature information at least includes audio energy, and the audio energy is energy of voice in the audio signal.
Optionally, the voice feature information further includes a voice flag, the voice flag being either obtained by the terminal performing voice detection on the collected audio signal and then sent to the server, or obtained by the server performing voice detection on the received encoded audio signal.
Optionally, screening module 1502 is specifically configured to:
screening out, from the encoded audio signals in each audio data packet, the audio signals whose voice flag indicates voice;
sorting the screened-out audio signals in descending order of audio energy; and
determining the top M audio signals as the target audio signals.
Optionally, the first processing module 1503 is specifically configured to:
sequentially decoding, mixing and encoding the target audio signals sent by the M terminals to obtain a first mixed audio data packet, and sending the first mixed audio data packet to each terminal other than the M terminals; and
for each target terminal among the M terminals, sequentially decoding, mixing and encoding the target audio signals sent by the other M-1 terminals, excluding the target audio signal sent by that target terminal, to obtain a second mixed audio data packet, and sending the second mixed audio data packet to the target terminal.
Optionally, the first mixed audio data packet further includes the voice flags corresponding to the M target audio signals, and the second mixed audio data packet further includes the voice flags corresponding to the M-1 target audio signals.
Optionally, the first processing module 1503 is specifically configured to:
sequentially decoding, mixing and encoding the target audio signals sent by the M terminals to obtain a mixed audio data packet, and sending the mixed audio data packet to each terminal in call connection with the server, where the mixed audio data packet further includes the voice flags corresponding to the M target audio signals.
Based on the same technical concept, an embodiment of the present application provides a terminal, as shown in fig. 16, where the terminal 1600 includes:
the feature extraction module 1601 is configured to perform voice feature extraction on the acquired audio signal to obtain voice feature information;
a second processing module 1602, configured to perform automatic gain control and encoding on the collected audio signal, and then pack the encoded audio signal and the voice feature information to obtain an audio data packet;
a sending module 1603, configured to send the audio data packet to a server, so that the server selects, according to the voice feature information corresponding to the encoded audio signals in the audio data packets sent by N terminals, target audio signals sent by M terminals from the encoded audio signals in each audio data packet, where M is a positive integer smaller than N, and performs mixing processing based on the target audio signals sent by the M terminals.
Based on the same technical concept, an embodiment of the present application provides an audio processing system, as shown in fig. 17, the audio processing system 1700 includes:
a server 1701 and N terminals 1702;
each terminal 1702 of the N terminals 1702 is configured to perform voice feature extraction on the collected audio signal to obtain voice feature information, perform automatic gain control and encoding on the collected audio signal, pack the encoded audio signal and the voice feature information to obtain an audio data packet, and send the audio data packet to the server 1701;
the server 1701 is configured to select, according to the voice feature information corresponding to the encoded audio signals in the audio data packets sent by the N terminals 1702, target audio signals sent by M terminals 1702 from the encoded audio signals in each audio data packet, where M is a positive integer smaller than N, and to perform mixing processing based on the target audio signals sent by the M terminals 1702.
Based on the same technical concept, the embodiment of the present application provides a computer device, as shown in fig. 18, including at least one processor 1801 and a memory 1802 connected to the at least one processor, where a specific connection medium between the processor 1801 and the memory 1802 is not limited in this embodiment, and the processor 1801 and the memory 1802 in fig. 18 are connected through a bus as an example. The bus may be divided into an address bus, a data bus, a control bus, etc.
In this embodiment, the memory 1802 stores instructions executable by the at least one processor 1801, and the at least one processor 1801 may execute the steps included in the foregoing audio processing method by executing the instructions stored in the memory 1802.
The processor 1801 is the control center of the computer device; it connects to various parts of the computer device through various interfaces and lines, and performs audio processing by running or executing the instructions stored in the memory 1802 and calling the data stored in the memory 1802. Optionally, the processor 1801 may include one or more processing units and may integrate an application processor, which mainly handles the operating system, user interface, application programs and the like, and a modem processor, which mainly handles wireless communication. It should be understood that the modem processor may also not be integrated into the processor 1801. In some embodiments, the processor 1801 and the memory 1802 may be implemented on the same chip, or they may be implemented separately on their own chips.
The processor 1801 may be a general-purpose processor, such as a Central Processing Unit (CPU), a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present Application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in a processor.
The memory 1802, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs and modules. The memory 1802 may include at least one type of storage medium, for example a flash memory, a hard disk, a multimedia card, a card-type memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a magnetic memory, a magnetic disk, an optical disc, and so on. The memory 1802 may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 1802 in the embodiments of the present application may also be a circuit or any other device capable of performing a storage function, for storing program instructions and/or data.
Based on the same inventive concept, embodiments of the present application provide a computer-readable storage medium storing a computer program executable by a computer device; when the program runs on the computer device, it causes the computer device to perform the steps of the audio processing method described above.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (15)

1. An audio processing method, comprising:
receiving audio data packets sent by N terminals, wherein each audio data packet comprises voice feature information and an encoded audio signal, the voice feature information being obtained by performing voice feature extraction on a collected audio signal;
selecting, according to the voice feature information corresponding to the encoded audio signals in each audio data packet, target audio signals sent by M terminals from the encoded audio signals in each audio data packet, wherein M is a positive integer smaller than N; and
performing mixing processing based on the target audio signals sent by the M terminals.
2. The method of claim 1, wherein the voice feature information includes at least audio energy, the audio energy being the energy of voice in an audio signal.
3. The method of claim 2, wherein the voice feature information further includes a voice flag, the voice flag being either obtained by the terminal performing voice detection on the collected audio signal and then sent to the server, or obtained by the server performing voice detection on the received encoded audio signal.
4. The method as claimed in claim 3, wherein the selecting, according to the voice feature information corresponding to the encoded audio signals in each audio data packet, target audio signals sent by M terminals from the encoded audio signals in each audio data packet comprises:
screening out, from the encoded audio signals in each audio data packet, the audio signals whose voice flag indicates voice;
sorting the screened-out audio signals in descending order of audio energy; and
determining the top M audio signals as the target audio signals.
5. The method according to any one of claims 1 to 4, wherein the performing mixing processing based on the target audio signals sent by the M terminals specifically comprises:
sequentially decoding, mixing and encoding the target audio signals sent by the M terminals to obtain a first mixed audio data packet, and sending the first mixed audio data packet to each terminal other than the M terminals; and
for each target terminal among the M terminals, sequentially decoding, mixing and encoding the target audio signals sent by the other M-1 terminals, excluding the target audio signal sent by that target terminal, to obtain a second mixed audio data packet, and sending the second mixed audio data packet to the target terminal.
6. The method of claim 5, wherein the first mixed audio data packet further includes the voice flags corresponding to the M target audio signals, and the second mixed audio data packet further includes the voice flags corresponding to the M-1 target audio signals.
7. The method according to any one of claims 1 to 4, wherein the performing mixing processing based on the target audio signals sent by the M terminals comprises:
sequentially decoding, mixing and encoding the target audio signals sent by the M terminals to obtain a mixed audio data packet, and sending the mixed audio data packet to each terminal in call connection with the server, wherein the mixed audio data packet further comprises the voice flags corresponding to the M target audio signals.
8. An audio processing method, comprising:
performing voice feature extraction on a collected audio signal to obtain voice feature information;
performing automatic gain control and encoding on the collected audio signal, and then packing the encoded audio signal and the voice feature information to obtain an audio data packet; and
sending the audio data packet to a server, so that the server selects, according to the voice feature information corresponding to the encoded audio signals in the audio data packets sent by N terminals, target audio signals sent by M terminals from the encoded audio signals in each audio data packet, wherein M is a positive integer smaller than N, and performs mixing processing based on the target audio signals sent by the M terminals.
9. A server, comprising:
a receiving module, configured to receive audio data packets sent by N terminals, wherein each audio data packet comprises voice feature information and an encoded audio signal, the voice feature information being obtained by performing voice feature extraction on a collected audio signal;
a screening module, configured to select, according to the voice feature information corresponding to the encoded audio signals in each audio data packet, target audio signals sent by M terminals from the encoded audio signals in each audio data packet, wherein M is a positive integer smaller than N; and
a first processing module, configured to perform mixing processing based on the target audio signals sent by the M terminals.
10. The server according to claim 9, wherein the screening module is specifically configured to:
screening out, from the encoded audio signals in each audio data packet, the audio signals whose voice flag indicates voice;
sorting the screened-out audio signals in descending order of audio energy; and
determining the top M audio signals as the target audio signals.
11. The server according to claim 9 or 10, wherein the first processing module is specifically configured to:
sequentially decoding, mixing and encoding the target audio signals sent by the M terminals to obtain a first mixed audio data packet, and sending the first mixed audio data packet to each terminal other than the M terminals; and
for each target terminal among the M terminals, sequentially decoding, mixing and encoding the target audio signals sent by the other M-1 terminals, excluding the target audio signal sent by that target terminal, to obtain a second mixed audio data packet, and sending the second mixed audio data packet to the target terminal.
12. A terminal, comprising:
a feature extraction module, configured to perform voice feature extraction on a collected audio signal to obtain voice feature information;
a second processing module, configured to perform automatic gain control and encoding on the collected audio signal, and then pack the encoded audio signal and the voice feature information to obtain an audio data packet; and
a sending module, configured to send the audio data packet to a server, so that the server selects, according to the voice feature information corresponding to the encoded audio signals in the audio data packets sent by N terminals, target audio signals sent by M terminals from the encoded audio signals in each audio data packet, wherein M is a positive integer smaller than N, and performs mixing processing based on the target audio signals sent by the M terminals.
13. An audio processing system, comprising:
a server and N terminals;
each terminal of the N terminals is configured to perform voice feature extraction on a collected audio signal to obtain voice feature information, perform automatic gain control and encoding on the collected audio signal, pack the encoded audio signal and the voice feature information to obtain an audio data packet, and send the audio data packet to the server; and
the server is configured to select, according to the voice feature information corresponding to the encoded audio signals in the audio data packets sent by the N terminals, target audio signals sent by M terminals from the encoded audio signals in each audio data packet, wherein M is a positive integer smaller than N, and to perform mixing processing based on the target audio signals sent by the M terminals.
14. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein, when the program is executed by the processor, the steps of the method of any one of claims 1 to 7, or the steps of the method of claim 8, are performed.
15. A computer-readable storage medium having stored thereon a computer program executable by a computer device, wherein, when the program runs on the computer device, the program causes the computer device to perform the steps of the method of any one of claims 1 to 7, or the steps of the method of claim 8.
CN202010360286.XA 2020-04-30 2020-04-30 Audio processing method, terminal and server Pending CN111276152A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010360286.XA CN111276152A (en) 2020-04-30 2020-04-30 Audio processing method, terminal and server

Publications (1)

Publication Number Publication Date
CN111276152A true CN111276152A (en) 2020-06-12

Family

ID=71001001

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010360286.XA Pending CN111276152A (en) 2020-04-30 2020-04-30 Audio processing method, terminal and server

Country Status (1)

Country Link
CN (1) CN111276152A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012021574A2 (en) * 2010-08-10 2012-02-16 Blabbelon, Inc. Highly scalable voice conferencing service
CN103338348A (en) * 2013-07-17 2013-10-02 天脉聚源(北京)传媒科技有限公司 Implementation method, system and server for audio-video conference over internet
CN105812713A (en) * 2014-08-28 2016-07-27 三星Sds株式会社 Method for extending participants of multiparty video conference service and MCU gateway
CN105704338A (en) * 2016-03-21 2016-06-22 腾讯科技(深圳)有限公司 Audio mixing method, audio mixing equipment and system
CN111049848A (en) * 2019-12-23 2020-04-21 腾讯科技(深圳)有限公司 Call method, device, system, server and storage medium

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111951821A (en) * 2020-08-13 2020-11-17 腾讯科技(深圳)有限公司 Call method and device
CN111951821B (en) * 2020-08-13 2023-10-24 腾讯科技(深圳)有限公司 Communication method and device
CN112118264A (en) * 2020-09-21 2020-12-22 苏州科达科技股份有限公司 Conference sound mixing method and system
CN114242067A (en) * 2021-11-03 2022-03-25 北京百度网讯科技有限公司 Speech recognition method, apparatus, device and storage medium
CN114285830A (en) * 2021-12-21 2022-04-05 北京百度网讯科技有限公司 Voice signal processing method and device, electronic equipment and readable storage medium
CN114285830B (en) * 2021-12-21 2024-05-24 北京百度网讯科技有限公司 Voice signal processing method, device, electronic equipment and readable storage medium
CN116609726A (en) * 2023-05-11 2023-08-18 钉钉(中国)信息技术有限公司 Sound source positioning method and device

Similar Documents

Publication Publication Date Title
CN111276152A (en) Audio processing method, terminal and server
CN111049996B (en) Multi-scene voice recognition method and device and intelligent customer service system applying same
US8817061B2 (en) Recognition of human gestures by a mobile phone
KR101353847B1 (en) Method and apparatus for detecting and suppressing echo in packet networks
US8412819B2 (en) Dynamically enabling features of an application based on user status
US9311920B2 (en) Voice processing method, apparatus, and system
CN108370580A (en) Match user equipment and network scheduling period
US20140369528A1 (en) Mixing decision controlling decode decision
CN103973542B (en) A kind of voice information processing method and device
CN112185362A (en) Voice processing method and device for user personalized service
CN107623830A (en) A kind of video call method and electronic equipment
EP2158753B1 (en) Selection of audio signals to be mixed in an audio conference
US9059860B2 (en) Techniques for announcing conference attendance changes in multiple languages
CN112260982A (en) Audio processing method and device
US9042537B2 (en) Progressive, targeted, and variable conference feedback tone
CN112802485B (en) Voice data processing method and device, computer equipment and storage medium
CN115623126A (en) Voice call method, system, device, computer equipment and storage medium
US9204093B1 (en) Interactive combination of game data and call setup
US8782271B1 (en) Video mixing using video speech detection
CN113573004A (en) Video conference processing method and device, computer equipment and storage medium
CN112203039A (en) Processing method and device for online conference, electronic equipment and computer storage medium
CN112433697A (en) Resource display method and device, electronic equipment and storage medium
CN112751819B (en) Processing method and device for online conference, electronic equipment and computer readable medium
US11830120B2 (en) Speech image providing method and computing device for performing the same
CN114793295B (en) Video processing method and device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (ref country code: HK; ref legal event code: DE; ref document number: 40023675)
RJ01 Rejection of invention patent application after publication (application publication date: 20200612)