CN112509603A - Voice quality assessment method, device and system - Google Patents

Info

Publication number
CN112509603A
CN112509603A (application CN202011400170.0A)
Authority
CN
China
Prior art keywords
voice quality
calls
call
quality evaluation
mos
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011400170.0A
Other languages
Chinese (zh)
Other versions
CN112509603B (en)
Inventor
吕非彼
朱佳佳
田元兵
乔金剑
刘亮
马昱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China United Network Communications Group Co Ltd
Original Assignee
China United Network Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China United Network Communications Group Co Ltd
Priority to CN202011400170.0A
Publication of CN112509603A
Application granted
Publication of CN112509603B
Active legal status: Current
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04W - WIRELESS COMMUNICATION NETWORKS
    • H04W24/00 - Supervisory, monitoring or testing arrangements
    • H04W24/08 - Testing, supervising or monitoring using real traffic
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 - Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 - Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application provides a voice quality assessment method, apparatus, and system in the field of communications technology that can ensure the accuracy of voice quality assessment while reducing its cost. The method includes the following steps: receiving recording information of M calls collected by terminal equipment, where the recording information of a call includes the call corpus of the call, a call connection timestamp, and a call release timestamp, and M is greater than or equal to 1; and inputting the recording information of the M calls into a MOS algorithm model to calculate the voice quality values of the M calls, where the MOS algorithm model supports parallel evaluation of the voice quality of N calls, and N is greater than or equal to 1.

Description

Voice quality assessment method, device and system
Technical Field
Embodiments of the present application relate to the field of communications technologies, and in particular, to a method, an apparatus, and a system for evaluating voice quality.
Background
With the continuous development of communication technology, requirements on call quality keep rising. To accurately evaluate voice quality during a call, the industry currently evaluates the voice quality of terminal calls mainly with a standard mean opinion score (MOS) voice quality evaluation scheme and a non-MOS voice quality evaluation scheme.
The main principle of the standard MOS voice quality evaluation scheme is as follows: a testing computer controls a calling terminal to call a called terminal; when the call is connected, a preset standard corpus is played on the calling terminal side; the called terminal answers in real time and sends the answered content to a MOS box; the MOS box then compares the received speech with the standard corpus in real time through a MOS algorithm model and calculates the MOS score of the speech.
The main principle of the non-MOS voice quality evaluation scheme is as follows: the testing computer controls the calling terminal to call the called terminal; during the voice call between the calling terminal and the called terminal, the testing computer performs deep packet inspection (DPI) on real-time transport protocol (RTP) packets, calculates indicators such as the packet loss rate, delay, and jitter of the voice call, and then uses a proprietary (own) algorithm to calculate the voice quality of the call from those indicators.
It can be seen that the standard MOS voice quality assessment scheme requires a MOS box to be configured when assessing voice quality, which gives the scheme a high procurement cost.
For the non-MOS voice quality evaluation scheme, although no MOS box is needed and the cost is low, a proprietary algorithm is generally used in place of the standard MOS algorithm, so the accuracy of the voice quality assessment cannot be guaranteed.
Therefore, there is currently no voice quality assessment scheme that both guarantees the accuracy of voice assessment and has low cost.
Disclosure of Invention
The application provides a voice quality assessment method, device and system, which can ensure the accuracy of voice assessment and reduce the cost of voice quality assessment.
The technical scheme is as follows:
in a first aspect, the present application provides a voice quality assessment method that may be applied to a voice quality assessment server. The method may include: receiving recording information of M long-term evolution voice bearer calls collected by terminal equipment, where the recording information of a call includes the call corpus of the call, a call connection timestamp, and a call release timestamp, and M is greater than or equal to 1; and inputting the recording information of the M calls into a mean opinion score (MOS) algorithm model to calculate the voice quality values of the M calls, where the MOS algorithm model supports parallel computation of the voice quality values of N calls, and N is greater than or equal to 1.
In the voice quality evaluation method provided by this application, the MOS algorithm model is integrated directly into the voice quality evaluation server; after receiving the recording information of calls collected by terminal equipment, the server inputs that recording information into the MOS algorithm model and calculates the voice quality value of each call. On the one hand, because the voice quality value is still calculated with the MOS algorithm, the method meets the industry standard and the accuracy of voice quality evaluation is guaranteed; on the other hand, the cost of purchasing MOS boxes is eliminated, reducing the cost of voice quality evaluation.
With reference to the first aspect, in a possible implementation, if M is greater than N and N is greater than 1, inputting the recording information of the M calls into the MOS algorithm model and calculating the voice quality values of the M calls may include: distributing the recording information of the M calls to N voice quality evaluation queues; and inputting the recording information of the calls in the N voice quality evaluation queues to the MOS algorithm model in parallel to calculate the voice quality values of the M calls. In this possible implementation, if the number of received call recordings is large, for example greater than the number of voice quality values the voice quality evaluation server supports calculating in parallel, the voice quality values are calculated in a queued manner, which improves the processing efficiency of the voice quality evaluation server.
With reference to the first aspect or the foregoing possible implementation, in another possible implementation, inputting the recording information of the calls in the N voice quality evaluation queues to the MOS algorithm model in parallel and calculating the voice quality values of the M calls may include: for a first voice quality evaluation queue, serially inputting the recording information of the one or more calls in the first voice quality evaluation queue to the MOS algorithm model in a preset order, and calculating the voice quality values of those calls, where the first voice quality evaluation queue is any one of the N voice quality evaluation queues. In this possible implementation, the voice quality values of the calls in each queue are processed serially, and the preset order of serial processing can be configured according to actual requirements, which improves processing flexibility and keeps the processing orderly and efficient.
With reference to the first aspect or any one of the foregoing possible implementations, in another possible implementation, the preset order includes: descending order of call priority. In this possible implementation, the voice quality values of the calls in each queue are processed in descending order of call priority, so that high-priority calls are processed first.
With reference to the first aspect or any one of the foregoing possible implementations, in another possible implementation, the call may include: a voice over long-term evolution (VoLTE) call; or a voice over new radio (VoNR) call.
In a second aspect, the present application further provides a voice quality assessment apparatus, which may be the voice quality assessment server in the foregoing first aspect or any one of the possible implementations of the first aspect, or may be deployed in the voice quality assessment server. The apparatus may include a receiving unit and a processing unit. Wherein:
the receiving unit may be configured to receive the recording information of M long-term evolution voice bearer calls collected by the terminal equipment, where the recording information of a call includes the call corpus of the call, a call connection timestamp, and a call release timestamp, and M is greater than or equal to 1.
The processing unit may be configured to input the recording information of the M calls into the MOS algorithm model and calculate the voice quality values of the M calls, where the MOS algorithm model supports parallel computation of the voice quality values of N calls, and N is greater than or equal to 1.
It should be noted that the speech quality assessment apparatus provided in the second aspect is configured to execute the speech quality assessment method provided in the first aspect or any one of its possible implementations; for its specific implementation, reference may be made to the specific implementation of the first aspect, and details are not repeated here.
In a third aspect, the present application provides a voice quality assessment server. The server may include a processor configured to implement the voice quality assessment method described in the first aspect. The server may further include a memory coupled to the processor; when executing the instructions stored in the memory, the processor can implement the voice quality assessment method described in the first aspect or any of its possible implementations. The server may also include a communication interface for communicating with other devices; the communication interface may be, for example, a transceiver, a circuit, a bus, a module, or another type of communication interface. In one possible implementation, the server may include:
a memory may be used to store instructions.
The processor may be configured to input the recording information of the M calls into the MOS algorithm model and calculate the voice quality values of the M calls, where the MOS algorithm model supports parallel computation of the voice quality values of N calls, and N is greater than or equal to 1.
The processor may also be configured to receive the recording information of M long-term evolution voice bearer calls collected by the terminal equipment, where the recording information of a call includes the call corpus of the call, a call connection timestamp, and a call release timestamp, and M is greater than or equal to 1.
In the present application, the instructions in the memory may be stored in advance, or may be downloaded from the internet and stored when the apparatus is used. The coupling in the embodiments of the present application is an indirect coupling or connection between devices, units or modules, which may be in an electrical, mechanical or other form, and is used for information interaction between the devices, units or modules.
In a fourth aspect, a speech quality assessment system is provided, which may include a speech quality assessment apparatus and a terminal device, where the speech quality assessment apparatus may be the apparatus in the second aspect or any possible implementation manner of the second aspect.
In a fifth aspect, a voice quality assessment system is provided, where the system may include a voice quality assessment server and a terminal device, and the voice quality assessment server may be the device in any possible implementation manner of the third aspect or the third aspect.
In a sixth aspect, an embodiment of the present application further provides a computer-readable storage medium including instructions that, when run on a computer, cause the computer to perform the voice quality assessment method according to any one of the above aspects or any one of the possible implementations.
In a seventh aspect, an embodiment of the present application further provides a computer program product, which when run on a computer, causes the computer to execute the voice quality assessment method according to any one of the above aspects or any one of the possible implementations.
In an eighth aspect, an embodiment of the present application provides a chip system, where the chip system includes a processor and may further include a memory, and is configured to implement the functions performed by the voice quality assessment server in the foregoing method. The chip system may be formed by a chip, and may also include a chip and other discrete devices.
The solutions provided in the second aspect to the eighth aspect are used for implementing the voice quality assessment method provided in the first aspect, and therefore, the same beneficial effects as those of the first aspect can be achieved, and are not described herein again.
It should be noted that, on the premise of not contradicting the scheme, various possible implementation manners of any one of the above aspects may be combined.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
The connecting lines in the figures indicate only that communication is possible between two devices; the specific communication mode may be wireless or wired and can be determined according to the actual situation.
Fig. 1 is a schematic structural diagram of a standard MOS voice quality assessment scenario provided in the prior art;
fig. 2 is a schematic structural diagram of a non-MOS speech quality assessment scenario provided in the prior art;
fig. 3 is a schematic structural diagram of a network architecture according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a voice quality assessment server according to an embodiment of the present application;
fig. 5 is a schematic flowchart of a speech quality assessment method according to an embodiment of the present application;
fig. 6 is a schematic flowchart of another speech quality assessment method according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a speech quality assessment apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of another voice quality assessment server according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the embodiments of the present application, for convenience of clearly describing the technical solutions, words such as "first" and "second" are used to distinguish between identical or similar items having substantially the same functions and effects. Those skilled in the art will appreciate that the words "first", "second", and the like do not limit quantity or execution order, and the technical features described as "first" and "second" have no order of precedence or size.
In the embodiments of the present application, words such as "exemplary" or "for example" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present relevant concepts in a concrete fashion for ease of understanding.
In the description of the present application, a "/" indicates a relationship in which the objects associated before and after are an "or", for example, a/B may indicate a or B; in the present application, "and/or" is only an association relationship describing an associated object, and means that there may be three relationships, for example, a and/or B, and may mean: a exists alone, A and B exist simultaneously, and B exists alone, wherein A and B can be singular or plural. Also, in the description of the present application, "a plurality" means two or more than two unless otherwise specified. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of the singular or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or multiple.
In the embodiments of the present application, at least one may also be described as one or more, and a plurality may be two, three, four or more, which is not limited in the present application.
For ease of understanding, the existing speech quality assessment scheme is first introduced.
Firstly, a standard MOS voice quality evaluation scheme is briefly explained.
As shown in fig. 1, the standard MOS voice quality evaluation system mainly includes a testing computer and one or more groups of voice quality testing devices. A group of voice quality testing devices may include a calling terminal, a called terminal, and a MOS box. The calling terminal and the called terminal are each connected to, and communicate with, the test computer; the MOS box is connected to, and communicates with, the called terminal in its group of voice quality testing devices; and the calling terminal in a group of voice quality testing devices can talk with the called terminal in the same group.
Specifically, the test computer sends an instruction 1 to the calling terminal; the calling terminal receives instruction 1 and initiates a call to the called terminal as instructed; the called terminal answers the call initiated by the calling terminal; after the calling terminal detects that the called terminal has answered, it plays a pre-stored standard corpus into the call; the called terminal listens to the call content in real time and sends the call content (the call corpus) to the MOS box in real time; the MOS box receives the call corpus sent by the called terminal, compares the received call corpus with the standard corpus through the MOS algorithm model, and calculates the MOS value of the call; and the MOS box forwards the calculated MOS value of the call to the test computer through the called terminal.
A brief description of a non-MOS speech quality assessment scheme will now be provided.
As shown in fig. 2, the non-MOS speech quality assessment system mainly includes a testing computer, a plurality of calling terminals, and a plurality of called terminals. The calling terminals and the called terminals can each be connected to and communicate with the testing computer, and a calling terminal can also talk with a called terminal.
Specifically, a calling terminal sends a service establishment request to the test computer to request a voice call with a called terminal. When the test computer receives the service establishment request and determines that the corresponding service type is a voice call, on the one hand it instructs the calling terminal to initiate the voice call to the called terminal; the called terminal answers; after detecting that the called terminal has answered, the calling terminal encodes the call content as a payload, loads it into the content portion of a real-time transport protocol (RTP) packet, encapsulates it with the corresponding RTP packet header, and transmits it to the called terminal; after receiving the data, the called terminal performs protocol parsing and decoding to restore the speech content (the call corpus). On the other hand, the test computer counts the packet loss rate of the RTP packets sent by the calling terminal during the call between the calling terminal and the called terminal, and evaluates the voice quality of the call from the RTP packet loss rate using a proprietary algorithm.
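For illustration only, the following Python sketch shows one way such a proprietary ("own") algorithm might map RTP-derived packet loss, delay, and jitter to a rough quality score. The weighting constants are hypothetical and this is not the algorithm of any existing product, nor the method claimed in this application.

```python
# Illustrative sketch only: a proprietary mapping from RTP-derived impairments
# to a quality score on a 1-5 scale. All constants below are hypothetical.

def estimate_quality(packet_loss_rate: float, delay_ms: float, jitter_ms: float) -> float:
    """Return a rough quality score on a 1-5 scale from RTP statistics."""
    score = 4.5                                   # start near the best achievable score
    score -= 30.0 * packet_loss_rate              # heavy penalty for packet loss (0.0-1.0)
    score -= 0.005 * max(delay_ms - 150.0, 0.0)   # penalize delay beyond roughly 150 ms
    score -= 0.02 * jitter_ms                     # penalize jitter
    return max(1.0, min(4.5, score))

print(estimate_quality(packet_loss_rate=0.02, delay_ms=220, jitter_ms=15))
```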
According to the above two schemes, the standard MOS voice quality evaluation scheme requires a MOS box to be configured when evaluating voice quality. Because a MOS box is expensive to purchase and the standard MOS algorithm is expensive to license, the scheme has a high procurement cost, and the more terminal groups whose voice quality needs to be evaluated, the higher the cost. For example, if the voice quality of calls nationwide needs to be evaluated and each of 31 provinces fields one test team carrying two sets of MOS boxes, 62 MOS boxes must be purchased, which is very costly.
For the non-MOS voice quality evaluation scheme, although no MOS box is needed and the cost is low, a proprietary algorithm generally replaces the standard MOS algorithm. On the one hand, the industry standard is the MOS algorithm, so it is difficult to align a proprietary algorithm with the industry standard; on the other hand, the accuracy of such a proprietary algorithm has not been rigorously demonstrated, so the accuracy of the voice quality evaluation cannot be guaranteed.
Therefore, there is currently no voice quality assessment scheme that both guarantees the accuracy of voice assessment and has low cost.
Based on the above, this application provides a voice quality evaluation method in which the MOS algorithm model is integrated directly into the voice quality evaluation server; after receiving the recording information of calls collected by terminal equipment, the server inputs that recording information into the MOS algorithm model and calculates the voice quality value of each call. On the one hand, because the voice quality value is calculated with the MOS algorithm, the method meets the industry standard and the accuracy of voice quality evaluation is guaranteed; on the other hand, the cost of purchasing MOS boxes is eliminated, reducing the cost of voice quality evaluation.
In order to facilitate understanding of the implementation process of the scheme in the embodiment of the present application, a network architecture in the embodiment of the present application is first described. The voice quality evaluation method in the embodiment of the present application may be applied to the following network architecture.
It should be noted that the network architecture and scenario described here are intended to illustrate the technical solutions of the embodiments of the present application more clearly and do not constitute a limitation on them. A person of ordinary skill in the art will know that, as the network architecture evolves and new service scenarios emerge, the technical solutions provided in the embodiments of the present application remain applicable to similar architectures and scenarios.
As shown in fig. 3, a schematic structural diagram of a network architecture is provided. As shown in fig. 3, the voice quality assessment system 30 may include one or more calling devices 301, one or more called devices 302, and a voice quality assessment server 303. Wherein one or more calling devices 301 may communicate with a voice quality assessment server 303; one or more called devices 302 may communicate with a voice quality assessment server 303; calling device 301 may communicate with called device 302.
Specifically, the calling device 301 may also be referred to as a calling terminal or a calling terminal device. Calling device 301 may be configured to communicate with voice quality assessment server 303; the calling device 301 may also be used to communicate with the called device 302. For example, calling device 301 may receive an instruction sent by voice quality assessment server 303; the calling device 301 may also be used to talk to the called device 302. The calling device 301 may include, but is not limited to, a mobile phone (mobile phone), a tablet computer (tablet computer), a wearable device (such as a smart watch, a smart band), and other devices with voice call functions.
The called device 302, which may also be referred to as a called terminal, or called terminal device. Called device 302 may be used to communicate with voice quality assessment server 303; the called device 302 may also be used to communicate with the calling device 301. For example, the called device 302 may send the recording information of the call it collects to the voice quality assessment server 303; the called device 302 may also be used to talk to the calling device 301. The called device 302 may include, but is not limited to, a mobile phone (mobile phone), a tablet computer (tablet computer), a wearable device (such as a smart watch and a smart band), and other devices with voice communication functions.
It is to be understood that the calling device is only a functional description of the terminal device and should not constitute the only limitation to the terminal device. A terminal device may be either a calling device or a called device. For example, a terminal device may be a calling device at a first time and a called device at a second time.
And a voice quality evaluation server 303, which may be used for communicating with the calling device 301 and the called device 302. For example, the voice quality evaluation server 303 may be configured to receive log information of calls collected by the called device 302; the voice quality assessment server 303 may also be used to send instructions to the calling device 301. The voice quality evaluation server 303 may be various physical servers, or a cloud server.
It should be noted that, in the embodiment of the present application, the number, the connection mode, and the like of each device included in the network architecture are not specifically limited; the network architecture shown in fig. 3 is only an exemplary architecture diagram.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
In one aspect, an embodiment of the present application provides a speech quality assessment apparatus for executing the speech quality assessment method provided by the present application. The voice quality evaluation device may be the voice quality evaluation server 303 of fig. 3; alternatively, the voice quality evaluation apparatus may be deployed in the voice quality evaluation server 303 of fig. 3; alternatively, the voice quality evaluation apparatus may be another device that can exchange information with the voice quality evaluation server 303 of fig. 3.
Fig. 4 is a schematic structural diagram of a voice quality assessment server according to an embodiment of the present disclosure, and as shown in fig. 4, the voice quality assessment server 40 may include at least one processor 41, a memory 42, a communication interface 43, and a communication bus 44. The following specifically describes each component of the voice quality estimation server 40 with reference to fig. 4:
the processor 41 may be a single processor or may be a general term for a plurality of processing elements. For example, the processor 41 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application, such as one or more digital signal processors (DSPs) or one or more field-programmable gate arrays (FPGAs).
The processor 41 may perform various functions by running or executing software programs stored in the memory 42, and calling up data stored in the memory 42, among other things. In particular implementations, processor 41 may include one or more CPUs such as CPU0 and CPU1 shown in fig. 4 as one example.
In a specific implementation, as an embodiment, the voice quality assessment server 40 may include a plurality of processors, such as the processor 41 and the processor 45 shown in fig. 4. Each of these processors may be a single-core processor or a multi-core processor. A processor here may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).
The memory 42 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that may store static information and instructions, a Random Access Memory (RAM) or other type of dynamic storage device that may store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disk storage, optical disk storage (including compact disc, laser disc, optical disc, digital versatile disc, blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 42 may be self-contained and coupled to the processor 41 via a communication bus 44. The memory 42 may also be integrated with the processor 41. The memory 42 is used for storing software programs for executing the scheme of the application, and is controlled by the processor 41 to execute.
The communication interface 43 may be any device, such as a transceiver, for communicating with other devices or communication networks, such as an ethernet, a Radio Access Network (RAN), a Wireless Local Area Network (WLAN), etc.
The communication bus 44 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 4, but this does not indicate only one bus or one type of bus.
It is noted that the components shown in fig. 4 do not constitute a limitation on the voice quality assessment server; the voice quality assessment server may include more or fewer components than shown in fig. 4, combine some components, or have a different arrangement of components.
Specifically, the processor 41 performs the following functions by running or executing software programs and/or modules stored in the memory 42 and calling data stored in the memory 42:
receiving the recording information of M calls collected by terminal equipment, where the recording information of a call includes the call corpus of the call, a call connection timestamp, and a call release timestamp, and M is greater than or equal to 1; and inputting the recording information of the M calls into a MOS algorithm model to calculate the voice quality values of the M calls, where the MOS algorithm model supports parallel evaluation of the voice quality of N calls, and N is greater than or equal to 1.
On the other hand, the embodiment of the present application provides a voice quality assessment method, which can be applied to the voice quality assessment server 40 shown in fig. 4, for assessing the voice quality of a call.
The calls may include, but are not limited to, VoLTE calls, VoNR calls, or other calls.
It should be noted that the voice quality evaluation server 40 runs a MOS algorithm model, and the MOS algorithm model supports parallel evaluation of voice quality of N calls.
Wherein N is greater than or equal to 1.
The MOS algorithm model may include, but is not limited to, the perceptual objective listening quality analysis (POLQA) algorithm or the perceptual evaluation of speech quality (PESQ) algorithm.
Illustratively, the POLQA algorithm is installed in the voice quality evaluation server 40, and the MPOLQA algorithm is purchased and licensed to support parallel processing of N inputs; that is, the MPOLQA algorithm supports parallel evaluation of the voice quality of N calls.
Specifically, when the voice quality of calls is calculated with the voice quality evaluation method provided by this embodiment of the application, the voice quality evaluation server first instructs each terminal device to collect the recording information of calls; each terminal device then sends the collected call recording information to the voice quality evaluation server; and the voice quality evaluation server calculates the voice quality of the calls from the collected recording information.
The process in which the voice quality evaluation server instructs each terminal device to collect the recording information of calls is described in detail in S503 to S510 and is not repeated here.
As shown in fig. 5, the method may include:
s501, the voice quality evaluation server receives the recording information of the M calls collected by the terminal equipment.
Wherein M is greater than or equal to 1.
The recording information of a call may include: the call corpus of the call, a call connection timestamp, and a call release timestamp.
Specifically, the voice quality evaluation server receives the recording information of the M calls acquired by the terminal device through offline copy, network transmission or other transmission modes.
Wherein, a terminal device can collect the recording information of one or more calls.
In a possible implementation manner, when M is equal to 1, the recording information of M calls is collected by one terminal device.
In another possible implementation manner, when M is greater than 1, the recording information of M calls may be acquired by one terminal device, or may be acquired by multiple terminal devices.
S502, the voice quality evaluation server inputs the recorded information of the M calls into the MOS algorithm model, and the voice quality values of the M calls are calculated.
Among them, the implementation of S502 may include, but is not limited to, method 1 or method 2 described below.
Specifically, when M is less than or equal to N, the speech quality values of M calls may be calculated by using method 1; when M is greater than N, the speech quality values of M calls can be calculated by using method 2.
Method 1: the voice quality evaluation server inputs the recording information of the M calls into the MOS algorithm model that supports parallel evaluation of N calls. The MOS algorithm model processes the recording information of the M calls in parallel: for each call, it takes the portion of the call corpus between the connection timestamp and the release timestamp as the corpus to be tested, compares the corpus to be tested with the standard corpus, and calculates the voice quality value of the call, thereby obtaining the voice quality values of the M calls.
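The following Python sketch illustrates method 1 under stated assumptions: a placeholder mos_score() stands in for the licensed MOS algorithm model (for example POLQA), and the record fields and byte rate are invented for the example.

```python
# Minimal sketch of method 1 (M <= N): each call's recorded corpus is trimmed to
# the segment between the connection and release timestamps and compared against
# the standard corpus in parallel. mos_score(), the record fields, and the fixed
# byte rate are illustrative assumptions, not the licensed model's real interface.

from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class CallRecord:
    corpus: bytes           # recorded call audio, assumed to start at record_start_ts
    record_start_ts: float  # when recording began (assumed field), in seconds
    connect_ts: float       # call connection timestamp, in seconds
    release_ts: float       # call release timestamp, in seconds

def mos_score(test_corpus: bytes, reference_corpus: bytes) -> float:
    # Placeholder for the licensed MOS algorithm model; returns a dummy value.
    return 4.0

def corpus_under_test(record: CallRecord, bytes_per_second: int = 16000) -> bytes:
    # Keep only the audio between the connection and release timestamps.
    start = int((record.connect_ts - record.record_start_ts) * bytes_per_second)
    end = int((record.release_ts - record.record_start_ts) * bytes_per_second)
    return record.corpus[start:end]

def evaluate_parallel(records, standard_corpus: bytes, n_channels: int):
    # Evaluate up to n_channels records at the same time (the M <= N case).
    with ThreadPoolExecutor(max_workers=n_channels) as pool:
        futures = [pool.submit(mos_score, corpus_under_test(r), standard_corpus)
                   for r in records]
        return [f.result() for f in futures]
```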
Method 2: the voice quality evaluation server distributes the recording information of the M calls to N voice quality evaluation queues, inputs the recording information of the calls in the N voice quality evaluation queues to the MOS algorithm model in parallel, and calculates the voice quality values of the M calls.
In method 2, for a first voice quality evaluation queue, calculating the voice quality values of the calls in the first voice quality evaluation queue may be implemented as: serially inputting the recording information of the one or more calls in the first voice quality evaluation queue to the MOS algorithm model in a preset order, and calculating the voice quality values of those calls.
Wherein the first voice quality evaluation queue is any one of the N voice quality evaluation queues.
Specifically, for the first call in the queue, the MOS algorithm model takes the corpus between the connection timestamp and the release timestamp as the corpus to be tested, compares it with the standard corpus, and calculates the voice quality value of that call; it then obtains the voice quality value of the next call in the queue in the same way, polling in sequence until the voice quality value of every call in the first voice quality evaluation queue has been calculated.
The preset order may include, but is not limited to: descending order of call priority; ascending order of call time (earliest first); or another order.
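A minimal Python sketch of method 2, assuming round-robin distribution, a hypothetical "priority" field, and the same mos_score() stand-in as above:

```python
# Minimal sketch of method 2 (M > N): the M call records are distributed over N
# voice quality evaluation queues, each queue is drained serially in descending
# call priority, and the N queues run in parallel. mos_score() and the record
# fields ("priority", "corpus") are illustrative assumptions.

from concurrent.futures import ThreadPoolExecutor

def mos_score(test_corpus, reference_corpus):
    return 4.0  # placeholder for the licensed MOS algorithm model

def distribute(records, n_queues):
    # Simple round-robin distribution of M records over N queues.
    queues = [[] for _ in range(n_queues)]
    for i, record in enumerate(records):
        queues[i % n_queues].append(record)
    return queues

def drain_queue(queue, standard_corpus):
    # Preset order: highest call priority first; records within a queue are serial.
    ordered = sorted(queue, key=lambda r: r["priority"], reverse=True)
    return [mos_score(r["corpus"], standard_corpus) for r in ordered]

def evaluate_queued(records, standard_corpus, n_queues):
    queues = distribute(records, n_queues)
    with ThreadPoolExecutor(max_workers=n_queues) as pool:
        per_queue = pool.map(drain_queue, queues, [standard_corpus] * n_queues)
    return [score for queue_scores in per_queue for score in queue_scores]
```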
In the voice quality evaluation method described above, the MOS algorithm model is integrated directly into the voice quality evaluation server; after receiving the recording information of calls collected by terminal equipment, the server inputs the recording information into the MOS algorithm model and calculates the voice quality values of the calls. On the one hand, because the voice quality value is still calculated with the MOS algorithm model, the method meets the industry standard and the accuracy of voice quality evaluation is guaranteed; on the other hand, the cost of purchasing MOS boxes is eliminated, reducing the cost of voice quality evaluation.
Taking the acquisition of the recording information of a call between a first terminal device and a second terminal device as an example, the following describes the process in which the voice quality evaluation server instructs each terminal device to collect the recording information of calls.
As shown in fig. 6, the process may include S503 to S510 described below.
S503, the voice quality evaluation server configures standard linguistic data.
In one possible implementation, the speech quality assessment server configures the corpus input by the user as a standard corpus.
For example, the voice quality evaluation server receives a corpus that the user inputs via a USB flash drive and stores it as the standard corpus.
In another possible implementation manner, a plurality of corpora are prestored in the voice quality assessment server, and a standard corpus is selected based on a first operation of the user.
For example, a plurality of corpora are stored in the voice quality assessment server in advance, and the user selects a standard corpus by clicking an identifier or a file name of a certain corpus.
Optionally, the voice quality assessment server may also configure other parameters of the MOS algorithm model. Other parameters of the MOS algorithm model may include, but are not limited to, one or more of the following: bandwidth information, encoding mode.
Illustratively, the bandwidth information is configured as 12.2 kilohertz (kHz) according to the user's input, and the encoding mode is configured as adaptive multi-rate (AMR) coding.
It should be noted that, if other parameters of the MOS algorithm model are not configured, the default parameters of the system are used for evaluation calculation.
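As a sketch of this configuration step, assuming hypothetical parameter names and default values (not an actual POLQA interface or the server's real configuration API):

```python
# Sketch of the parameter configuration in S503: the standard corpus is required,
# while bandwidth information and the encoding mode fall back to system defaults
# when not configured. Parameter names and defaults are illustrative assumptions.

SYSTEM_DEFAULTS = {
    "bandwidth": "narrowband",  # used when no bandwidth information is configured
    "codec": "AMR",             # used when no encoding mode is configured
}

def build_mos_config(standard_corpus_path, bandwidth=None, codec=None):
    config = dict(SYSTEM_DEFAULTS)
    if bandwidth is not None:
        config["bandwidth"] = bandwidth
    if codec is not None:
        config["codec"] = codec
    config["standard_corpus"] = standard_corpus_path
    return config

# Mirroring the example in the text: 12.2 kHz bandwidth information, AMR coding.
config = build_mos_config("corpus/standard.wav", bandwidth="12.2 kHz", codec="AMR")
print(config)
```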
S504, the voice quality evaluation server sends a first instruction to the first terminal device.
The first terminal device is any calling device.
In one possible implementation manner, S504 may be implemented as: the voice quality evaluation server sends a first instruction to the first terminal device to indicate the first terminal device to initiate a call to the second terminal device.
It should be noted that when the voice quality evaluation server and the first terminal device have negotiated and stored the standard corpus in advance, in other words, when the first terminal device can determine the standard corpus, the voice quality evaluation server sends only the first instruction to the first terminal device.
And S505, the first terminal equipment receives a first instruction sent by the voice quality evaluation server.
In S505, the first instruction received by the first terminal device is the first instruction sent by the voice quality assessment server in S504.
Optionally, when the first terminal cannot determine the standard corpus, the method for evaluating speech quality according to the embodiment of the present application may further include S504A and S505A.
S504A, the voice quality evaluation server sends the standard corpus, or the identifier of the standard corpus, to the first terminal device so that the first terminal device can determine the standard corpus.
The execution timing of S504A may be configured according to actual requirements, which is not limited in this embodiment of the present application.
For example, S504A may be performed after S504, before S504, or simultaneously with S504.
S504A may be implemented as: the voice quality evaluation server sends the standard corpus to the first terminal device in a network transmission or offline copy mode, or sends the identifier of the standard corpus in a network transmission or offline copy mode, so that the first terminal device determines the standard corpus.
S505A, the first terminal device receives the standard corpus or the identifier of the standard corpus sent by the voice quality evaluation server.
The content received by the first terminal device in S505A is the content transmitted by the voice quality assessment server in S504A.
S506, the first terminal device initiates a call to the second terminal.
Specifically, the first terminal device initiates a call to the second terminal according to the instruction of the first instruction.
Illustratively, the first terminal device dials the call number of the second terminal device according to the indication of the first instruction and initiates the call.
And S507, the second terminal equipment connects the call initiated by the first terminal equipment.
Illustratively, the user clicks or slides the relevant position in the second terminal device, and the second terminal device responds to the clicking or sliding operation of the user to connect the call initiated by the first terminal device.
And S508, the first terminal equipment transmits the standard corpus to the second terminal equipment through conversation.
Specifically, the first terminal device encodes the standard corpus, loads the encoded data as the payload into the content portion of an RTP packet, encapsulates it with the corresponding RTP packet header, and transmits the RTP packet to the second terminal device.
The specific encoding method may be configured according to actual requirements, which is not specifically limited in the embodiments of the present application. For example, the encoding method may be AMR.
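A minimal sketch of this encapsulation step, assuming a dynamic payload type and omitting the AMR payload header defined in RFC 4867:

```python
# Minimal sketch of the transmission step in S508: each encoded corpus frame is
# carried as the payload of an RTP packet with the fixed 12-byte RTP header
# (RFC 3550). The payload type, SSRC, and frame timing are assumptions.

import struct

def rtp_packet(payload: bytes, seq: int, timestamp: int, ssrc: int,
               payload_type: int = 96) -> bytes:
    header = struct.pack(
        "!BBHII",
        0x80,                 # version 2, no padding, no extension, no CSRC
        payload_type & 0x7F,  # marker bit 0, dynamic payload type
        seq & 0xFFFF,         # sequence number
        timestamp & 0xFFFFFFFF,
        ssrc,
    )
    return header + payload

def packetize(encoded_frames, ssrc=0x1234ABCD, samples_per_frame=160):
    # 160 samples per frame corresponds to 20 ms at an 8 kHz sampling rate.
    return [rtp_packet(frame, seq, seq * samples_per_frame, ssrc)
            for seq, frame in enumerate(encoded_frames)]
```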
And S509, the second terminal device collects and stores the call record information.
Specifically, the second terminal device collects all RTP frames sent by the first terminal device as the received corpus, collects the call connection timestamp and the call release timestamp (hang-up timestamp), and saves them in the second terminal device as the recording information of the call.
Illustratively, the recording information of the call may be stored in the form of table 1.
TABLE 1
(Table 1 is reproduced as an image in the original publication and is not shown here.)
It should be noted that table 1 illustrates an exemplary form of saving the call log information, and should not be construed as a unique limitation to the saving form.
Optionally, the saved recording information of the call may further include the priority of the call.
Illustratively, the log information of the call may also be stored in the form of table 2.
TABLE 2
(Table 2 is reproduced as an image in the original publication and is not shown here.)
It should be noted that table 2 illustrates an exemplary form of saving the call log information, and should not be construed as a unique limitation to the saving form.
Illustratively, the recording information of the call may be saved in the form of a log file.
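Since the tables are reproduced only as images, the following sketch shows what one saved record might look like as a log-file entry; all field names and values are assumed for illustration and are not the patent's actual format.

```python
# Hypothetical log-file record built from the fields named in the text: the
# received corpus (collected RTP frames), the call connection timestamp, the
# call release timestamp, and the optional call priority. All names are assumed.

import json

record = {
    "call_id": "call-0001",                   # hypothetical identifier
    "received_corpus_file": "call-0001.rtp",  # collected full RTP frames
    "call_connect_timestamp": "2020-12-04T10:15:30Z",
    "call_release_timestamp": "2020-12-04T10:17:00Z",
    "priority": 2,                            # optional, see table 2
}

with open("call-0001.log", "w", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```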
And S510, the second terminal equipment sends the recording information of the call to the voice quality evaluation server.
Specifically, the second terminal device may send the recording information of the call to the voice quality assessment server through network transmission, offline copy, or other manners.
It can be understood that, when the voice quality of multiple calls needs to be evaluated, S503 to S510 are executed multiple times to obtain the recorded information of the multiple calls, and then the recorded information of the multiple calls is sent to the voice quality evaluation server in a network transmission or offline copy manner, so that the voice quality evaluation server calculates the voice quality values of the multiple calls.
The speech quality assessment method provided by the present application is briefly described below by way of specific embodiments.
Assume that the voice quality of calls in 31 provincial capital cities needs to be tested and that each province fields one test team. The 31 test teams obtain 31 log files using the method of S503 to S510, each log file containing the recording information of one call. The 31 test teams then copy the 31 log files to the voice quality evaluation server for voice quality evaluation and analysis.
The voice quality evaluation server supports parallel evaluation of the voice quality of 5 calls, that is, parallel input of the recording information of 5 calls.
Specifically, the voice quality evaluation server allocates the recording information of the 31 calls in the 31 log files to 5 voice quality evaluation queues. Voice quality evaluation queue 1 includes the recording information of calls 1 to 7; voice quality evaluation queue 2 includes the recording information of calls 8 to 13; voice quality evaluation queue 3 includes the recording information of calls 14 to 19; voice quality evaluation queue 4 includes the recording information of calls 20 to 25; and voice quality evaluation queue 5 includes the recording information of calls 26 to 31.
The voice quality evaluation server inputs, in parallel, the recording information of call 1 into the input channel of the MOS algorithm model corresponding to voice quality evaluation queue 1; the recording information of call 8 into the input channel corresponding to voice quality evaluation queue 2; the recording information of call 14 into the input channel corresponding to voice quality evaluation queue 3; the recording information of call 20 into the input channel corresponding to voice quality evaluation queue 4; and the recording information of call 26 into the input channel corresponding to voice quality evaluation queue 5. The MOS algorithm model then computes the voice quality values of call 1, call 8, call 14, call 20, and call 26 in parallel.
The voice quality evaluation server then inputs the recording information of the next call in each of the 5 voice quality evaluation queues into the 5 input channels of the MOS algorithm model; through parallel calculation and sequential polling, the voice quality values of all 31 calls are obtained, and finally the average of the voice quality values of the calls across the 31 provinces nationwide is output.
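The following sketch walks through this allocation and polling arithmetic under the same assumptions as the earlier sketches; the fixed score of 4.0 is only a stand-in for real MOS results.

```python
# Worked sketch of the 31-call example: the records are split over 5 queues (the
# first queue taking the extra record, matching the allocation in the text), the
# queues are polled in parallel, and the nationwide average is computed.

def split_contiguous(records, n_queues):
    base, extra = divmod(len(records), n_queues)  # 31 records, 5 queues -> 6 remainder 1
    queues, start = [], 0
    for i in range(n_queues):
        size = base + (1 if i < extra else 0)     # queue sizes: 7, 6, 6, 6, 6
        queues.append(records[start:start + size])
        start += size
    return queues

calls = [f"call {i}" for i in range(1, 32)]
queues = split_contiguous(calls, 5)
print([len(q) for q in queues])                   # [7, 6, 6, 6, 6]

# Polling round k feeds the k-th record of each non-empty queue to the model, so
# all 31 calls finish in max(len(q)) = 7 rounds of up to 5 parallel evaluations.
print(max(len(q) for q in queues))                # 7

scores = {name: 4.0 for name in calls}            # stand-in MOS results
print(round(sum(scores.values()) / len(scores), 2))  # nationwide average
```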
The solutions provided by the embodiments of the present invention have been described above mainly from the perspective of the interaction between the voice quality evaluation server and the terminal equipment in the network. It can be understood that, to implement the above functions, the voice quality evaluation server includes corresponding hardware structures and/or software modules for performing the respective functions. Those skilled in the art will readily appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementations should not be considered beyond the scope of the present invention.
In the embodiment of the present invention, the voice quality assessment apparatus and the like may be divided into functional modules according to the above method examples, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. It should be noted that, the division of the modules in the embodiment of the present invention is schematic, and is only a logic function division, and there may be another division manner in actual implementation.
In the case where functional modules are divided corresponding to respective functions, fig. 7 shows a voice quality evaluation apparatus 70 provided in an embodiment of the present application, which is used to implement the functions of the voice quality evaluation server in the foregoing embodiments. The voice quality evaluation apparatus 70 may be a voice quality evaluation server, or the voice quality evaluation apparatus 70 may be deployed in a voice quality evaluation server. As shown in fig. 7, the voice quality evaluation apparatus 70 may include a receiving unit 701 and a processing unit 702. The receiving unit 701 is configured to perform S501 in fig. 5 or fig. 6, and the processing unit 702 is configured to perform S502 in fig. 5 or fig. 6. For the relevant content of the steps involved in the above method embodiment, reference may be made to the functional description of the corresponding functional module, and details are not repeated here.
In the case of using an integrated unit, fig. 8 shows a voice quality assessment server 80 provided in the embodiment of the present application, which is used to implement the function of the voice quality assessment server in the above method. The voice quality evaluation server 80 may include at least one processing module 801 for implementing the functions of the voice quality evaluation server in the embodiment of the present application. For example, the processing module 801 may be configured to execute the process S502 in fig. 5 or fig. 6, refer to the detailed description in the method example specifically, and are not described herein again.
The voice quality assessment server 80 may also include at least one memory module 802 for storing program instructions and/or data. The memory module 802 is coupled to the processing module 801. The coupling in the embodiments of the present application is an indirect coupling or a communication connection between devices, units or modules, which may be electrical, mechanical or in another form, and is used for information exchange between the devices, units or modules. The processing module 801 may operate in cooperation with the memory module 802 and may execute the program instructions stored in the memory module 802. At least one of the at least one memory module may be included in the processing module.
The voice quality assessment server 80 may also include a communication module 803 for communicating with other devices over a transmission medium, so that the voice quality assessment server 80 can communicate with other devices. For example, the processing module 801 may perform the process S501 in fig. 5 or fig. 6 by using the communication module 803.
In practical implementation, the receiving unit 701 and the processing unit 702 may be implemented by the processor 41 shown in fig. 4 calling the program code in the memory 42, or by the processor 41 shown in fig. 4 through the communication interface 43. For the specific implementation process, reference may be made to the description of the voice quality assessment method shown in fig. 5 or fig. 6, which is not repeated here.
As described above, the voice quality assessment apparatus 70 or the voice quality assessment server 80 provided in the embodiments of the present application may be used to implement the functions of the voice quality assessment server in the methods of the embodiments of the present application. For convenience of description, only the parts related to the embodiments of the present application are shown; for technical details that are not disclosed, reference may be made to the method embodiments of the present application.
Other embodiments of the present application provide a voice quality assessment system. The system may include a voice quality assessment apparatus and a terminal device, where the voice quality assessment apparatus implements the functions of the voice quality assessment server in the foregoing embodiments; for example, the voice quality assessment apparatus may be the voice quality assessment server described in the embodiments of the present application.
Other embodiments of the present application provide a chip system. The chip system includes a processor, may further include a memory, and is configured to implement the functions of the voice quality assessment server in the embodiments shown in fig. 5 or fig. 6. The chip system may be formed by a chip, or may include a chip and other discrete devices.
Still other embodiments of the present application provide a computer-readable storage medium, which may include a computer program that, when executed on a computer, causes the computer to perform the steps performed by the voice quality assessment server in the embodiments of fig. 5 or fig. 6.
Further embodiments of the present application provide a computer program product including a computer program that, when run on a computer, causes the computer to perform the steps performed by the voice quality assessment server in the embodiments of fig. 5 or fig. 6.
Through the above description of the embodiments, it is clear to those skilled in the art that the foregoing division of functional modules is merely an example given for convenience and simplicity of description. In practical applications, the above functions may be allocated to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative. The division of the modules or units is only a logical function division, and there may be other divisions in actual implementation; for example, a plurality of units or components may be combined or integrated into another device, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and the parts shown as units may be one physical unit or a plurality of physical units; that is, they may be located in one place or distributed in a plurality of different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for enabling a device (such as a single-chip microcomputer or a chip) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above description is only an embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present disclosure should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (11)

1. A voice quality assessment method, applied to a voice quality assessment server, the method comprising:
receiving recording information of M calls acquired by a terminal device; the recording information of a call comprises: a call corpus of the call, a call connection timestamp, and a call release timestamp; and M is greater than or equal to 1;
inputting the recording information of the M calls into a mean opinion score (MOS) algorithm model, and calculating the voice quality values of the M calls; the MOS algorithm model supports parallel calculation of the voice quality values of N calls; and N is greater than or equal to 1.
2. The method of claim 1, wherein, if M is greater than N and N is greater than 1, inputting the recording information of the M calls into the MOS algorithm model and calculating the voice quality values of the M calls comprises:
distributing the recording information of the M calls to N voice quality evaluation queues;
and inputting the recording information of the calls included in the N voice quality evaluation queues to the MOS algorithm model in parallel, and calculating the voice quality values of the M calls.
3. The method according to claim 2, wherein inputting the recording information of the calls included in the N voice quality evaluation queues to the MOS algorithm model in parallel, and calculating the voice quality values of the M calls, comprises:
for a first voice quality evaluation queue, serially inputting the recording information of one or more calls in the first voice quality evaluation queue to the MOS algorithm model in a preset order, and calculating the voice quality values of the one or more calls in the first voice quality evaluation queue; the first voice quality evaluation queue is any one of the N voice quality evaluation queues.
4. The method of claim 3, wherein the preset order comprises: an order of call priority from high to low.
5. A voice quality assessment apparatus deployed in a voice quality assessment server, the apparatus comprising:
a receiving unit, configured to receive recording information of M calls acquired by a terminal device; the recording information of a call comprises: a call corpus of the call, a call connection timestamp, and a call release timestamp; and M is greater than or equal to 1;
a processing unit, configured to input the recording information of the M calls into a mean opinion score (MOS) algorithm model and calculate the voice quality values of the M calls; the MOS algorithm model supports parallel calculation of the voice quality values of N calls; and N is greater than or equal to 1.
6. The apparatus of claim 5, wherein, if M is greater than N and N is greater than 1, the processing unit is specifically configured to:
distribute the recording information of the M calls to N voice quality evaluation queues; and
input the recording information of the calls included in the N voice quality evaluation queues to the MOS algorithm model in parallel, and calculate the voice quality values of the M calls.
7. The apparatus according to claim 6, wherein inputting the recording information of the calls included in the N voice quality evaluation queues to the MOS algorithm model in parallel, and calculating the voice quality values of the M calls, comprises:
for a first voice quality evaluation queue, serially inputting the recording information of one or more calls in the first voice quality evaluation queue to the MOS algorithm model in a preset order, and calculating the voice quality values of the one or more calls in the first voice quality evaluation queue; the first voice quality evaluation queue is any one of the N voice quality evaluation queues.
8. The apparatus of claim 7, wherein the preset order comprises: an order of call priority from high to low.
9. A voice quality evaluation server, comprising: a processor and a memory, wherein the processor is coupled to the memory, and the memory is configured to store computer program code comprising computer instructions which, when executed by the voice quality evaluation server, cause the voice quality evaluation server to perform the voice quality assessment method of any one of claims 1 to 4.
10. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the voice quality assessment method of any one of claims 1 to 4.
11. A voice quality evaluation system, comprising a voice quality evaluation server and a terminal device, wherein the voice quality evaluation server is configured to perform the voice quality assessment method according to any one of claims 1 to 4.
CN202011400170.0A 2020-12-01 2020-12-01 Voice quality assessment method, device and system Active CN112509603B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011400170.0A CN112509603B (en) 2020-12-01 2020-12-01 Voice quality assessment method, device and system


Publications (2)

Publication Number Publication Date
CN112509603A true CN112509603A (en) 2021-03-16
CN112509603B CN112509603B (en) 2023-08-08

Family

ID=74969721

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011400170.0A Active CN112509603B (en) 2020-12-01 2020-12-01 Voice quality assessment method, device and system

Country Status (1)

Country Link
CN (1) CN112509603B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102438266A (en) * 2011-01-12 2012-05-02 北京炎强通信技术有限公司 Method and device for optimizing voice quality of mobile communication network
CN106162700A (en) * 2015-04-27 2016-11-23 ***通信集团公司 Method of testing, device and the terminal of a kind of terminal speech quality
CN106157975A (en) * 2015-04-27 2016-11-23 ***通信集团公司 A kind of determine the system of voice quality, method and device
CN106304180A (en) * 2016-08-15 2017-01-04 中国联合网络通信集团有限公司 A kind of method and device of the speech service quality determining user
CN106714226A (en) * 2015-11-13 2017-05-24 ***通信集团公司 Voice quality evaluation method, device and system
CN106952657A (en) * 2016-01-06 2017-07-14 ***通信集团公司 A kind of speech quality detection method and device
CN108269589A (en) * 2016-12-31 2018-07-10 ***通信集团贵州有限公司 For the speech quality assessment method and its device of call
CN108389592A (en) * 2018-02-27 2018-08-10 上海讯飞瑞元信息技术有限公司 A kind of voice quality assessment method and device
CN108632852A (en) * 2017-03-23 2018-10-09 上海大唐移动通信设备有限公司 A kind of determination method and apparatus of voice quality
CN110401965A (en) * 2018-04-24 2019-11-01 ***通信集团广东有限公司 VoLTE voice quality testing analysis method and system
CN110401622A (en) * 2018-04-25 2019-11-01 ***通信有限公司研究院 A kind of speech quality assessment method, device, electronic equipment and storage medium


Also Published As

Publication number Publication date
CN112509603B (en) 2023-08-08

Similar Documents

Publication Publication Date Title
CN106549878B (en) Service distribution method and device
US10057182B2 (en) Method for providing development and deployment services using a cloud-based platform and devices thereof
CN109451002B (en) CDN node selection method and equipment
CN109547286B (en) CDN node selection method, device and storage medium
CN108874823A (en) The implementation method and device of intelligent customer service
EP3624453A1 (en) A transcoding task allocation method, scheduling device and transcoding device
CN106390451B (en) Method and device for testing capacity of game server
CN114389975B (en) Network bandwidth estimation method, device and system, electronic equipment and storage medium
CN109828843A (en) Method, system and the electronic equipment that data are transmitted between a kind of calculate node
CN109189571A (en) Calculating task dispatching method and system, fringe node, storage medium and terminal
CN107635010A (en) Traffic scheduling method, device, computer-readable recording medium and electronic equipment
JP2018093466A (en) Telephone network and two-way audio generation device for voice quality evaluation of voice codec
CN112600878A (en) Data transmission method and device
CN103685388B (en) Method and device for information management
CN105306553A (en) Access request scheduling method and device
CN105471770B (en) A kind of message processing method and device based on multi-core processor
CN112509603A (en) Voice quality assessment method, device and system
CN108882291A (en) Speech quality assessment method and device
CN116456496B (en) Resource scheduling method, storage medium and electronic equipment
CN107967921A (en) The volume adjusting method and device of conference system
CN112311899A (en) Session processing method, device and system
CN110491386A (en) A kind of method, apparatus and computer readable storage medium generating meeting summary
CN112738815B (en) Method and device for evaluating number of accessible users
WO2016202138A1 (en) Voice quality evaluation method and device
CN101133669A (en) Large scale measurement of subjective quality in mobile communications systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant