CN118098291A - Automatic ending recording identification method and device, electronic equipment and storage medium - Google Patents

Automatic ending recording identification method and device, electronic equipment and storage medium

Info

Publication number
CN118098291A
CN118098291A · CN202311765815.4A
Authority
CN
China
Prior art keywords
recognition engine
audio signal
determining
engine
speech recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311765815.4A
Other languages
Chinese (zh)
Inventor
Li Xiangrui (李祥锐)
Chen Zhibo (陈志波)
Li Xiaomin (李晓敏)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202311765815.4A priority Critical patent/CN118098291A/en
Publication of CN118098291A publication Critical patent/CN118098291A/en
Pending legal-status Critical Current

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The disclosure provides an automatic end recording recognition method and device, an electronic device, and a readable storage medium, relating to technical fields such as speech recognition, end-signal recognition, and human-machine interaction. The method comprises the following steps: acquiring a continuously received audio signal stream; determining the current network condition, the language type corresponding to the audio signal stream, and the selection priority corresponding to each candidate speech recognition engine, the selection priority being determined based on historical recognition speed and historical recognition accuracy; determining a target speech recognition engine adapted to the audio signal stream according to the network condition, the language type, and the selection priority of each speech recognition engine; and continuously recognizing the audio signals of the audio signal stream with the target speech recognition engine until an end signal is recognized, determining the audio signals preceding the end signal as the speech information to be recognized. The method can select the speech recognition engine best suited, in the current scene, to the received audio signal stream for automatic end recording recognition.

Description

Automatic ending recording identification method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of signal processing, in particular to fields such as speech recognition, end-signal recognition, and human-machine interaction, and specifically to an automatic end recording recognition method and device, an electronic device, a computer-readable storage medium, and a computer program product.
Background
Automatic end recording recognition is a speech recognition technique that automatically detects the end of a speech input during recording and then performs recognition and transcription. In automatic end recording recognition, the system determines the end of the voice input based on characteristics of the voice signal and a voice activity detection algorithm. Typically, the system analyzes silence segments or pauses in the speech signal to determine the end position of the recording. Once the system detects a suitable pause or silence, it automatically stops the recording and recognizes and transcribes it. The technology is widely applied in fields such as voice transcription, voice recognition software, and voice recording applications. It automates the recording-and-recognition workflow and improves working efficiency and accuracy: the user does not need to stop recording manually, and the system automatically recognizes the end of the recording and generates the corresponding text result.
Disclosure of Invention
The embodiment of the disclosure provides a method, a device, electronic equipment, a computer readable storage medium and a computer program product for automatically finishing recording identification.
In a first aspect, an embodiment of the present disclosure provides an automatic end recording recognition method, including: acquiring a continuously received audio signal stream; determining the current network condition, the language type corresponding to the audio signal stream, and the selection priority corresponding to each candidate speech recognition engine, the selection priority being determined based on historical recognition speed and historical recognition accuracy; determining a target speech recognition engine adapted to the audio signal stream according to the network condition, the language type, and the selection priority of each speech recognition engine; and continuously recognizing the audio signals of the audio signal stream with the target speech recognition engine until an end signal is recognized, and determining the audio signals preceding the end signal as the speech information to be recognized.
In a second aspect, an embodiment of the present disclosure provides an automatic end recording recognition device, including: an audio signal stream continuous receiving unit configured to acquire a continuously received audio signal stream; a parameter determining unit configured to determine the current network condition, the language type corresponding to the audio signal stream, and the selection priority corresponding to each candidate speech recognition engine, the selection priority being determined based on historical recognition speed and historical recognition accuracy; a target speech recognition engine determining unit configured to determine a target speech recognition engine adapted to the audio signal stream according to the network condition, the language type, and the selection priority of each speech recognition engine; and an end signal recognition and to-be-recognized voice information determining unit configured to continuously recognize the audio signals of the audio signal stream with the target speech recognition engine until an end signal is recognized, and determine the audio signals preceding the end signal as the speech information to be recognized.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to implement the automatic end recording identification method as described in any one of the implementations of the first aspect when executed.
In a fourth aspect, embodiments of the present disclosure provide a non-transitory computer-readable storage medium storing computer instructions for enabling a computer to implement an automatic end recording identification method as described in any one of the implementations of the first aspect when executed.
In a fifth aspect, embodiments of the present disclosure provide a computer program product comprising a computer program which, when executed by a processor, is capable of implementing the steps of the automatic end recording identification method as described in any of the implementations of the first aspect.
According to the automatic end recording recognition scheme provided by this embodiment, for a continuously received audio signal stream, the network condition, language type, and selection priority parameters that influence the choice among multiple candidate speech recognition engines are determined, a target speech recognition engine adapted to the currently received audio signal stream is determined from these parameters, and the received audio signals are then continuously recognized by the target speech recognition engine until an end signal is recognized, so that all audio signals preceding the end signal are determined to be the complete speech information to be recognized. Because the target speech recognition engine best matching the received audio signal stream in the current scene is selected based on multiple parameters, automatic end recording recognition is completed faster and more accurately, more accurate speech content is obtained sooner, and the human-machine interaction experience is improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
Other features, objects and advantages of the present disclosure will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings:
FIG. 1 is an exemplary system architecture in which the present disclosure may be applied;
fig. 2 is a flowchart of a method for automatically ending recording identification according to an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of a speech recognition framework according to an embodiment of the present disclosure;
FIG. 4 is a flow chart of a method of determining a selection priority of a speech recognition engine provided by an embodiment of the present disclosure;
FIGS. 5-8 are flowcharts of different implementations provided by embodiments of the present disclosure for determining a target speech recognition engine, respectively;
Fig. 9 is a block diagram of a configuration of an automatic-ending recording recognition device according to an embodiment of the present disclosure;
fig. 10 is a schematic structural diagram of an electronic device adapted to perform an automatic-ending recording recognition method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness. It should be noted that, without conflict, the embodiments of the present disclosure and features of the embodiments may be combined with each other.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of users' personal information comply with the relevant laws and regulations and do not violate public order and good morals.
FIG. 1 illustrates an exemplary system architecture 100 in which embodiments of the automatic end recording identification methods, apparatus, electronic devices, and computer readable storage media of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103 for providing human-machine interaction services to users, a network 104, and a server 105 as back-end service support. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may perform man-machine interaction using the terminal devices 101, 102, 103, or may perform man-machine interaction with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or transmit a message (e.g., a voice message), or the like. Various applications for implementing information communication between the terminal devices 101, 102, 103 and the server 105, such as a voice recognition type application, an automatic ending recording recognition type application, an instant messaging type application, and the like, may be installed on the terminal devices.
The terminal devices 101, 102, 103 and the server 105 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices with display screens, including but not limited to smartphones, tablets, laptop and desktop computers, etc.; when the terminal devices 101, 102, 103 are software, they may be installed in the above-listed electronic devices, which may be implemented as a plurality of software or software modules, or may be implemented as a single software or software module, which is not particularly limited herein. When the server 105 is hardware, it may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server; when the server is software, the server may be implemented as a plurality of software or software modules, or may be implemented as a single software or software module, which is not particularly limited herein.
The terminal devices 101, 102, 103 can provide various services through various built-in applications, taking an automatic ending recording recognition type application capable of providing man-machine interaction services as an example, the terminal devices 101, 102, 103 can realize the following effects when running the automatic ending recording recognition type application: first, an input audio signal stream is continuously received through a pickup or microphone provided on the terminal apparatus 101, 102, 103; then, determining the current network condition, the language type corresponding to the audio signal flow and the alternative selection priority corresponding to each voice recognition engine, wherein the selection priority is determined based on the historical recognition speed and the historical recognition accuracy; next, determining a target speech recognition engine adapted to the audio signal stream according to the network condition, the language type and the selection priority of each speech recognition engine; finally, the target voice recognition engine is utilized to continuously recognize the audio signal of the audio signal stream until an end signal is recognized, and the audio signal before the end signal is determined as voice information to be recognized.
Further, when a speech recognition engine that can only provide an online speech recognition service is installed or configured on the terminal devices 101, 102, 103, and that engine is determined as the target speech recognition engine, the terminal also needs to exchange data through the network 104 with the server 105 serving as the service support for that engine.
That is, the automatic-ending recording identifying method provided in the embodiments of the present disclosure may be independently executed by the terminal devices 101, 102, 103, or may be executed by the terminal devices 101, 102, 103 in combination with the network 104 and the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring to fig. 2, fig. 2 is a flowchart of an automatic ending recording identification method according to an embodiment of the disclosure, wherein the flowchart 200 includes the following steps:
Step 201: acquiring a continuously received audio signal stream;
This step aims at having the execution subject of the automatic end recording recognition method (for example, the terminal devices 101, 102, 103 shown in fig. 1) continuously receive, through a sound receiving component provided on the terminal device (for example, a sound pickup, a microphone array, etc.), the audio signal stream input by the terminal device's user.
The audio signal stream may be speech that the user addresses directly to the terminal device, or speech uttered while communicating with others that the terminal device picks up, provided that the speaker has authorized the collection of the speech at that time. In addition, the beginning of the audio signal stream may contain a specific wake-up word or instruction used to bring the terminal device into the audio-signal-receiving state.
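As a rough sketch of how step 201 could be realized on iOS (the disclosure later names AVAudioEngine as the native microphone-input API; handleIncomingAudio is a hypothetical downstream handler, not part of the disclosure), a tap on the microphone input node delivers the stream buffer by buffer:

```swift
import AVFoundation

// Hypothetical downstream handler: forwards each buffer to recognition,
// volume detection, and pause detection.
func handleIncomingAudio(_ buffer: AVAudioPCMBuffer) {
    // ... dispatch to the modules described later ...
}

let audioEngine = AVAudioEngine()
let inputNode = audioEngine.inputNode
let format = inputNode.outputFormat(forBus: 0)

// Each callback delivers one chunk of the continuously received stream.
inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
    handleIncomingAudio(buffer)
}

audioEngine.prepare()
do {
    try audioEngine.start()
} catch {
    print("Audio engine failed to start: \(error)")
}
```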
Step 202: determining the current network condition, the language type corresponding to the audio signal stream, and the selection priority corresponding to each candidate speech recognition engine;
On the basis of step 201, this step aims at having the execution subject determine the network condition of the environment in which the terminal device is currently located, the language type of the voice information in the received audio signal stream, and the selection priority of each speech recognition engine that is installed and configured on the terminal device or callable by it, where the selection priority is determined based on historical recognition speed and historical recognition accuracy. The historical recognition speed refers to how quickly a speech recognition engine recognized voice content or end signals in historical audio signal streams; it can be understood as the time consumed to produce a recognition result, and reflects the recognition capability of different engines at the speed level. The historical recognition accuracy refers to how accurately a speech recognition engine recognized the content of historical audio signal streams; it can be derived from whether a recognition result was modified by the user and from the number of such modifications, and reflects the recognition capability of different engines at the accuracy level.
Of course, besides the recognition capability of the different speech recognition engines at the speed and accuracy levels, other influencing factors may be introduced as needed when determining the selection priorities in an actual application scenario, such as volatility (or stability), the usage preferences of a specific user, and whether an engine has additional capabilities such as optimizing or adjusting recognition results for dialects; these are not listed exhaustively here.
Step 203: determining a target voice recognition engine adapted to the audio signal stream according to the network condition, the language type and the selection priority of each voice recognition engine;
On the basis of step 202, this step aims at jointly determining, by the above-mentioned executing body, a target speech recognition engine adapted to the audio signal stream according to at least one of the parameters of network conditions, language type and selection priority of the respective speech recognition engines. Numerous different specific details are set forth in the examples below to provide a thorough understanding of the specific implementations of the present steps.
Step 204: the audio signal of the audio signal stream is continuously recognized by the target speech recognition engine until the end signal is recognized, and the audio signal located before the end signal is determined as the speech information to be recognized.
On the basis of step 203, this step aims at continuously recognizing the audio signal of the audio signal stream by the above-mentioned execution subject using the target speech recognition engine until the end signal is recognized, and determining all the audio signals located before the end signal as complete speech information to be recognized.
Specifically, considering that each candidate speech recognition engine is always in a to-be-activated state, and in order to prevent other speech recognition engines from being activated by mistake and responding undesirably to the audio signal stream, the audio signal stream may be continuously sent to the target speech recognition engine for continuous recognition while a preset white noise signal stream is continuously sent to the other speech recognition engines, so that the other engines can only read white noise and cannot produce a response that conflicts with the target speech recognition engine.
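A minimal sketch of this masking strategy under the iOS assumption used elsewhere in this disclosure (SFSpeechAudioBufferRecognitionRequest is the native buffer-based request type; the per-engine request map and engine name are illustrative):

```swift
import AVFoundation
import Speech

// Hypothetical bookkeeping: one buffer-based request per candidate engine.
var requests: [String: SFSpeechAudioBufferRecognitionRequest] = [:]
let targetEngineName = "engineA" // chosen in step 203

// Build a low-amplitude white-noise buffer matching the stream format.
func whiteNoiseBuffer(format: AVAudioFormat, frames: AVAudioFrameCount) -> AVAudioPCMBuffer? {
    guard let buffer = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: frames),
          let channel = buffer.floatChannelData?[0] else { return nil }
    buffer.frameLength = frames
    for i in 0..<Int(frames) { channel[i] = Float.random(in: -0.01...0.01) }
    return buffer
}

// For each incoming buffer: real audio to the target engine, noise to the rest.
func dispatch(_ buffer: AVAudioPCMBuffer) {
    for (name, request) in requests {
        if name == targetEngineName {
            request.append(buffer)
        } else if let noise = whiteNoiseBuffer(format: buffer.format,
                                               frames: buffer.frameLength) {
            request.append(noise)
        }
    }
}
```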
Further, after the complete voice information to be processed has been determined, the execution subject may perform voice content recognition on the voice information to be recognized using the target speech recognition engine to obtain a complete voice recognition result, and then display that result on a display screen and/or broadcast it by voice to convey it to the user of the execution subject, thereby implementing human-machine interaction.
According to the automatic end recording recognition method provided by the embodiment of the disclosure, for a continuously received audio signal stream, the network condition, language type, and selection priority parameters that influence the choice among multiple candidate speech recognition engines are determined, a target speech recognition engine adapted to the currently received audio signal stream is determined from these parameters, and the received audio signals are then continuously recognized by the target speech recognition engine until an end signal is recognized, so that all audio signals preceding the end signal are determined to be the complete speech information to be recognized. Because the target speech recognition engine best matching the received audio signal stream in the current scene is selected based on multiple parameters, automatic end recording recognition is completed faster and more accurately, more accurate speech content is obtained sooner, and the human-machine interaction experience is improved.
In order to better understand the relationship between the automatic end recording recognition scheme provided in this embodiment and the speech recognition engines, this embodiment further shows in fig. 3 a structural block diagram of how a user interacts with the cooperating functional modules and layers in the terminal device. Fig. 3 includes a user 310 who inputs audio signals, speech recognition engines 321, 322 and 323, an automatic end recording recognition module 330 that calls each speech recognition engine, a speech recognition policy management layer 340 that manages the automatic end recording recognition module 330, and a voice search 350 that receives the output of the policy management layer 340. Specifically, the automatic end recording recognition module 330 further includes speech-to-text 331 (converting audio stream data into text, for example via the native SFSpeechRecognizer API under the iOS operating system), volume detection 332 (measuring the decibel level of the input audio, for example by computing the root-mean-square of the sample vector of the native AVAudioPCMBuffer under iOS), and pause detection 333 (detecting pauses in the input audio; for example, under iOS, audio whose volume stays below a specific value x for a duration t is treated as a pause in which no one is speaking, and a recognized pause is treated as the automatic end signal that stops recording), which together implement recognition of speech content, volume detection, and recognition of the end signal. The speech recognition policy management layer 340 further includes service management 341 (managing the service types implemented by the underlying speech recognition engines, including the automatic end recognition service, the audio file recognition service, the manual end recording recognition service, and so on, which can ultimately be realized by configuring the engines with different parameters), fault-tolerance management 342 (for example, when a speech recognition engine reports an error, retrying it or switching to another engine so that recognition can continue after the failure), configuration management 343 (centrally managing the parameter configuration of the various speech recognition engines), and path management 344 (controlling which speech recognition engine is used apart from error handling, including switch configuration, version control, etc.), so as to manage the selection policy at different levels. The voice search 350 further includes speech recognition 351 and volume change 352, which are used together to provide a voice search service.
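Under the same iOS assumption, volume detection 332 and pause detection 333 might be sketched as follows; the threshold x and duration t correspond to the values named above, and the concrete numbers are placeholders:

```swift
import AVFoundation

// Volume detection (332): decibel level from the buffer's RMS value.
func decibelLevel(of buffer: AVAudioPCMBuffer) -> Float {
    guard let samples = buffer.floatChannelData?[0], buffer.frameLength > 0 else { return -160 }
    var sumOfSquares: Float = 0
    for i in 0..<Int(buffer.frameLength) { sumOfSquares += samples[i] * samples[i] }
    let rms = sqrt(sumOfSquares / Float(buffer.frameLength))
    return 20 * log10(max(rms, 1e-8)) // clamp to avoid log of zero
}

// Pause detection (333): volume below threshold x for duration t => end signal.
final class PauseDetector {
    private var silenceStart: Date?
    let thresholdDb: Float        // the "specific value x" (placeholder value)
    let holdTime: TimeInterval    // the duration "t" (placeholder value)

    init(thresholdDb: Float = -45, holdTime: TimeInterval = 1.5) {
        self.thresholdDb = thresholdDb
        self.holdTime = holdTime
    }

    // Returns true once silence has persisted long enough to count as the end signal.
    func isEndSignal(levelDb: Float, now: Date = Date()) -> Bool {
        if levelDb < thresholdDb {
            if silenceStart == nil { silenceStart = now }
            return now.timeIntervalSince(silenceStart!) >= holdTime
        }
        silenceStart = nil
        return false
    }
}
```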
To enhance understanding of how the implementation of determining selection priorities for each speech recognition engine is achieved, this embodiment also shows, by way of example, in FIG. 4, an implementation of jointly determining selection priorities using historical recognition speeds and historical recognition accuracies, the process 400 comprising the steps of:
Step 401: deriving the historical recognition speed of each speech recognition engine from the average recognition time each engine takes to recognize a first preset number of samples of speech information;
step 402: determining the speed priority of the corresponding speech recognition engine according to how fast each historical recognition speed is;
These two steps aim at having the execution subject derive the historical recognition speed of each speech recognition engine from the average time it takes to recognize a certain amount of sample speech information (which may come from historically received audio signal streams or from a specific sample speech set); for example, for audio signals lasting no more than 5 seconds, the average recognition time might be about 0.5 seconds. Then, according to how fast each historical recognition speed is, the execution subject assigns a higher speed priority to engines with higher recognition speeds and a lower speed priority to slower ones, so that through this priority setting the faster speech recognition engines are more likely to be selected to process subsequent audio signals.
Step 403: acquiring the number of secondary modifications made by the user to each speech recognition engine's recognition results for the sample speech information, and determining the historical recognition accuracy of the corresponding engine from that number;
Step 404: determining the accuracy priority of the corresponding speech recognition engine according to how high each historical recognition accuracy is;
These two steps aim at having the execution subject determine each engine's historical recognition accuracy according to the number of secondary modifications the user made to that engine's recognition results for the sample speech information (which may come from historically received audio signal streams or from a specific sample speech set); understandably, the more accurate the recognition, the lower the probability of modification and thus the lower the number of secondary modifications. The execution subject then assigns a higher accuracy priority to engines with higher historical recognition accuracy and a lower accuracy priority to less accurate ones, so that through this priority setting the more accurate speech recognition engines are more likely to be selected to process subsequent audio signals.
Step 405: determining the selection priority of each speech recognition engine according to its speed priority and accuracy priority.
Based on step 402 and step 404, this step aims to jointly determine, by the above-mentioned executing body, the selection priority corresponding to each speech recognition engine according to the speed priority and the accuracy priority of each speech recognition engine at the same time.
Considering that different usage scenarios may weigh speed priority and accuracy priority differently, which in turn changes the resulting selection priority, one implementation, including but not limited to the following, may be:
First, different first weights are assigned to the speed priority and different second weights to the accuracy priority for different application scenarios. Then the current target application scenario is determined from the audio signal stream, and the target first weight and target second weight matching that scenario are selected. The speed priority is weighted with the target first weight and the accuracy priority with the target second weight, yielding a weighted speed priority and a weighted accuracy priority. Finally, the sum of each speech recognition engine's weighted speed priority and weighted accuracy priority is determined as that engine's selection priority.
That is, this embodiment introduces scenario-dependent weights into the calculation of the selection priority, so that a selection priority better suited to the current application scenario is used to pick a target speech recognition engine better suited to recognizing the audio signal stream.
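A sketch of the whole priority computation of steps 401-405 plus the scenario weighting; the rank-based scoring is an assumption, since the disclosure fixes only the direction (faster engines and engines with fewer secondary modifications rank higher):

```swift
import Foundation

struct EngineHistory {
    let name: String
    let averageRecognitionTime: TimeInterval // e.g. 0.5 s for clips under 5 s
    let secondaryModificationCount: Int      // user edits of past recognition results
}

// Steps 402/404: rank engines so faster / more accurate ones get higher priority.
func ranks<T: Comparable>(_ engines: [EngineHistory], by key: (EngineHistory) -> T) -> [String: Int] {
    let ordered = engines.sorted { key($0) < key($1) } // smaller is better for both metrics
    var result: [String: Int] = [:]
    for (index, engine) in ordered.enumerated() {
        result[engine.name] = engines.count - index    // best engine gets the largest rank
    }
    return result
}

// Step 405 with scenario weighting: selection = w1 * speedPriority + w2 * accuracyPriority.
func selectionPriorities(for engines: [EngineHistory],
                         speedWeight w1: Double,
                         accuracyWeight w2: Double) -> [String: Double] {
    let speed = ranks(engines) { $0.averageRecognitionTime }
    let accuracy = ranks(engines) { $0.secondaryModificationCount }
    var priorities: [String: Double] = [:]
    for engine in engines {
        priorities[engine.name] = w1 * Double(speed[engine.name]!)
                                + w2 * Double(accuracy[engine.name]!)
    }
    return priorities
}
```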
The following details the implementation of the different determinations to the target speech recognition engine by fig. 5-8, respectively, wherein fig. 5 illustrates by a flow 500 one implementation comprising the steps of:
step 501: determining an offline speech recognition engine from among the alternative speech recognition engines in response to the network condition being in a state where the data transmission speed is below a preset speed;
step 502: the offline speech recognition engine with the highest selection priority is determined to be the target speech recognition engine that is adapted to the audio signal stream.
This embodiment uses two parameters, network condition and the selection priority of each speech recognition engine, to determine the target speech recognition engine adapted to the audio signal stream. It mainly addresses the case where the network condition is poor (that is, the data transmission speed is low, for example below 20 KB/s): since a poor network makes it inconvenient to use an online speech recognition engine that performs recognition over the network, the offline speech recognition engines supporting offline recognition are first screened out, and the one with the highest selection priority among them is determined as the target speech recognition engine adapted to the audio signal stream.
Fig. 6 shows another implementation by a flow 600 comprising the steps of:
Step 601: in response to the language category being a small language with the frequency of use lower than a preset frequency, determining a small language voice recognition engine supporting recognition of the small language from the alternative voice recognition engines;
step 602: the small language speech recognition engine with the highest selection priority is determined as the target speech recognition engine that is adapted to the audio signal stream.
Unlike the embodiment shown in fig. 5, which uses the two parameters of network condition and selection priority, this embodiment uses the two parameters of language type and selection priority to determine the target speech recognition engine adapted to the audio signal stream. It mainly addresses the case where the language type is a small language with a low frequency of use (where the frequency of use may be derived from the extent of the language's usage worldwide and its average usage rate). Since most speech recognition engines cannot recognize every language type, the small language speech recognition engines supporting recognition of the small language are first determined among the candidate engines, and the one with the highest selection priority among them is then determined as the target speech recognition engine adapted to the audio signal stream.
FIG. 7 is a further adjustment based on the implementation provided in FIG. 6, showing a further implementation comprising the steps of:
Step 701: in response to the language category being a small language with the frequency of use lower than a preset frequency, determining a small language voice recognition engine supporting recognition of the small language from the alternative voice recognition engines;
Step 701 corresponds to step 601 and will not be described in detail herein.
Step 702: in response to the small language speech recognition engines including a native engine adapted to the operating system of the current device and a non-native engine whose selection priority is not more than a preset number of levels higher than that of the native engine, determining the native engine as the target speech recognition engine adapted to the audio signal stream.
On the basis of step 701, this step aims at having the execution subject further distinguish, among the small language speech recognition engines, native engines adapted to the operating system of the current device from non-native engines. Taking the iOS operating system as an example, the system integrates three native speech recognition APIs: Speech, AVAudioSession, and AVAudioEngine. Speech is the iOS native speech recognition framework, whose internal SFSpeechRecognizer API provides speech recognition capabilities including audio stream recognition and audio file recognition; AVAudioSession is the iOS audio session management framework, which helps schedule audio input and output tasks; AVAudioEngine handles iOS microphone audio input and converts the data input by the microphone into an audio stream. Still taking iOS as an example, every speech recognition engine that is not native to the system is a non-native engine, such as speech recognition solutions developed by third parties. In this case, considering that native engines tend to be called more smoothly than non-native engines and usually offer comparatively better small language support, as long as the selection priority of the non-native engine is not more than a preset number of levels higher than that of the native engine, the native engine is determined as the target speech recognition engine adapted to the audio signal stream.
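For reference, a bare-bones use of the native Speech framework named above (SFSpeechRecognizer with a buffer-based request; the locale is an illustrative choice), sketching how a native engine would consume the stream and emit partial results:

```swift
import Speech

// Minimal sketch of audio-stream recognition with the native iOS Speech framework.
let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
let request = SFSpeechAudioBufferRecognitionRequest()
request.shouldReportPartialResults = true // stream partial transcriptions

let task = recognizer?.recognitionTask(with: request) { result, error in
    if let result = result {
        print("Partial: \(result.bestTranscription.formattedString)")
        if result.isFinal { print("Final result received") }
    } else if let error = error {
        print("Recognition error: \(error)") // a fault-tolerance layer could switch engines here
    }
}
// Buffers from the microphone tap are fed in via request.append(_:);
// request.endAudio() marks the end of the stream.
```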
Fig. 8 shows another implementation, by way of a flow 800, that includes the steps of:
step 801: in response to the language category being a small language with the frequency of use lower than a preset frequency, determining a small language voice recognition engine supporting recognition of the small language from the alternative voice recognition engines;
Step 802: responding to the network condition in a state that the data transmission speed is lower than a preset speed, and determining a voice recognition engine supporting offline operation in the small language voice recognition engines as an offline small language voice recognition engine;
Step 803: the offline small language speech recognition engine with the highest selection priority is determined to be the target speech recognition engine that is adapted to the audio signal stream.
Compared with the implementation modes respectively shown in fig. 5-7, the present embodiment uses three parameters, i.e. network condition, whether the language type is small language and selection priority, together to determine the target speech recognition engine adapted to the actual audio stream, and performs first screening according to the small language, then performs second screening according to the network condition, and finally performs screening according to the selection priority, so as to ensure the human-computer interaction experience as much as possible.
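The four selection flows of figs. 5-8 can be condensed into one filtering sketch; the 20 KB/s example threshold and the native-engine preference rule come from the description above, while the field names and the concrete preset level count are assumptions:

```swift
struct CandidateEngine {
    let name: String
    let supportedLanguages: Set<String>
    let worksOffline: Bool
    let isNative: Bool          // adapted to the current operating system
    let selectionPriority: Int  // from the weighted speed/accuracy computation
}

func targetEngine(from candidates: [CandidateEngine],
                  language: String,
                  isSmallLanguage: Bool,
                  downlinkKBps: Double,
                  minKBps: Double = 20,
                  nativePreferenceLevels: Int = 2) -> CandidateEngine? {
    var pool = candidates

    // Figs. 6-8: keep only engines that can recognize the small language.
    if isSmallLanguage {
        pool = pool.filter { $0.supportedLanguages.contains(language) }
    }
    // Figs. 5 and 8: poor network => only engines that support offline operation.
    if downlinkKBps < minKBps {
        pool = pool.filter { $0.worksOffline }
    }
    // Fig. 7: prefer the native engine unless a non-native engine outranks it
    // by more than the preset number of priority levels.
    if let native = pool.first(where: { $0.isNative }),
       let bestNonNative = pool.filter({ !$0.isNative })
                               .max(by: { $0.selectionPriority < $1.selectionPriority }),
       bestNonNative.selectionPriority <= native.selectionPriority + nativePreferenceLevels {
        return native
    }
    // Figs. 5, 6, 8: otherwise pick the highest remaining selection priority.
    return pool.max { $0.selectionPriority < $1.selectionPriority }
}
```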
It should be noted that each of the foregoing embodiments mainly concerns how to select, from multiple candidate speech recognition engines, the target speech recognition engine best adapted to the currently received audio signal stream; that is, each candidate engine should by default be in a callable state. In an actual application scenario, however, the application interface that initiates or receives the audio signal stream may fail to start or call a certain speech recognition engine (for example, on a low-version iOS system, a native speech recognition API may not be callable normally), in which case only engines that can be started or called normally may be selected. Also, since the speech recognition services provided by some non-native engines require additional fees, whether additional fees are allowed to be incurred should be taken as one of the exclusion criteria as well.
With further reference to fig. 9, as an implementation of the method shown in the foregoing figures, the present disclosure provides an embodiment of an apparatus for automatically ending a recording identification device, where the apparatus embodiment corresponds to the method embodiment shown in fig. 2, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 9, the automatic end recording recognition device 900 of this embodiment may include: an audio signal stream continuous receiving unit 901, a parameter determining unit 902, a target speech recognition engine determining unit 903, and an end signal recognition and to-be-recognized voice information determining unit 904. The audio signal stream continuous receiving unit 901 is configured to acquire a continuously received audio signal stream; the parameter determining unit 902 is configured to determine the current network condition, the language type corresponding to the audio signal stream, and the selection priority corresponding to each candidate speech recognition engine, the selection priority being determined based on historical recognition speed and historical recognition accuracy; the target speech recognition engine determining unit 903 is configured to determine a target speech recognition engine adapted to the audio signal stream according to the network condition, the language type, and the selection priority of each speech recognition engine; and the end signal recognition and to-be-recognized voice information determining unit 904 is configured to continuously recognize the audio signals of the audio signal stream with the target speech recognition engine until an end signal is recognized, and determine the audio signals preceding the end signal as the speech information to be recognized.
In this embodiment, in the automatic end recording identification apparatus 900: the specific processing and the technical effects of the continuous audio signal stream receiving unit 901, the parameter determining unit 902, the target speech recognition engine determining unit 903, the end signal recognition and the to-be-recognized speech information determining unit 904 may refer to the relevant descriptions of steps 201 to 204 in the corresponding embodiment of fig. 2, and are not repeated herein.
In some optional implementations of this embodiment, the automatic ending record identifying device 900 further includes: the selection priority determining unit may further include:
a historical recognition speed determining subunit configured to derive the historical recognition speed of each speech recognition engine from the average recognition time each engine takes to recognize a first preset number of samples of speech information;
a speed priority determining subunit configured to determine the speed priority of the corresponding speech recognition engine according to how fast each historical recognition speed is;
a historical recognition accuracy determining subunit configured to acquire the number of secondary modifications made by the user to each speech recognition engine's recognition results for the sample speech information, and to determine the historical recognition accuracy of the corresponding engine from that number;
an accuracy priority determining subunit configured to determine the accuracy priority of the corresponding speech recognition engine according to how high each historical recognition accuracy is;
a selection priority determining subunit configured to determine the selection priority of each speech recognition engine based on its speed priority and accuracy priority.
In some optional implementations of the present embodiment, the selection priority determination subunit is further configured to:
Determining different first weights allocated to the speed priorities in different application scenes and different second weights allocated to the accuracy priorities in different application scenes;
Determining a current target application scene according to the audio signal stream, and determining a target first weight and a target second weight matched with the target application scene;
Weighting the speed priority by using a first weight of the target, and weighting the accuracy priority by using a second weight of the target, so as to correspondingly obtain a weighted speed priority and a weighted accuracy priority;
And respectively determining the sum of the weighted speed priority and the weighted accuracy priority of each voice recognition engine as the selection priority of the corresponding voice recognition engine.
In some optional implementations of the present embodiment, the target speech recognition engine determination unit 903 may be further configured to:
determining an offline speech recognition engine from among the alternative speech recognition engines in response to the network condition being in a state where the data transmission speed is below a preset speed;
The offline speech recognition engine with the highest selection priority is determined to be the target speech recognition engine that is adapted to the audio signal stream.
In some optional implementations of the present embodiment, the target speech recognition engine determination unit 903 may be further configured to:
in response to the language category being a small language with the frequency of use lower than a preset frequency, determining a small language voice recognition engine supporting recognition of the small language from the alternative voice recognition engines;
The small language speech recognition engine with the highest selection priority is determined as the target speech recognition engine that is adapted to the audio signal stream.
In some optional implementations of the present embodiment, the target speech recognition engine determination unit 903 may be further configured to:
in response to the language category being a small language with the frequency of use lower than a preset frequency, determining a small language voice recognition engine supporting recognition of the small language from the alternative voice recognition engines;
in response to the small language speech recognition engines including a native engine adapted to the operating system of the current device and a non-native engine whose selection priority is not more than a preset number of levels higher than that of the native engine, determining the native engine as the target speech recognition engine adapted to the audio signal stream.
In some optional implementations of the present embodiment, the target speech recognition engine determination unit 903 may be further configured to:
in response to the language category being a small language with the frequency of use lower than a preset frequency, determining a small language voice recognition engine supporting recognition of the small language from the alternative voice recognition engines;
Responding to the network condition in a state that the data transmission speed is lower than a preset speed, and determining a voice recognition engine supporting offline operation in the small language voice recognition engines as an offline small language voice recognition engine;
The offline small language speech recognition engine with the highest selection priority is determined to be the target speech recognition engine that is adapted to the audio signal stream.
In some optional implementations of the present embodiment, the end signal recognition and to-be-recognized voice information determination unit 904 may include an end signal recognition subunit configured to continuously recognize the audio signals of the audio signal stream with the target speech recognition engine; the end signal recognition subunit may be further configured to:
Continuously transmitting the audio signal stream to a target voice recognition engine for continuous recognition;
And simultaneously continuously transmitting the preset white noise signal stream to other voice recognition engines of the non-target voice recognition engine.
In some optional implementations of this embodiment, the automatic ending record identifying device 900 may further include:
the voice recognition result determining unit is configured to perform voice content recognition on the voice information to be recognized by using the target voice recognition engine to obtain a voice recognition result;
and the voice recognition result conveying unit is configured to display the voice recognition result on a display screen and/or conduct voice broadcasting.
This embodiment exists as the device counterpart of the method embodiment. The automatic end recording recognition device provided here determines, for a continuously received audio signal stream, the network condition, language type, and selection priority parameters that influence the choice among multiple candidate speech recognition engines, determines from these parameters a target speech recognition engine adapted to the currently received audio signal stream, and then continuously recognizes the received audio signals with that engine until an end signal is recognized, so that all audio signals preceding the end signal are determined to be the complete speech information to be recognized. Because the target speech recognition engine best matching the received audio signal stream in the current scene is selected based on multiple parameters, automatic end recording recognition is completed faster and more accurately, more accurate speech content is obtained sooner, and the human-machine interaction experience is improved.
According to an embodiment of the present disclosure, the present disclosure further provides an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to implement the automatic end recording identification method described in any of the embodiments above.
According to an embodiment of the present disclosure, there is also provided a readable storage medium storing computer instructions for enabling a computer to implement the automatic end recording identification method described in any of the above embodiments when executed.
According to an embodiment of the present disclosure, the present disclosure further provides a computer program product, which when executed by a processor is capable of implementing the automatic end recording identification method described in any of the above embodiments.
Fig. 10 shows a schematic block diagram of an example electronic device 1000 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the apparatus 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data required for the operation of the device 1000 can also be stored. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
Various components in device 1000 are connected to I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and communication unit 1009 such as a network card, modem, wireless communication transceiver, etc. Communication unit 1009 allows device 1000 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The computing unit 1001 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1001 performs the respective methods and processes described above, for example, automatically ending the recording identification method. For example, in some embodiments, the auto-end recording identification method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 1000 via ROM 1002 and/or communication unit 1009. When the computer program is loaded into RAM 1003 and executed by computing unit 1001, one or more steps of the above-described automatic end recording identification method may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the auto-end recording identification method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field Programmable Gate Arrays (FPGAs), application Specific Integrated Circuits (ASICs), application Specific Standard Products (ASSPs), systems On Chip (SOCs), load programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), a middleware component (e.g., an application server), a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services.
According to the technical solutions of the embodiments of the present disclosure, for a continuously received audio signal stream, the parameters that influence the choice among a plurality of candidate speech recognition engines are first determined: the network condition, the language type, and the selection priority of each engine. A target speech recognition engine matching the currently received audio signal stream is then determined from these parameters, and the received audio signal is continuously recognized by the target speech recognition engine until an end signal is recognized, so that all audio signals located before the end signal are determined to be the complete speech information to be recognized. Because the target speech recognition engine matching the received audio signal stream in the current scenario is selected accurately on the basis of multiple parameters, automatic end-of-recording recognition completes more quickly and accurately, more accurate speech content is obtained, and the human-computer interaction experience is improved.
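By way of illustration only, the selection logic described above might be sketched in Python as follows; the Engine fields, the network-speed threshold, and the pick_engine helper are hypothetical and are not part of the disclosure:

```python
from dataclasses import dataclass

@dataclass
class Engine:
    name: str
    supports_offline: bool
    languages: set             # language codes the engine can recognize
    selection_priority: float  # higher is better; from historical speed/accuracy
    is_native: bool = False    # adapted to the current device's OS

def pick_engine(engines, network_speed_kbps, language,
                min_online_speed_kbps=64.0):
    """Pick the engine adapted to the current audio stream: filter by
    language support, fall back to offline engines when the network is
    below the preset speed, then take the highest selection priority."""
    candidates = [e for e in engines if language in e.languages]
    if network_speed_kbps < min_online_speed_kbps:
        candidates = [e for e in candidates if e.supports_offline]
    if not candidates:
        raise LookupError("no candidate engine fits the current scene")
    return max(candidates, key=lambda e: e.selection_priority)
```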
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be construed as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations, and substitutions are possible depending on design requirements and other factors. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present disclosure shall be included within the scope of the present disclosure.

Claims (21)

1. An automatic ending recording identification method, comprising:
acquiring a continuously received audio signal stream;
determining the current network condition, the language type corresponding to the audio signal stream, and the selection priority corresponding to each candidate speech recognition engine, wherein the selection priority is determined based on historical recognition speed and historical recognition accuracy;
determining a target speech recognition engine adapted to the audio signal stream according to the network condition, the language type, and the selection priority of each speech recognition engine; and
continuously recognizing the audio signal of the audio signal stream using the target speech recognition engine until an end signal is recognized, and determining the audio signal located before the end signal as the speech information to be recognized.
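One way the four steps of claim 1 could fit together is sketched below; the stream is assumed to yield audio chunks, is_end_signal is an assumed engine method, and pick_engine is the earlier hypothetical helper:

```python
def recognize_until_end(stream, engines, network_speed_kbps, language):
    """Sketch of claim 1: pick the adapted engine, then feed the stream
    to it until the engine reports an end signal."""
    target = pick_engine(engines, network_speed_kbps, language)
    chunks = []
    for chunk in stream:                 # continuously received audio
        if target.is_end_signal(chunk):  # end signal found: stop recording
            break
        chunks.append(chunk)             # audio located before the end signal
    return target, b"".join(chunks)      # the speech information to recognize
```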
2. The method of claim 1, wherein the determining of the selection priority comprises:
deriving the historical recognition speed of each speech recognition engine from the average recognition time that the engine took to recognize a first preset number of samples of speech information;
determining the speed priority of each speech recognition engine according to its historical recognition speed;
acquiring the number of subsequent user corrections made to each speech recognition engine's recognition results for the sample speech information, and determining the historical recognition accuracy of each engine according to its correction count;
determining the accuracy priority of each speech recognition engine according to its historical recognition accuracy; and
determining the selection priority of each speech recognition engine according to its speed priority and accuracy priority.
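The bookkeeping in claim 2 could look like the following sketch. The claim only fixes the directions (faster recognition and fewer user corrections rank higher); mapping both onto scores via 1/(1+x) is an assumption made here for illustration:

```python
def priority_scores(histories):
    """histories: {engine_name: (avg_seconds_per_sample, correction_count)}.
    Returns {engine_name: (speed_score, accuracy_score)}, each score in
    (0, 1] with higher meaning better."""
    scores = {}
    for engine, (avg_time, corrections) in histories.items():
        speed = 1.0 / (1.0 + avg_time)        # faster recognition -> higher
        accuracy = 1.0 / (1.0 + corrections)  # fewer user fixes -> higher
        scores[engine] = (speed, accuracy)
    return scores
```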
3. The method of claim 2, wherein determining the selection priority of each speech recognition engine according to its speed priority and accuracy priority comprises:
determining the first weight assigned to the speed priority and the second weight assigned to the accuracy priority in each of a plurality of application scenarios;
determining, from the audio signal stream, the target application scenario to which the stream currently belongs, and determining the target first weight and target second weight matching that scenario;
weighting the speed priority by the target first weight and the accuracy priority by the target second weight to obtain a weighted speed priority and a weighted accuracy priority; and
determining, for each speech recognition engine, the sum of its weighted speed priority and weighted accuracy priority as its selection priority.
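A minimal sketch of the weighted combination in claim 3; the scenario names and weight values are invented examples, since the claim only requires that the weights differ per application scenario:

```python
# Assumed example scenarios and weights (speed weight, accuracy weight).
SCENARIO_WEIGHTS = {
    "dictation":    (0.7, 0.3),  # speed matters more while taking quick notes
    "form_filling": (0.3, 0.7),  # accuracy matters more for structured input
}

def selection_priority(speed_score, accuracy_score, scenario):
    """Weighted sum of the two priorities under the target scenario."""
    w_speed, w_accuracy = SCENARIO_WEIGHTS[scenario]
    return w_speed * speed_score + w_accuracy * accuracy_score
```

For example, `selection_priority(0.8, 0.6, "dictation")` yields 0.74, favoring the faster engine in a speed-sensitive scenario.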
4. The method of claim 1, wherein the determining a target speech recognition engine adapted to the audio signal stream according to the network condition, the language type, and the selection priority of each speech recognition engine comprises:
determining the offline speech recognition engines among the candidate speech recognition engines in response to the network condition indicating a data transmission speed below a preset speed; and
determining the offline speech recognition engine with the highest selection priority as the target speech recognition engine adapted to the audio signal stream.
5. The method of claim 1, wherein the determining a target speech recognition engine adapted to the audio signal stream according to the network condition, the language type, and the selection priority of each speech recognition engine comprises:
determining, in response to the language type being a minority language whose usage frequency is below a preset frequency, the minority-language speech recognition engines among the candidate speech recognition engines that support recognizing the minority language; and
determining the minority-language speech recognition engine with the highest selection priority as the target speech recognition engine adapted to the audio signal stream.
6. The method of claim 1, wherein the determining a target speech recognition engine adapted to the audio signal stream according to the network condition, the language type, and the selection priority of each speech recognition engine comprises:
determining, in response to the language type being a minority language whose usage frequency is below a preset frequency, the minority-language speech recognition engines among the candidate speech recognition engines that support recognizing the minority language; and
in response to the minority-language speech recognition engines comprising a native engine adapted to the operating system of the current device and a non-native engine, and the non-native engine's selection priority being no more than a preset number of levels higher than the native engine's, determining the native engine as the target speech recognition engine adapted to the audio signal stream.
7. The method of claim 1, wherein the determining a target speech recognition engine adapted to the audio signal stream according to the network condition, the language type, and the selection priority of each speech recognition engine comprises:
determining, in response to the language type being a minority language whose usage frequency is below a preset frequency, the minority-language speech recognition engines among the candidate speech recognition engines that support recognizing the minority language;
determining, in response to the network condition indicating a data transmission speed below a preset speed, the engines among the minority-language speech recognition engines that support offline operation as offline minority-language speech recognition engines; and
determining the offline minority-language speech recognition engine with the highest selection priority as the target speech recognition engine adapted to the audio signal stream.
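Claims 4 through 7 are variations of one filtering step, so a single sketch can show all the branches. It reuses the hypothetical Engine dataclass from the first sketch; the threshold values and the native-engine priority margin are assumptions:

```python
def filter_and_pick(engines, network_speed_kbps, language, minority_languages,
                    min_online_speed_kbps=64.0, native_margin=1):
    """Combined sketch of the selection branches in claims 4-7."""
    candidates = list(engines)
    if language in minority_languages:                # claims 5-7
        candidates = [e for e in candidates if language in e.languages]
    if network_speed_kbps < min_online_speed_kbps:    # claims 4 and 7
        candidates = [e for e in candidates if e.supports_offline]
    if not candidates:
        raise LookupError("no engine left after filtering")
    best = max(candidates, key=lambda e: e.selection_priority)
    # Claim 6: keep the OS-native engine unless a non-native engine beats
    # it by more than the preset number of priority levels.
    natives = [e for e in candidates if e.is_native]
    if natives:
        best_native = max(natives, key=lambda e: e.selection_priority)
        if best.selection_priority - best_native.selection_priority <= native_margin:
            return best_native
    return best
```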
8. The method of claim 1, wherein the continuously recognizing the audio signal of the audio signal stream using the target speech recognition engine comprises:
continuously transmitting the audio signal stream to the target speech recognition engine for continuous recognition; and
continuously transmitting a preset white-noise signal stream to the other speech recognition engines that are not the target speech recognition engine.
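The routing step of claim 8 could be sketched as below; the engines' feed method is an assumed interface, and os.urandom is only a stand-in for a preset white-noise source:

```python
import os

def route_chunk(chunk, target, all_engines):
    """Claim 8 sketch: real audio goes to the target engine; the other
    engines receive equally sized stand-in noise so they stay warmed up
    and could take over later without a cold start."""
    for engine in all_engines:
        if engine is target:
            engine.feed(chunk)
        else:
            engine.feed(os.urandom(len(chunk)))
```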
9. The method of any of claims 1-8, further comprising:
performing speech content recognition on the speech information to be recognized using the target speech recognition engine to obtain a speech recognition result; and
displaying the speech recognition result on a display screen and/or broadcasting it by voice.
10. An automatic ending recording identification apparatus, comprising:
an audio signal stream receiving unit configured to acquire a continuously received audio signal stream;
a parameter determination unit configured to determine the current network condition, the language type corresponding to the audio signal stream, and the selection priority corresponding to each candidate speech recognition engine, wherein the selection priority is determined based on historical recognition speed and historical recognition accuracy;
a target speech recognition engine determination unit configured to determine a target speech recognition engine adapted to the audio signal stream according to the network condition, the language type, and the selection priority of each speech recognition engine; and
an end signal recognition and to-be-recognized speech information determination unit configured to continuously recognize the audio signal of the audio signal stream using the target speech recognition engine until an end signal is recognized, and to determine the audio signal located before the end signal as the speech information to be recognized.
11. The apparatus of claim 10, further comprising a selection priority determination unit, the selection priority determination unit comprising:
a historical recognition speed determination subunit configured to derive the historical recognition speed of each speech recognition engine from the average recognition time that the engine took to recognize a first preset number of samples of speech information;
a speed priority determination subunit configured to determine the speed priority of each speech recognition engine according to its historical recognition speed;
a historical recognition accuracy determination subunit configured to acquire the number of subsequent user corrections made to each speech recognition engine's recognition results for the sample speech information, and to determine the historical recognition accuracy of each engine according to its correction count;
an accuracy priority determination subunit configured to determine the accuracy priority of each speech recognition engine according to its historical recognition accuracy; and
a selection priority determination subunit configured to determine the selection priority of each speech recognition engine according to its speed priority and accuracy priority.
12. The apparatus of claim 11, wherein the selection priority determination subunit is further configured to:
determine the first weight assigned to the speed priority and the second weight assigned to the accuracy priority in each of a plurality of application scenarios;
determine, from the audio signal stream, the target application scenario to which the stream currently belongs, and determine the target first weight and target second weight matching that scenario;
weight the speed priority by the target first weight and the accuracy priority by the target second weight to obtain a weighted speed priority and a weighted accuracy priority; and
determine, for each speech recognition engine, the sum of its weighted speed priority and weighted accuracy priority as its selection priority.
13. The apparatus of claim 10, wherein the target speech recognition engine determination unit is further configured to:
determine the offline speech recognition engines among the candidate speech recognition engines in response to the network condition indicating a data transmission speed below a preset speed; and
determine the offline speech recognition engine with the highest selection priority as the target speech recognition engine adapted to the audio signal stream.
14. The apparatus of claim 10, wherein the target speech recognition engine determination unit is further configured to:
determine, in response to the language type being a minority language whose usage frequency is below a preset frequency, the minority-language speech recognition engines among the candidate speech recognition engines that support recognizing the minority language; and
determine the minority-language speech recognition engine with the highest selection priority as the target speech recognition engine adapted to the audio signal stream.
15. The apparatus of claim 10, wherein the target speech recognition engine determination unit is further configured to:
determine, in response to the language type being a minority language whose usage frequency is below a preset frequency, the minority-language speech recognition engines among the candidate speech recognition engines that support recognizing the minority language; and
in response to the minority-language speech recognition engines comprising a native engine adapted to the operating system of the current device and a non-native engine, and the non-native engine's selection priority being no more than a preset number of levels higher than the native engine's, determine the native engine as the target speech recognition engine adapted to the audio signal stream.
16. The apparatus of claim 10, wherein the target speech recognition engine determination unit is further configured to:
determine, in response to the language type being a minority language whose usage frequency is below a preset frequency, the minority-language speech recognition engines among the candidate speech recognition engines that support recognizing the minority language;
determine, in response to the network condition indicating a data transmission speed below a preset speed, the engines among the minority-language speech recognition engines that support offline operation as offline minority-language speech recognition engines; and
determine the offline minority-language speech recognition engine with the highest selection priority as the target speech recognition engine adapted to the audio signal stream.
17. The apparatus of claim 10, wherein the end signal recognition and to-be-recognized speech information determination unit comprises an end signal recognition subunit configured to continuously recognize the audio signal of the audio signal stream using the target speech recognition engine, the end signal recognition subunit being further configured to:
continuously transmit the audio signal stream to the target speech recognition engine for continuous recognition; and
continuously transmit a preset white-noise signal stream to the other speech recognition engines that are not the target speech recognition engine.
18. The apparatus of any of claims 10-17, further comprising:
a speech recognition result determination unit configured to perform speech content recognition on the speech information to be recognized using the target speech recognition engine to obtain a speech recognition result; and
a speech recognition result output unit configured to display the speech recognition result on a display screen and/or broadcast it by voice.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the automatic ending recording identification method of any one of claims 1-9.
20. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the automatic ending recording identification method of any one of claims 1-9.
21. A computer program product comprising a computer program which, when executed by a processor, implements the steps of the automatic ending recording identification method of any one of claims 1-9.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311765815.4A CN118098291A (en) 2023-12-20 2023-12-20 Automatic ending recording identification method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN118098291A true CN118098291A (en) 2024-05-28

Family

ID=91160662

Country Status (1)

Country Link
CN (1) CN118098291A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination