CN112002312A

CN112002312A - Voice recognition method, device, computer program product and storage medium

Info

Publication number: CN112002312A
Application number: CN201910381065.8A
Authority: CN
Inventors: 常伟
Original assignee: SF Technology Co Ltd
Current assignee: SF Technology Co Ltd; SF Tech Co Ltd
Priority date: 2019-05-08
Filing date: 2019-05-08
Publication date: 2020-11-27

Abstract

The embodiments of the application disclose a voice recognition method, a voice recognition device, a computer program product and a storage medium. The voice recognition device acquires audio to be recognized; judges, through a conversion interface, whether the audio to be recognized meets a preset specification; if the audio does not meet the preset specification, converts it into audio that meets the preset specification through the conversion interface; and then calls a third-party interface encapsulating a voice recognition engine to perform voice recognition processing on the audio that meets the preset specification, obtaining a voice recognition result. Because the conversion interface can convert audio that does not meet the preset specification into audio that does, the development cost of the voice recognition engine can be reduced.

Description

Voice recognition method, device, computer program product and storage medium

Technical Field

The present application relates to the field of communications technologies, and in particular, to a speech recognition method, apparatus, computer program product, and storage medium.

Background

The speech recognition technology has achieved remarkable results, and is widely applied to a plurality of fields such as household appliances, communication, automotive electronics, medical treatment, home services, consumer electronics and the like. Speech recognition is the process of letting a machine translate audio into corresponding text or commands by recognition and understanding.

There are many open platforms for speech recognition in the market, and for the input audio to be recognized, the specification of each speech recognition engine needs to be strictly followed, for example, the audio to be recognized needs to use a specified audio format, the length of the audio to be recognized cannot exceed a preset size, and the like.

The traditional speech recognition equipment can only recognize the speech which accords with the speech recognition engine business specification, and different speech recognition engines have different limits on the speech to be recognized, so that different types of speech recognition engines need to be arranged for different forms of audio, and the development cost of the speech recognition engines is high.

Disclosure of Invention

Embodiments of the present application provide a speech recognition method, apparatus, computer program product, and storage medium, which can convert an audio to be recognized into an audio meeting a preset specification, and reduce the development cost of a speech recognition engine.

In one aspect, the present application is directed to a method of speech recognition, the method comprising:

acquiring audio to be identified;

judging whether the audio to be identified meets a preset specification or not through a conversion interface;

if the audio does not meet the preset specification, converting the audio to be identified into audio that meets the preset specification through the conversion interface;

and calling a third-party interface packaged with a voice recognition engine to perform voice recognition processing on the audio meeting the preset specification to obtain a voice recognition result.

Optionally, in some embodiments, the determining, by the conversion interface, whether the audio to be recognized meets a preset specification includes:

judging whether the audio to be identified is non-compressed audio through the conversion interface; and/or,

judging whether the audio format of the audio to be identified is a preset audio format through the conversion interface; and/or,

and judging whether the audio size of the audio to be identified is not larger than a preset audio size through the conversion interface.

Optionally, in some embodiments, before the obtaining the audio to be identified, the method further includes:

and encapsulating an audio decompression method, an audio transcoding method and/or an audio segmentation method in the conversion interface.

Optionally, in some embodiments, if the audio to be recognized does not meet the preset specification, converting the audio to be recognized into an audio meeting the preset specification through the conversion interface includes:

if the audio to be identified is not uncompressed audio, performing audio decompression processing on the audio to be identified by the audio decompression method to obtain the uncompressed audio to be identified; and/or,

if the audio format is not the preset audio format, performing audio transcoding processing on the audio to be identified by the audio transcoding method to obtain the audio to be identified in the preset audio format; and/or,

if the audio size is larger than the preset audio size, performing audio segmentation processing on the audio to be identified by the audio segmentation method to obtain a plurality of sub-audios to be identified, wherein each sub-audio to be identified is not larger than the preset audio size.

Optionally, in some embodiments, the method further includes:

judging whether the audio to be identified is a history identification audio according to the audio name of the audio to be identified;

if the audio is the historical recognition audio, extracting a voice recognition result corresponding to the audio name from a database;

if the audio is not the historical identification audio, judging whether the audio to be identified meets the preset specification or not through the conversion interface.

Optionally, in some embodiments, after the calling a third party interface encapsulated with a speech recognition engine to perform speech recognition processing on the audio meeting the preset specification to obtain a speech recognition result, the method further includes:

and storing the voice recognition result into the database.

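Optionally, in some embodiments, before the obtaining the audio to be identified, the method further includes:

carrying out localized deployment on the voice recognition engine;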

packaging the locally deployed speech recognition engine into the third-party interface;

and packaging the third party interface into the conversion interface.

Correspondingly, the present application further provides a speech recognition apparatus, specifically including:

the acquisition unit is used for acquiring the audio to be identified;

the judging unit is used for judging whether the audio to be identified meets the preset specification through a conversion interface;

the conversion unit is used for converting the audio to be recognized into the audio meeting the preset specification through the conversion interface when the audio to be recognized does not meet the preset specification;

and the recognition unit is used for calling a third-party interface packaged with a voice recognition engine to perform voice recognition processing on the audio meeting the preset specification to obtain a voice recognition result.

Optionally, in some embodiments, the determining unit is specifically configured to:

Optionally, in some embodiments, the apparatus further comprises:

and the first packaging unit is used for packaging the audio decompression method, the audio transcoding method and/or the audio segmentation method in the conversion interface.

Optionally, in some embodiments, the conversion unit is specifically configured to:

Optionally, in some embodiments, the obtaining unit is specifically configured to:

Optionally, in some embodiments, the apparatus further comprises:

and the storage unit is used for storing the voice recognition result into the database.

Optionally, in some embodiments, the apparatus further comprises:

the deployment unit is used for carrying out localized deployment on the voice recognition engine;

the second packaging unit is used for packaging the speech recognition engine after localized deployment into the third-party interface;

and the third packaging unit is used for packaging the third party interface into the conversion interface.

Yet another aspect of the present application provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the method of the above-described aspects.

In addition, a storage medium is provided, where a plurality of instructions are stored, and the instructions are suitable for being loaded by a processor to perform the steps in any one of the speech recognition methods provided in the embodiments of the present application.

In the embodiments of the application, a voice recognition device acquires audio to be recognized; judges, through a conversion interface, whether the audio to be recognized meets a preset specification; if the audio does not meet the preset specification, converts it into audio that meets the preset specification through the conversion interface; and then calls a third-party interface encapsulating a voice recognition engine to perform voice recognition processing on the audio that meets the preset specification, obtaining a voice recognition result. Because the conversion interface can convert audio that does not meet the preset specification into audio that does, the development cost of the voice recognition engine can be reduced.

Drawings

To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flow chart of a speech recognition method provided in an embodiment of the present application;

Fig. 2 is a schematic flow chart of another speech recognition method provided in an embodiment of the present application;

Fig. 3 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;

Fig. 4 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;

Fig. 5 is a schematic structural diagram of a network device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

In the description that follows, specific embodiments of the present application are described with reference to steps and symbolic representations of operations performed by one or more computers, unless otherwise indicated. These steps and operations, which are at times referred to as being computer-executed, include the manipulation by the computer's processing unit of electronic signals representing data in a structured form. This manipulation transforms the data or maintains it at locations in the computer's memory system, which reconfigures or otherwise alters the operation of the computer in a manner well known to those skilled in the art. The data structures in which the data is maintained are physical locations of the memory that have particular properties defined by the format of the data. However, while the principles of the application are described in the foregoing terms, this is not meant to be limiting, and those of ordinary skill in the art will recognize that various of the steps and operations described below may also be implemented in hardware.

The principles of the present application may be employed in numerous other general-purpose or special-purpose computing, communication environments or configurations. Examples of well known computing systems, environments, and configurations that may be suitable for use with the application include, but are not limited to, hand-held telephones, personal computers, servers, multiprocessor systems, microcomputer-based systems, mainframe-based computers, and distributed computing environments that include any of the above systems or devices.

The terms "first", "second", and "third", etc. in this application are used to distinguish between different objects and not to describe a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions.

The embodiment of the application provides a voice recognition method, a voice recognition device, a computer program product and a storage medium.

The speech recognition apparatus in the present application may be integrated in a network device. The network device may be a server or a terminal, and the terminal may include a mobile phone, a tablet computer, a notebook computer, and/or a personal computer (PC), etc.

Referring to fig. 1, fig. 1 is a schematic flow chart of a speech recognition method according to an embodiment of the present application. The method comprises the following specific processes:

101. Acquiring the audio to be identified.

The audio to be identified comprises parameter information such as audio information, compression information, format information, name information and the like of the audio to be identified.

In some embodiments, the caller transmits the audio to be recognized to the conversion interface. The conversion interface may be an http interface; it is open to the outside and exposes parameters, such as the audio to be recognized, for the caller to pass in.
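As an illustration of the parameter information a caller might pass to the conversion interface, the following is a minimal sketch. The class name `AudioToRecognize` and its field names are hypothetical and do not come from the application; they simply mirror the audio, compression, format and name information listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioToRecognize:
    """Hypothetical container for what a caller passes to the conversion interface."""

    name: str          # name information, e.g. a file name
    data: bytes        # the audio information itself
    audio_format: str  # format information, e.g. "mp3" or "wav"
    compressed: bool   # compression information

    @property
    def size(self) -> int:
        """Audio size in bytes."""
        return len(self.data)


sample = AudioToRecognize(
    name="greeting.wav", data=b"\x00" * 16, audio_format="wav", compressed=False
)
```

Keeping the size as a derived property, rather than a caller-supplied field, avoids a mismatch between the declared and the actual audio size.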

In some embodiments, before obtaining the audio to be identified, the method further includes encapsulating, in the conversion interface, an audio decompression method, an audio transcoding method, and/or an audio slicing method.

Besides, methods such as an audio storage method and/or a text recording method may be packaged in the conversion interface.

In some embodiments, prior to obtaining the audio to be identified, the method further comprises: carrying out localized deployment on a voice recognition engine; and packaging the third party interface packaged with the voice recognition engine into a conversion interface.

In some embodiments, before packaging the third-party interface into the conversion interface, packaging the locally deployed speech recognition engine into the third-party interface is further included.

Specifically, the third-party interface may be any interface capable of encapsulating a speech recognition engine, such as a Software Development Kit (SDK) or a Web Application Programming Interface (WebApi).
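The layering described here (engine wrapped by a third-party interface, which is in turn wrapped by the conversion interface) can be sketched as follows. This is only one possible reading; the class names `LocalEngineApi` and `ConversionInterface`, the `recognize` and `handle` methods, and the stand-in `fake_engine` are all assumptions made for illustration.

```python
from typing import Callable, Protocol


class ThirdPartyInterface(Protocol):
    """Anything that can turn audio bytes into text (an SDK, a WebApi, ...)."""

    def recognize(self, audio: bytes) -> str: ...


class LocalEngineApi:
    """Wraps a locally deployed engine behind a uniform recognize() call."""

    def __init__(self, engine: Callable[[bytes], str]) -> None:
        self._engine = engine

    def recognize(self, audio: bytes) -> str:
        return self._engine(audio)


class ConversionInterface:
    """Outer interface that callers talk to; it owns the third-party interface."""

    def __init__(self, third_party: ThirdPartyInterface) -> None:
        self._third_party = third_party

    def handle(self, audio: bytes) -> str:
        # Specification checks and conversions would run here before delegating.
        return self._third_party.recognize(audio)


def fake_engine(audio: bytes) -> str:
    """Stand-in for a real speech recognition engine."""
    return f"<{len(audio)} bytes transcribed>"


interface = ConversionInterface(LocalEngineApi(fake_engine))
```

With this arrangement the caller only ever sees `ConversionInterface`, so the engine behind it can be swapped without changing calling code.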

102. Judging whether the audio to be identified meets the preset specification through the conversion interface.

The preset specification corresponds to the specification required by the speech recognition engine that serves the speech recognition device. For example, if that engine requires uncompressed audio, the MP3 (Moving Picture Experts Group Audio Layer III) format, and an audio size of no more than 10000000 bytes, then the preset specification in the conversion interface is likewise: uncompressed audio, MP3 format, audio size of no more than 10000000 bytes.

Specifically, whether the audio to be recognized meets the preset specification or not is judged through the conversion interface, and the method comprises the following steps:

judging whether the audio to be identified is non-compressed audio through the conversion interface; and/or,

judging whether the audio format of the audio to be identified is a preset audio format through the conversion interface; and/or,

and judging whether the audio size of the audio to be identified is not larger than the preset audio size through the conversion interface.

The judgment sequence for judging whether the audio to be identified is non-compressed audio, whether the audio is in a preset audio format and whether the audio is not larger than the preset audio size is not limited, and the three judgment steps can be carried out simultaneously or sequentially.

The number of judgments is determined by the preset specification of the conversion interface. If the preset specification only sets a non-compressed audio requirement, only the judgment of whether the audio to be identified is non-compressed audio is needed. If both a non-compressed audio requirement and an audio format requirement are set, it is necessary to judge both whether the audio to be identified is non-compressed audio and whether it is in the preset audio format. The number and type of requirements in the preset specification are not limited herein.
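The idea that only the configured requirements are checked can be sketched as below. This is a minimal illustration, not the application's implementation; `PresetSpec`, its field names and the `violations` function are hypothetical, and the example values follow the 10000000-byte MP3 example given earlier.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PresetSpec:
    """Hypothetical preset specification; an unset field means 'not checked'."""

    require_uncompressed: bool = False
    audio_format: Optional[str] = None
    max_size_bytes: Optional[int] = None


def violations(
    spec: PresetSpec, *, compressed: bool, audio_format: str, size_bytes: int
) -> List[str]:
    """Return the requirements the audio fails; an empty list means it conforms."""
    found: List[str] = []
    if spec.require_uncompressed and compressed:
        found.append("compressed")
    if spec.audio_format is not None and audio_format != spec.audio_format:
        found.append("format")
    if spec.max_size_bytes is not None and size_bytes > spec.max_size_bytes:
        found.append("size")
    return found


spec = PresetSpec(
    require_uncompressed=True, audio_format="mp3", max_size_bytes=10_000_000
)
```

The three checks are independent of one another, which matches the statement that they may be carried out in any order or simultaneously.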

In some embodiments, the determining whether the audio to be recognized meets the preset specification through the conversion interface includes:

if the audio is not the historical identification audio, judging whether the audio to be identified meets the preset specification through the conversion interface.

In some embodiments, after the speech recognition device receives the audio to be recognized, it is further required to receive a speech conversion instruction, and then the speech recognition device determines whether the audio to be recognized meets a preset specification through a conversion interface according to the speech conversion instruction.

103. If the audio does not meet the preset specification, converting the audio to be identified into audio that meets the preset specification through the conversion interface.

The conversion interface encapsulates an audio decompression method, an audio transcoding method and/or an audio segmentation method. Specifically, if the audio does not meet the preset specification, converting the audio to be identified into audio that meets the preset specification through the conversion interface includes:

if the audio to be identified is not uncompressed audio, performing audio decompression processing on the audio to be identified by the audio decompression method to obtain the uncompressed audio to be identified; and/or,

if the audio format is not the preset audio format, performing audio transcoding processing on the audio to be identified by the audio transcoding method to obtain the audio to be identified in the preset audio format; and/or,

and if the audio size is larger than the preset audio size, performing audio segmentation processing on the audio to be identified by an audio segmentation method to obtain a plurality of sub-audios to be identified, wherein the sub-audios are not larger than the preset audio size.

If the audio to be identified is already uncompressed audio, audio decompression processing is not required. Similarly, if the audio format of the audio to be identified is the preset audio format, audio transcoding processing is not required, and if the size of the audio to be identified is not larger than the preset audio size, segmentation processing is not required.

In some embodiments, if the audio to be identified requires audio decompression, audio transcoding and audio segmentation, it is first decompressed, the decompressed audio is then transcoded, and finally the decompressed and transcoded audio is segmented.

If the audio to be recognized is the audio meeting the preset specification, the conversion interface does not need to process the audio to be recognized at the moment, and directly transmits the audio to be recognized to the third-party interface, so that the voice recognition engine packaged on the third-party interface performs voice recognition processing on the audio information in the audio to be recognized.

104. Calling a third-party interface encapsulating a voice recognition engine to perform voice recognition processing on the audio that meets the preset specification to obtain a voice recognition result.

After the audio meeting the preset specification is obtained through the conversion interface, the voice recognition device calls a third-party interface packaged with a voice recognition engine to perform voice recognition processing on the audio meeting the preset specification to obtain a voice recognition result, wherein the voice recognition result is text information corresponding to the audio to be recognized.

In some embodiments, the speech recognition engine is stored locally, that is, the speech recognition engine is stored in a server corresponding to the speech recognition device, for example, if the chat software on the mobile phone performs speech recognition (speech to text) on speech sent by the user to the chat software, the speech needs to be sent to the corresponding server, and then the speech recognition engine deployed in the server performs speech recognition on the speech to obtain text information corresponding to the speech.

In some embodiments, after the voice recognition result is obtained from the voice recognition engine, the voice recognition result is returned to the display interface of the voice recognition device.

In some embodiments, after the voice recognition result is obtained from the voice recognition engine, the voice recognition result is also stored in the database; in addition, the audio to be recognized can also be stored in the database.
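Storing a recognition result keyed by audio name could look like the sketch below. The application does not name a database engine; SQLite, the table name `results` and the helper names `open_store` and `save_result` are assumptions chosen only to keep the example self-contained.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a small result store; ':memory:' keeps it in RAM."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "audio_name TEXT PRIMARY KEY, text TEXT NOT NULL)"
    )
    return conn


def save_result(conn: sqlite3.Connection, audio_name: str, text: str) -> None:
    """Store the recognition result, replacing any earlier one for that name."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO results (audio_name, text) VALUES (?, ?)",
            (audio_name, text),
        )


conn = open_store()
save_result(conn, "greeting.mp3", "hello world")
```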

Referring to fig. 2, fig. 2 is another schematic flow chart of a speech recognition method according to an embodiment of the present application, which is described in this embodiment by taking an execution subject as a speech recognition device, a conversion interface as an http interface, and a third-party interface as a WebApi, where a specific flow of the method may be as follows:

201. The voice recognition device acquires audio to be recognized.

In some embodiments, the caller transmits the audio to be recognized to an http interface in the speech recognition device. The http interface is open to the outside and exposes parameters, such as the audio to be recognized, for the caller to pass in.

In some embodiments, before obtaining the audio to be identified, the method further includes encapsulating an audio decompression method, an audio transcoding method, and/or an audio slicing method in the http interface.

Besides, methods such as an audio storage method and/or a text recording method can be packaged in the http interface, wherein the audio storage method is used for storing the audio to be recognized, and the text recording method is used for storing the recognized voice recognition result into a local database or other related databases.

In some embodiments, prior to obtaining the audio to be identified, the method further comprises: carrying out localized deployment on a voice recognition engine; and packaging the WebApi packaged with the voice recognition engine into an http interface.

In some embodiments, before encapsulating the WebApi into the http interface, encapsulating the locally deployed speech recognition engine into the WebApi is further included.

202. The voice recognition device judges whether the audio to be recognized is a history recognition audio according to the audio name of the audio to be recognized, if not, step 203 is executed, and if so, step 211 is executed.

After the speech recognition device receives the audio to be recognized, in order to save resources, it may first determine whether the audio to be recognized is an already recognized audio according to the audio name of the audio to be recognized, and specifically, the speech recognition device stores the audio name of the already recognized audio and a speech recognition result corresponding to the already recognized audio.

If the audio to be recognized has already been recognized, the voice recognition result corresponding to the audio name is extracted from the database without recognizing the audio again; if the audio to be recognized has not yet been recognized, the following steps are executed.
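The history check in step 202 amounts to a lookup by audio name before any recognition work is done. A minimal sketch follows, using an in-memory dictionary in place of the database; `recognize_with_history` and `fake_recognize` are hypothetical names.

```python
from typing import Callable, Dict, List


def recognize_with_history(
    name: str,
    audio: bytes,
    history: Dict[str, str],
    recognize: Callable[[bytes], str],
) -> str:
    """Return the stored result for a known audio name, else recognize and store."""
    if name in history:       # history recognition audio: reuse the stored result
        return history[name]
    text = recognize(audio)   # not seen before: run the full recognition path
    history[name] = text
    return text


engine_calls: List[bytes] = []


def fake_recognize(audio: bytes) -> str:
    """Stand-in engine that records each call it receives."""
    engine_calls.append(audio)
    return "hello"


history: Dict[str, str] = {}
first = recognize_with_history("a.mp3", b"\x01\x02", history, fake_recognize)
second = recognize_with_history("a.mp3", b"\x01\x02", history, fake_recognize)
```

Because the lookup key is the audio name alone, this approach presumes that two different recordings never share a name.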

203. The voice recognition device judges whether the audio to be recognized is non-compressed audio through the http interface, if not, step 204 is executed, and if yes, step 205 is executed.

Because the specification of the voice recognition engine corresponding to the voice recognition device requires non-compressed audio, after the audio to be recognized is received, the http interface can be used to judge whether it is non-compressed audio. If so, the next judgment step is executed; if not, the audio to be recognized must first undergo audio decompression.

204. The voice recognition device performs audio decompression processing on the audio to be recognized through the audio decompression method to obtain the non-compressed audio to be recognized.

Because the audio decompression method is packaged in the http interface, when the audio to be identified is judged to be the compressed audio, the audio to be identified can be subjected to audio decompression processing through the audio decompression method, so that the non-compressed audio to be identified is obtained.
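The application does not say which compression scheme the audio decompression method handles. Purely as an assumption for illustration, the sketch below treats "compressed audio" as a gzip-wrapped file, detects it by the gzip magic bytes, and decompresses it with the standard library; `is_compressed` and `ensure_uncompressed` are hypothetical names.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_compressed(data: bytes) -> bool:
    """Judge whether the payload is gzip-compressed (assumed scheme)."""
    return data.startswith(GZIP_MAGIC)


def ensure_uncompressed(data: bytes) -> bytes:
    """Decompress only when needed; uncompressed input is returned unchanged."""
    return gzip.decompress(data) if is_compressed(data) else data


raw_audio = b"RIFF" + bytes(32)       # stand-in for real audio bytes
packed_audio = gzip.compress(raw_audio)
```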

205. The voice recognition device judges whether the audio format of the audio to be recognized is a preset audio format through the http interface, if not, step 206 is executed, and if so, step 207 is executed.

Because the specification of the voice recognition engine corresponding to the voice recognition device requires a preset audio format, for example the MP3 format, after the audio to be recognized is received, or after it has undergone audio decompression processing, the http interface is used to judge whether its audio format is the preset audio format.

Specifically, after audio decompression processing has been performed on the audio to be recognized (if it was required) and the non-compressed audio to be recognized has been obtained, the voice recognition device judges, through the http interface, whether the audio format of that audio is the preset audio format. If so, the next judgment step may be executed; if not, audio transcoding processing is performed on the audio to be recognized (on its decompressed form, if audio decompression processing was performed earlier).
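One way to judge the audio format without trusting the file name is to inspect the leading bytes. The sketch below is a heuristic under stated assumptions, not the application's method: it recognizes only WAV (a `RIFF` header with a `WAVE` tag) and MP3 (an `ID3` tag or an MPEG frame sync), and the names `sniff_format` and `matches_preset` are hypothetical.

```python
from typing import Optional


def sniff_format(data: bytes) -> Optional[str]:
    """Guess the container from leading bytes; None when nothing matches."""
    if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    if data[:3] == b"ID3":  # MP3 file that starts with an ID3v2 tag
        return "mp3"
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "mp3"        # 11 set bits: MPEG audio frame sync
    return None


def matches_preset(data: bytes, preset_format: str = "mp3") -> bool:
    """Judge whether the audio is already in the preset audio format."""
    return sniff_format(data) == preset_format


wav_header = b"RIFF\x24\x00\x00\x00WAVEfmt "
```

The frame-sync test is deliberately loose and can also match other MPEG-style streams, so a production check would parse more of the header.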

206. The voice recognition equipment carries out audio transcoding processing on the audio to be recognized through an audio transcoding method to obtain the audio to be recognized in a preset audio format.

Because the http interface is encapsulated with the audio transcoding method, when the audio format of the audio to be identified is determined not to be the preset audio format (for example, not to be the MP3 format), the audio to be identified needs to be subjected to audio transcoding processing by the audio transcoding method, so as to obtain the audio to be identified in the preset audio format, wherein if the audio to be identified is subjected to audio decompression processing before, at this time, the audio to be identified subjected to audio decompression processing is subjected to audio transcoding processing by the audio transcoding method, so as to obtain the audio to be identified conforming to the preset audio format.
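The application does not specify how the audio transcoding method is implemented. One common approach, assumed here only for illustration, is to shell out to the `ffmpeg` command-line tool, which picks the output format from the destination file extension. The helper names are hypothetical, and `transcode` requires `ffmpeg` to be installed and on the PATH.

```python
import subprocess
from pathlib import Path
from typing import List


def build_transcode_command(src: Path, dst: Path) -> List[str]:
    """Build an ffmpeg call: -y overwrites dst, -i names the input file."""
    return ["ffmpeg", "-y", "-i", str(src), str(dst)]


def transcode(src: Path, preset_format: str = "mp3") -> Path:
    """Transcode src into the preset format next to it and return the new path."""
    dst = src.with_suffix("." + preset_format)
    # check=True raises CalledProcessError if ffmpeg exits with an error.
    subprocess.run(build_transcode_command(src, dst), check=True)
    return dst


command = build_transcode_command(Path("input.wav"), Path("input.mp3"))
```

Separating command construction from execution keeps the part that can be checked without `ffmpeg` apart from the part that actually runs it.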

207. The voice recognition device judges whether the audio size of the audio to be recognized is not larger than the preset audio size through the http interface, if not, step 208 is executed, and if so, step 209 is executed.

Because the specification of the voice recognition engine corresponding to the voice recognition device requires that the audio size not exceed the preset audio size, after the audio to be recognized is received, or after it has undergone audio decompression processing and/or audio transcoding processing, the http interface is used to judge whether the audio size of the audio to be recognized is not larger than the preset audio size.

208. The voice recognition equipment carries out audio segmentation processing on the audio to be recognized through an audio segmentation method to obtain a plurality of sub-audios to be recognized, wherein the sub-audios are not larger than the preset audio size.

Because the http interface encapsulates the audio segmentation method, when the size (in bytes) of the audio to be recognized is judged to be larger than the preset audio size, the audio to be recognized is segmented by the audio segmentation method into a plurality of sub-audios to be recognized, each not larger than the preset audio size. The sub-audios to be recognized are then sent in sequence to the third-party interface encapsulating the voice recognition engine, so that the voice recognition engine in the third-party interface performs voice recognition processing on each of them.

If the audio to be identified is subjected to audio decompression processing and/or audio transcoding processing before, then audio segmentation processing is performed on the audio to be identified which is subjected to the audio decompression processing and/or the audio transcoding processing at this time.
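A byte-size segmentation of the kind described can be sketched as follows. This is an illustrative sketch and not the applicant's code; `split_audio` is a hypothetical name, and the split is done on raw byte offsets, whereas a practical splitter for a framed format such as MP3 would more likely cut on frame boundaries.

```python
from typing import List


def split_audio(data: bytes, max_size: int) -> List[bytes]:
    """Cut audio into consecutive pieces, none larger than max_size bytes."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    if len(data) <= max_size:
        return [data]  # already within the preset audio size
    return [data[i:i + max_size] for i in range(0, len(data), max_size)]


parts = split_audio(bytes(range(25)), 10)
```

Concatenating the pieces in order reproduces the original bytes, so no audio data is lost by the split itself.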

The specific code for performing audio segmentation processing on the audio to be recognized may be as follows:

209. The voice recognition device calls the WebApi encapsulating the voice recognition engine to perform voice recognition processing on the audio that meets the preset specification to obtain a voice recognition result.

After the voice recognition device converts the audio to be recognized into the audio meeting the preset specification through the http interface, the WebApi packaged with the voice recognition engine is called to perform voice recognition processing on the audio meeting the preset specification, and a voice recognition result is obtained.

Specifically, in some embodiments, the speech recognition engine is deployed locally, that is, on a server corresponding to the speech recognition device. For example, when chat software on a mobile phone performs speech recognition (speech to text) on a voice message sent by the user, the voice message is sent to the corresponding server, and the speech recognition engine deployed on that server recognizes it to obtain the corresponding text information.
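Step 209 can be illustrated with a short sketch. The signature of the WebApi is not given in the application, so it is represented here by a caller-supplied function `recognize`, which is an assumption, as is the choice to join the partial results with a space. The sketch shows only the order of operations: each sub-audio is passed to the recognizer in sequence and the texts are combined.

```python
from typing import Callable, Iterable


def recognize_sub_audios(
    sub_audios: Iterable[bytes],
    recognize: Callable[[bytes], str],
) -> str:
    """Send each sub-audio to the recognizer in order and join the texts.

    `recognize` stands in for the WebApi wrapping the speech recognition
    engine; its real signature is not given in the source.
    """
    texts = [recognize(chunk) for chunk in sub_audios]
    # Joining with a space is an assumption; empty partial results are dropped.
    return " ".join(t for t in texts if t)
```

Passing the recognizer in as a parameter keeps the sketch independent of whether the engine is reached over http or called in-process on the server.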

210. The speech recognition device stores the speech recognition result in a database.

211. The voice recognition device extracts the voice recognition result corresponding to the audio name from the database.

The database stores the recognition results of previously recognized audio together with the corresponding audio names. Therefore, after the voice recognition device receives the audio to be recognized, it can use the audio name to judge whether the audio has been recognized before, which improves recognition speed and saves recognition resources. If it has, the voice recognition result corresponding to the audio name can be extracted directly from the database.
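The lookup by audio name in steps 210 and 211 can be sketched as follows. The application does not name a database or schema, so the in-memory SQLite table `results`, its columns, and the function names are all hypothetical; `recognize` again stands in for the call to the third-party interface.

```python
import sqlite3
from typing import Callable, Optional


def open_result_store() -> sqlite3.Connection:
    """Create an in-memory table mapping audio names to recognition results."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE results (audio_name TEXT PRIMARY KEY, text TEXT NOT NULL)"
    )
    return conn


def lookup_result(conn: sqlite3.Connection, audio_name: str) -> Optional[str]:
    """Return the stored result for audio_name, or None if none is stored."""
    row = conn.execute(
        "SELECT text FROM results WHERE audio_name = ?", (audio_name,)
    ).fetchone()
    return row[0] if row else None


def recognize_with_cache(
    conn: sqlite3.Connection,
    audio_name: str,
    audio: bytes,
    recognize: Callable[[bytes], str],
) -> str:
    """Reuse a stored result when the name is known; else recognize and store."""
    cached = lookup_result(conn, audio_name)
    if cached is not None:
        return cached
    text = recognize(audio)
    conn.execute(
        "INSERT INTO results (audio_name, text) VALUES (?, ?)", (audio_name, text)
    )
    conn.commit()
    return text
```

Keying on the audio name alone assumes that a name uniquely identifies the audio content; if names can be reused for different recordings, a content hash would be a safer key.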

The present application can address a series of problems encountered when developing speech-to-text functions on existing open voice platforms, such as complex configuration, unsupported formats, oversized files and various other limitations, thereby improving development efficiency and saving development cost.

In order to better implement the voice recognition method provided by the embodiment of the present application, the embodiment of the present application further provides a voice recognition device, where the voice recognition device may be specifically integrated in a network device, and the network device may be a server or a terminal. The terms used have the same meanings as in the voice recognition method above, and specific implementation details can be found in the description of the method embodiments.

Referring to fig. 3, fig. 3 is a schematic structural diagram of a speech recognition device according to an embodiment of the present application, where the speech recognition device includes an obtaining unit 301, a determining unit 302, a conversion unit 303, and a recognition unit 304, as follows:

an obtaining unit 301, configured to obtain an audio to be identified;

a determining unit 302, configured to determine whether the audio to be identified meets a preset specification through a conversion interface;

the conversion unit 303 is configured to convert the audio to be recognized into an audio meeting the preset specification through the conversion interface when the audio to be recognized does not meet the preset specification;

and the recognition unit 304 is configured to call a third party interface encapsulated with a speech recognition engine to perform speech recognition processing on the audio meeting the preset specification, so as to obtain a speech recognition result.

In some embodiments, the determining unit 302 is specifically configured to:

Referring to fig. 4, in some embodiments, the apparatus further includes:

a first encapsulating unit 305 for encapsulating an audio decompression method, an audio transcoding method and/or an audio slicing method in the conversion interface.

In some embodiments, the conversion unit 303 is specifically configured to:

In some embodiments, the obtaining unit 301 is specifically configured to:

In some embodiments, the apparatus further comprises:

a storage unit 306, configured to store the speech recognition result in the database.

In some embodiments, the apparatus further comprises:

a deployment unit 307, configured to perform localized deployment on the speech recognition engine;

a second packaging unit 308, configured to package the locally deployed speech recognition engine into the third-party interface;

a third packaging unit 309, configured to package the third party interface into the conversion interface.

In the embodiment of the application, the obtaining unit 301 obtains the audio to be identified; then, the determining unit 302 determines whether the audio to be recognized meets the preset specification through the conversion interface; if the audio does not meet the preset specification, the conversion unit 303 converts the audio to be recognized into the audio meeting the preset specification through the conversion interface; then, the recognition unit 304 calls a third-party interface encapsulated with a speech recognition engine to perform speech recognition processing on the audio meeting the preset specification, so as to obtain a speech recognition result. The voice recognition device in the embodiment of the application can convert the audio to be recognized which does not accord with the preset specification into the audio which accords with the preset specification through the conversion interface, so that the development cost of the voice recognition engine can be reduced.
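The cooperation of the units can be sketched as a small class. This is an illustrative composition only: the class name, field names and callable signatures are assumptions, and the audio passed to `run` is taken to be what the obtaining unit 301 has already acquired.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SpeechRecognitionDevice:
    """Hypothetical composition of units 302 to 304; unit 301 supplies `audio`."""

    meets_spec: Callable[[bytes], bool]  # determining unit 302
    convert: Callable[[bytes], bytes]    # conversion unit 303
    recognize: Callable[[bytes], str]    # recognition unit 304

    def run(self, audio: bytes) -> str:
        # Convert only when the audio fails the preset specification.
        if not self.meets_spec(audio):
            audio = self.convert(audio)
        return self.recognize(audio)
```

Because each unit is an injected callable, the specification check and the conversion can change without touching the code that calls the recognition engine.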

Referring to fig. 5, the present application provides a network device 500, which may include a processor 501 having one or more processing cores, a memory 502 comprising one or more computer-readable storage media, a Radio Frequency (RF) circuit 503, a power supply 504, an input unit 505, and a display unit 506. Those skilled in the art will appreciate that the network device architecture shown in fig. 5 does not constitute a limitation on the network device, which may include more or fewer components than shown, combine some components, or arrange the components differently. Wherein:

the processor 501 is a control center of the network device, connects various parts of the entire network device by using various interfaces and lines, and performs various functions of the network device and processes data by running or executing software programs and/or modules stored in the memory 502 and calling data stored in the memory 502, thereby performing overall monitoring of the network device. Optionally, processor 501 may include one or more processing cores; preferably, the processor 501 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 501.

The memory 502 may be used to store software programs and modules, and the processor 501 executes various functional applications and data processing by operating the software programs and modules stored in the memory 502.

The RF circuit 503 may be used for receiving and transmitting signals during the process of transmitting and receiving information.

The network device also includes a power supply 504 (e.g., a battery) for powering the various components. Preferably, the power supply may be logically coupled to the processor 501 via a power management system, so that charging, discharging, and power consumption are managed through the power management system.

The network device may further include an input unit 505, and the input unit 505 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.

The network device may also include a display unit 506, and the display unit 506 may be used to display information input by or provided to the user as well as various graphical user interfaces of the network device, which may be made up of graphics, text, icons, video, and any combination thereof.

Specifically, in this embodiment, the processor 501 in the network device loads the executable file corresponding to the process of one or more application programs into the memory 502 according to the following instructions, and the processor 501 runs the application program stored in the memory 502, so as to implement various functions as follows:

acquiring audio to be identified;

As can be seen from the above, in the embodiment of the present application, the speech recognition device acquires the audio to be recognized; then judges, through the conversion interface, whether the audio to be recognized meets the preset specification; if it does not, converts the audio to be recognized into audio that meets the preset specification through the conversion interface; and then calls a third-party interface packaged with a voice recognition engine to perform voice recognition processing on the audio meeting the preset specification to obtain a voice recognition result. Because the voice recognition device can convert audio that does not meet the preset specification into audio that does through the conversion interface, the development cost of the voice recognition engine can be reduced.

It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.

To this end, embodiments of the present application provide a storage medium, in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps in any one of the speech recognition processing methods provided by the embodiments of the present application. For example, the instructions may perform the steps of:

acquiring audio to be identified;

The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.

The storage medium may include: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic disks, optical disks, and the like.

Since the instructions stored in the storage medium can execute the steps in any of the speech recognition methods provided in the embodiments of the present application, the beneficial effects that can be achieved by any of the speech recognition methods provided in the embodiments of the present application can be achieved, which are detailed in the foregoing embodiments and will not be described again here.

The speech recognition method, apparatus, computer program product and storage medium provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the foregoing embodiments is intended only to help in understanding the method and core ideas of the present application. Those skilled in the art may, following the idea of the present application, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims

1. A speech recognition method, comprising:

acquiring audio to be identified;

2. The method according to claim 1, wherein the determining whether the audio to be recognized conforms to a preset specification through the conversion interface comprises:

3. The method of claim 2, wherein prior to obtaining the audio to be identified, the method further comprises:

4. The method according to claim 3, wherein if the audio to be recognized does not meet the predetermined specification, converting the audio to be recognized into the audio meeting the predetermined specification through the conversion interface comprises:

5. The method according to claim 1, wherein the determining whether the audio to be recognized conforms to a preset specification through the conversion interface comprises:

6. The method according to claim 5, wherein after the calling of the third-party interface encapsulated with the speech recognition engine performs speech recognition processing on the audio meeting the preset specification to obtain a speech recognition result, the method further comprises:

and storing the voice recognition result into the database.

7. The method according to any one of claims 1 to 6, wherein before the obtaining the audio to be identified, the method further comprises:

carrying out localized deployment on the voice recognition engine;

and packaging the third party interface into the conversion interface.

8. A speech recognition apparatus, comprising:

the acquisition unit is used for acquiring the audio to be identified;

9. A computer program product comprising instructions which, when run on a computer, cause the computer to carry out the steps in the speech recognition method according to any one of claims 1 to 7.

10. A storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the speech recognition method according to any one of claims 1 to 7.