CN112002312A - Voice recognition method, device, computer program product and storage medium - Google Patents

Info

Publication number
CN112002312A
Authority
CN
China
Prior art keywords
audio
identified
voice recognition
preset
recognized
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910381065.8A
Other languages
Chinese (zh)
Inventor
常伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SF Technology Co Ltd
SF Tech Co Ltd
Original Assignee
SF Technology Co Ltd
Application filed by SF Technology Co Ltd
Priority to CN201910381065.8A
Publication of CN112002312A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16 Vocoder architecture
    • G10L19/173 Transcoding, i.e. converting between two coded representations avoiding cascaded coding-decoding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiments of the application disclose a voice recognition method, a voice recognition device, a computer program product and a storage medium. The voice recognition device acquires audio to be recognized and judges, through a conversion interface, whether the audio to be recognized meets a preset specification. If it does not, the device converts the audio to be recognized, through the conversion interface, into audio that meets the preset specification. The device then calls a third-party interface packaged with a voice recognition engine to perform voice recognition processing on the audio meeting the preset specification and obtain a voice recognition result. Because the conversion interface can convert audio that does not meet the preset specification into audio that does, the development cost of the voice recognition engine can be reduced.

Description

Voice recognition method, device, computer program product and storage medium
Technical Field
The present application relates to the field of communications technologies, and in particular, to a speech recognition method, apparatus, computer program product, and storage medium.
Background
The speech recognition technology has achieved remarkable results, and is widely applied to a plurality of fields such as household appliances, communication, automotive electronics, medical treatment, home services, consumer electronics and the like. Speech recognition is the process of letting a machine translate audio into corresponding text or commands by recognition and understanding.
There are many open platforms for speech recognition in the market, and for the input audio to be recognized, the specification of each speech recognition engine needs to be strictly followed, for example, the audio to be recognized needs to use a specified audio format, the length of the audio to be recognized cannot exceed a preset size, and the like.
A traditional speech recognition device can only recognize speech that meets the specification of its speech recognition engine, and different speech recognition engines impose different limits on the audio to be recognized. Different types of speech recognition engines therefore need to be provided for different forms of audio, which makes the development cost of speech recognition engines high.
Disclosure of Invention
Embodiments of the present application provide a speech recognition method, apparatus, computer program product, and storage medium, which can convert an audio to be recognized into an audio meeting a preset specification, and reduce the development cost of a speech recognition engine.
In one aspect, the present application is directed to a method of speech recognition, the method comprising:
acquiring audio to be identified;
judging whether the audio to be identified meets a preset specification or not through a conversion interface;
if the audio does not meet the preset specification, converting the audio to be identified into audio that meets the preset specification through the conversion interface;
and calling a third-party interface packaged with a voice recognition engine to perform voice recognition processing on the audio meeting the preset specification to obtain a voice recognition result.
Optionally, in some embodiments, the determining, by the conversion interface, whether the audio to be recognized meets a preset specification includes:
judging whether the audio to be identified is non-compressed audio through the conversion interface; and/or,
judging whether the audio format of the audio to be identified is a preset audio format through the conversion interface; and/or,
judging whether the audio size of the audio to be identified is not larger than a preset audio size through the conversion interface.
Optionally, in some embodiments, before the obtaining the audio to be identified, the method further includes:
and encapsulating an audio decompression method, an audio transcoding method and/or an audio segmentation method in the conversion interface.
Optionally, in some embodiments, if the audio to be recognized does not meet the preset specification, converting the audio to be recognized into an audio meeting the preset specification through the conversion interface includes:
if the audio to be identified is not uncompressed audio, performing audio decompression processing on the audio to be identified by the audio decompression method to obtain the uncompressed audio to be identified; and/or,
if the audio format is not the preset audio format, carrying out audio transcoding processing on the audio to be identified through the audio transcoding method to obtain the audio to be identified in the preset audio format; and/or,
if the audio size is larger than the preset audio size, carrying out audio segmentation processing on the audio to be identified through the audio segmentation method to obtain a plurality of sub-audios to be identified, wherein each sub-audio to be identified is not larger than the preset audio size.
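The segmentation branch above can be illustrated with a minimal sketch. The function name `split_audio` is hypothetical, and slicing raw bytes is a simplifying assumption: the application does not say how the audio segmentation method chooses cut points, and a real implementation would likely cut on frame or silence boundaries.

```python
from typing import List


def split_audio(data: bytes, max_size: int) -> List[bytes]:
    """Split audio bytes into consecutive sub-audios of at most max_size bytes.

    Assumption: plain byte slicing stands in for the audio segmentation
    method, which the application does not specify in detail.
    """
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    if len(data) <= max_size:
        # Already within the preset audio size: no segmentation needed.
        return [data]
    return [data[i:i + max_size] for i in range(0, len(data), max_size)]
```

Concatenating the returned sub-audios in order reproduces the original bytes, so no audio content is dropped by the split.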
Optionally, in some embodiments, the determining, by the conversion interface, whether the audio to be recognized meets a preset specification includes:
judging whether the audio to be identified is historical recognition audio, that is, audio that has already been recognized, according to the audio name of the audio to be identified;
if the audio is historical recognition audio, extracting a voice recognition result corresponding to the audio name from a database;
if the audio is not historical recognition audio, judging whether the audio to be identified meets the preset specification through the conversion interface.
Optionally, in some embodiments, after the calling a third party interface encapsulated with a speech recognition engine to perform speech recognition processing on the audio meeting the preset specification to obtain a speech recognition result, the method further includes:
and storing the voice recognition result into the database.
Optionally, in some embodiments, before the obtaining the audio to be identified, the method further includes:
carrying out localized deployment on the voice recognition engine;
packaging the locally deployed speech recognition engine into the third-party interface;
and packaging the third party interface into the conversion interface.
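The three packaging steps above amount to layered wrapping, which might be sketched as follows. All names (`local_engine`, `ThirdPartyInterface`, `ConversionInterface`) are hypothetical, and the engine is a placeholder that returns a dummy string rather than a real transcript.

```python
from typing import Callable


def local_engine(audio: bytes) -> str:
    """Hypothetical stand-in for a locally deployed speech recognition engine."""
    return f"<transcript of {len(audio)} bytes>"


class ThirdPartyInterface:
    """Wraps the engine, in the role of the SDK or WebApi layer."""

    def __init__(self, engine: Callable[[bytes], str]) -> None:
        self._engine = engine

    def recognize(self, audio: bytes) -> str:
        return self._engine(audio)


class ConversionInterface:
    """Wraps the third-party interface; callers only ever talk to this layer."""

    def __init__(self, third_party: ThirdPartyInterface) -> None:
        self._third_party = third_party

    def handle(self, audio: bytes) -> str:
        # Specification checks and conversion would run here before recognition.
        return self._third_party.recognize(audio)
```

With this layering, swapping the engine only requires constructing a different `ThirdPartyInterface`; callers of `ConversionInterface.handle` are unaffected.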
Correspondingly, the present application further provides a speech recognition apparatus, specifically including:
the acquisition unit is used for acquiring the audio to be identified;
the judging unit is used for judging whether the audio to be identified meets the preset specification through a conversion interface;
the conversion unit is used for converting the audio to be recognized into the audio meeting the preset specification through the conversion interface when the audio to be recognized does not meet the preset specification;
and the recognition unit is used for calling a third-party interface packaged with a voice recognition engine to perform voice recognition processing on the audio meeting the preset specification to obtain a voice recognition result.
Optionally, in some embodiments, the determining unit is specifically configured to:
judging whether the audio to be identified is non-compressed audio through the conversion interface; and/or,
judging whether the audio format of the audio to be identified is a preset audio format through the conversion interface; and/or,
and judging whether the audio size of the audio to be identified is not larger than a preset audio size through the conversion interface.
Optionally, in some embodiments, the apparatus further comprises:
and the first packaging unit is used for packaging the audio decompression method, the audio transcoding method and/or the audio segmentation method in the conversion interface.
Optionally, in some embodiments, the conversion unit is specifically configured to:
if the audio to be identified is not uncompressed audio, performing audio decompression processing on the audio to be identified by the audio decompression method to obtain the uncompressed audio to be identified; and/or,
if the audio format is not the preset audio format, carrying out audio transcoding processing on the audio to be identified through the audio transcoding method to obtain the audio to be identified in the preset audio format; and/or,
if the audio size is larger than the preset audio size, carrying out audio segmentation processing on the audio to be identified through the audio segmentation method to obtain a plurality of sub-audios to be identified, wherein each sub-audio to be identified is not larger than the preset audio size.
Optionally, in some embodiments, the obtaining unit is specifically configured to:
judging whether the audio to be identified is historical recognition audio, that is, audio that has already been recognized, according to the audio name of the audio to be identified;
if the audio is historical recognition audio, extracting a voice recognition result corresponding to the audio name from a database;
if the audio is not historical recognition audio, judging whether the audio to be identified meets the preset specification through the conversion interface.
Optionally, in some embodiments, the apparatus further comprises:
and the storage unit is used for storing the voice recognition result into the database.
Optionally, in some embodiments, the apparatus further comprises:
the deployment unit is used for carrying out localized deployment on the voice recognition engine;
the second packaging unit is used for packaging the speech recognition engine after localized deployment into the third-party interface;
and the third packaging unit is used for packaging the third party interface into the conversion interface.
Yet another aspect of the present application provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the method of the above-described aspects.
In addition, a storage medium is provided, where a plurality of instructions are stored, and the instructions are suitable for being loaded by a processor to perform the steps in any one of the speech recognition methods provided in the embodiments of the present application.
In the embodiments of the application, a voice recognition device acquires audio to be recognized and judges, through a conversion interface, whether the audio to be recognized meets a preset specification. If it does not, the device converts the audio to be recognized, through the conversion interface, into audio that meets the preset specification. The device then calls a third-party interface packaged with a voice recognition engine to perform voice recognition processing on the audio meeting the preset specification and obtain a voice recognition result. Because the conversion interface can convert audio that does not meet the preset specification into audio that does, the development cost of the voice recognition engine can be reduced.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of a speech recognition method provided in an embodiment of the present application;
Fig. 2 is a schematic flow chart of another speech recognition method provided in the embodiments of the present application;
Fig. 3 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a network device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the description that follows, specific embodiments of the present application are described with reference to steps and symbolic representations of operations performed by one or more computers, unless otherwise indicated. These steps and operations, which are at times referred to as being computer-executed, include the manipulation by the computer's processing unit of electronic signals representing data in a structured form. This manipulation transforms the data or maintains it at locations in the computer's memory system, which reconfigures or otherwise alters the operation of the computer in a manner well known to those skilled in the art. The data structures in which the data is maintained are physical locations of the memory that have particular properties defined by the format of the data. However, while the principles of the application are described in the foregoing context, this is not meant to be limiting, and those of ordinary skill in the art will recognize that various of the steps and operations described below may also be implemented in hardware.
The principles of the present application may be employed in numerous other general-purpose or special-purpose computing, communication environments or configurations. Examples of well known computing systems, environments, and configurations that may be suitable for use with the application include, but are not limited to, hand-held telephones, personal computers, servers, multiprocessor systems, microcomputer-based systems, mainframe-based computers, and distributed computing environments that include any of the above systems or devices.
The terms "first", "second", and "third", etc. in this application are used to distinguish between different objects and not to describe a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions.
The embodiment of the application provides a voice recognition method, a voice recognition device, a computer program product and a storage medium.
The speech recognition apparatus in the present application may be integrated in a network device, where the network device may be a server, or may be a terminal, and the terminal may include a mobile phone, a tablet Computer, a notebook Computer, and/or a Personal Computer (PC), etc.
Referring to fig. 1, fig. 1 is a schematic flow chart of a speech recognition method according to an embodiment of the present application. The method comprises the following specific processes:
101. Acquire the audio to be recognized.
The audio to be identified comprises parameter information such as audio information, compression information, format information, name information and the like of the audio to be identified.
In some embodiments, the caller transmits the audio to be recognized to the conversion interface. The conversion interface may be an http interface; it is open to external callers and exposes parameters, such as the audio to be recognized, for the caller to pass in.
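As an illustration of the parameter information a caller might pass to such an http interface, the sketch below parses a JSON request body into a typed record. The JSON field names (`name`, `format`, `compressed`, `audio_base64`) and the `AudioRequest` type are assumptions made for this example; the application does not define a wire format.

```python
import base64
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioRequest:
    """Audio to be recognized plus its name, format and compression information."""
    name: str
    audio_format: str
    compressed: bool
    data: bytes


def parse_request(body: str) -> AudioRequest:
    """Parse a JSON body (hypothetical field names) into an AudioRequest."""
    payload = json.loads(body)
    return AudioRequest(
        name=payload["name"],
        audio_format=payload["format"].lower(),
        compressed=bool(payload["compressed"]),
        # Binary audio is assumed to travel base64-encoded inside the JSON body.
        data=base64.b64decode(payload["audio_base64"]),
    )
```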
In some embodiments, before obtaining the audio to be identified, the method further includes encapsulating, in the conversion interface, an audio decompression method, an audio transcoding method, and/or an audio slicing method.
Besides, methods such as an audio storage method and/or a text recording method may be packaged in the conversion interface.
In some embodiments, prior to obtaining the audio to be identified, the method further comprises: carrying out localized deployment on a voice recognition engine; and packaging the third party interface packaged with the voice recognition engine into a conversion interface.
In some embodiments, before packaging the third-party interface into the conversion interface, packaging the locally deployed speech recognition engine into the third-party interface is further included.
Specifically, the third-party Interface may be an Interface such as a Software Development Kit (SDK) or a Web Application Programming Interface (WebApi) that can encapsulate a speech recognition engine.
102. Judge whether the audio to be recognized meets the preset specification through the conversion interface.
The preset specification corresponds to the specification required by the speech recognition engine corresponding to the speech recognition device. For example, if the specification required by the speech recognition engine is: uncompressed audio, MP3 (Moving Picture Experts Group Audio Layer III) format, audio size of 10000000 bytes; then the preset specification in the conversion interface is also: uncompressed audio, MP3 format, audio size of 10000000 bytes.
Specifically, whether the audio to be recognized meets the preset specification or not is judged through the conversion interface, and the method comprises the following steps:
judging whether the audio to be identified is non-compressed audio through the conversion interface; and/or,
judging whether the audio format of the audio to be identified is a preset audio format through the conversion interface; and/or,
and judging whether the audio size of the audio to be identified is not larger than the preset audio size through the conversion interface.
The judgment sequence for judging whether the audio to be identified is non-compressed audio, whether the audio is in a preset audio format and whether the audio is not larger than the preset audio size is not limited, and the three judgment steps can be carried out simultaneously or sequentially.
The number of checks depends on the preset specification of the conversion interface. If the preset specification sets only an uncompressed-audio requirement, only the check on whether the audio to be recognized is uncompressed audio is needed. If both an uncompressed-audio requirement and an audio format requirement are set, it is necessary to check both whether the audio to be recognized is uncompressed audio and whether it is in the preset audio format. The number and type of preset specifications are not limited herein.
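The configurable set of checks can be sketched as follows. `PresetSpec`, `Audio` and `failed_checks` are hypothetical names; a field left unset in `PresetSpec` means the corresponding check is skipped, mirroring the statement that the number and type of checks follow the configured specification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PresetSpec:
    """Preset specification; unset fields disable the matching check."""
    require_uncompressed: bool = False
    audio_format: Optional[str] = None
    max_size_bytes: Optional[int] = None


@dataclass(frozen=True)
class Audio:
    data: bytes
    audio_format: str
    compressed: bool


def failed_checks(audio: Audio, spec: PresetSpec) -> List[str]:
    """Return the names of the checks the audio fails; empty means it conforms."""
    failures = []
    if spec.require_uncompressed and audio.compressed:
        failures.append("compression")
    if spec.audio_format is not None and audio.audio_format != spec.audio_format:
        failures.append("format")
    if spec.max_size_bytes is not None and len(audio.data) > spec.max_size_bytes:
        failures.append("size")
    return failures
```

The three checks are independent of one another, so they could equally be evaluated in any order or concurrently, as the text allows.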
In some embodiments, the determining whether the audio to be recognized meets the preset specification through the conversion interface includes:
judging whether the audio to be identified is historical recognition audio, that is, audio that has already been recognized, according to the audio name of the audio to be identified;
if the audio is historical recognition audio, extracting a voice recognition result corresponding to the audio name from a database;
if the audio is not historical recognition audio, judging whether the audio to be identified meets the preset specification through the conversion interface.
In some embodiments, after the speech recognition device receives the audio to be recognized, it is further required to receive a speech conversion instruction, and then the speech recognition device determines whether the audio to be recognized meets a preset specification through a conversion interface according to the speech conversion instruction.
103. If the audio does not meet the preset specification, convert the audio to be recognized into audio that meets the preset specification through the conversion interface.
The conversion interface is packaged with an audio decompression method, an audio transcoding method and/or an audio segmentation method. Specifically, converting the audio to be recognized into audio that meets the preset specification through the conversion interface includes:
if the audio to be identified is not uncompressed audio, performing audio decompression processing on the audio to be identified by an audio decompression method to obtain the uncompressed audio to be identified; and/or,
if the audio format is not the preset audio format, performing audio transcoding processing on the audio to be identified through an audio transcoding method to obtain the audio to be identified in the preset audio format; and/or,
and if the audio size is larger than the preset audio size, performing audio segmentation processing on the audio to be identified by an audio segmentation method to obtain a plurality of sub-audios to be identified, wherein the sub-audios are not larger than the preset audio size.
If the audio to be identified is already uncompressed audio, no audio decompression processing is required. Similarly, if the audio format of the audio to be identified is the preset audio format, no audio transcoding processing is required, and if the size of the audio to be identified is not larger than the preset audio size, no segmentation processing is required.
In some embodiments, if audio decompression, audio transcoding and audio segmentation all need to be performed on the audio to be identified, the audio is first decompressed, the decompressed audio is then transcoded, and finally the decompressed and transcoded audio is segmented.
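The fixed order of decompression, then transcoding, then segmentation might be sketched as below. The `decompress` and `transcode` callables are injected because the application names no concrete codec; `Audio` and `convert` are hypothetical names, and byte slicing again stands in for the segmentation method.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Audio:
    data: bytes
    audio_format: str
    compressed: bool


def convert(
    audio: Audio,
    target_format: str,
    max_size: int,
    decompress: Callable[[bytes], bytes],
    transcode: Callable[[bytes, str, str], bytes],
) -> List[Audio]:
    """Apply only the needed steps, in order: decompress, transcode, segment."""
    if audio.compressed:
        audio = replace(audio, data=decompress(audio.data), compressed=False)
    if audio.audio_format != target_format:
        new_data = transcode(audio.data, audio.audio_format, target_format)
        audio = replace(audio, data=new_data, audio_format=target_format)
    if len(audio.data) <= max_size:
        return [audio]
    # Segment last, so the size limit applies to the final decoded, transcoded bytes.
    return [
        replace(audio, data=audio.data[i:i + max_size])
        for i in range(0, len(audio.data), max_size)
    ]
```

Segmenting last matters because decompression and transcoding can both change the byte size, so a size check made earlier would not hold for the audio actually sent to the engine.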
If the audio to be recognized is the audio meeting the preset specification, the conversion interface does not need to process the audio to be recognized at the moment, and directly transmits the audio to be recognized to the third-party interface, so that the voice recognition engine packaged on the third-party interface performs voice recognition processing on the audio information in the audio to be recognized.
104. Call a third-party interface packaged with a voice recognition engine to perform voice recognition processing on the audio meeting the preset specification to obtain a voice recognition result.
After the audio meeting the preset specification is obtained through the conversion interface, the voice recognition device calls a third-party interface packaged with a voice recognition engine to perform voice recognition processing on the audio meeting the preset specification to obtain a voice recognition result, wherein the voice recognition result is text information corresponding to the audio to be recognized.
In some embodiments, the speech recognition engine is stored locally, that is, the speech recognition engine is stored in a server corresponding to the speech recognition device, for example, if the chat software on the mobile phone performs speech recognition (speech to text) on speech sent by the user to the chat software, the speech needs to be sent to the corresponding server, and then the speech recognition engine deployed in the server performs speech recognition on the speech to obtain text information corresponding to the speech.
In some embodiments, after the voice recognition result is obtained from the voice recognition engine, the voice recognition result is returned to the display interface of the voice recognition device.
In some embodiments, after the voice recognition result is obtained from the voice recognition engine, the voice recognition result is also stored in the database; in addition, the audio to be recognized can also be stored in the database.
In the embodiments of the application, a voice recognition device acquires audio to be recognized and judges, through a conversion interface, whether the audio to be recognized meets a preset specification. If it does not, the device converts the audio to be recognized, through the conversion interface, into audio that meets the preset specification. The device then calls a third-party interface packaged with a voice recognition engine to perform voice recognition processing on the audio meeting the preset specification and obtain a voice recognition result. Because the conversion interface can convert audio that does not meet the preset specification into audio that does, the development cost of the voice recognition engine can be reduced.
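Steps 101 to 104 can be tied together in a short sketch. The function `recognize` and its three callable parameters are hypothetical. Joining the per-piece transcripts in order is an assumption made here, since the application does not state how results for several sub-audios are combined.

```python
from typing import Callable, List


def recognize(
    audio: bytes,
    meets_spec: Callable[[bytes], bool],
    convert: Callable[[bytes], List[bytes]],
    third_party_recognize: Callable[[bytes], str],
) -> str:
    """Check the audio, convert it only if needed, then recognize each piece."""
    # Conforming audio is passed through to the third-party interface untouched.
    pieces = [audio] if meets_spec(audio) else convert(audio)
    # Assumption: transcripts of the sub-audios are concatenated in order.
    return "".join(third_party_recognize(piece) for piece in pieces)
```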
Referring to fig. 2, fig. 2 is another schematic flow chart of a speech recognition method according to an embodiment of the present application, which is described in this embodiment by taking an execution subject as a speech recognition device, a conversion interface as an http interface, and a third-party interface as a WebApi, where a specific flow of the method may be as follows:
201. The voice recognition device acquires audio to be recognized.
The audio to be identified comprises parameter information such as audio information, compression information, format information, name information and the like of the audio to be identified.
In some embodiments, the caller transmits the audio to be recognized to an http interface in the speech recognition device. The http interface is open to external callers and exposes parameters, such as the audio to be recognized, for the caller to pass in.
In some embodiments, before obtaining the audio to be identified, the method further includes encapsulating an audio decompression method, an audio transcoding method, and/or an audio slicing method in the http interface.
Besides, methods such as an audio storage method and/or a text recording method can be packaged in the http interface, wherein the audio storage method is used for storing the audio to be recognized, and the text recording method is used for storing the recognized voice recognition result into a local database or other related databases.
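A minimal sketch of the audio storage method and the text recording method follows, using Python's built-in `sqlite3` module as the local database. The table layout and the function names `open_store`, `store_audio` and `record_text` are assumptions; the application does not specify a schema or a database product.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database; an in-memory database is used by default."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recognized ("
        "name TEXT PRIMARY KEY, audio BLOB, transcript TEXT)"
    )
    return conn


def store_audio(conn: sqlite3.Connection, name: str, audio: bytes) -> None:
    """Audio storage method: keep the audio to be recognized under its name."""
    conn.execute("INSERT OR IGNORE INTO recognized (name) VALUES (?)", (name,))
    conn.execute("UPDATE recognized SET audio = ? WHERE name = ?", (audio, name))
    conn.commit()


def record_text(conn: sqlite3.Connection, name: str, transcript: str) -> None:
    """Text recording method: keep the recognition result under the audio name."""
    conn.execute("INSERT OR IGNORE INTO recognized (name) VALUES (?)", (name,))
    conn.execute(
        "UPDATE recognized SET transcript = ? WHERE name = ?", (transcript, name)
    )
    conn.commit()
```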
In some embodiments, prior to obtaining the audio to be identified, the method further comprises: carrying out localized deployment on a voice recognition engine; and packaging the WebApi packaged with the voice recognition engine into an http interface.
In some embodiments, before encapsulating the WebApi into the http interface, encapsulating the locally deployed speech recognition engine into the WebApi is further included.
202. The voice recognition device judges whether the audio to be recognized is a history recognition audio according to the audio name of the audio to be recognized, if not, step 203 is executed, and if so, step 211 is executed.
After the speech recognition device receives the audio to be recognized, in order to save resources, it may first determine whether the audio to be recognized is an already recognized audio according to the audio name of the audio to be recognized, and specifically, the speech recognition device stores the audio name of the already recognized audio and a speech recognition result corresponding to the already recognized audio.
If the audio to be recognized is the audio that has already been recognized, the voice recognition result corresponding to the audio name is extracted from the database without recognizing the given voice, and if the audio to be recognized is not the audio that has been subjected to the voice recognition, the following steps are executed.
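The name-based history check in step 202 can be illustrated with a small sketch. This is not the implementation from the application: the use of SQLite, the table name `history`, and the function names `open_history`, `lookup_result`, and `store_result` are all assumptions made for illustration.

```python
import sqlite3
from typing import Optional


def open_history(path: str = ":memory:") -> sqlite3.Connection:
    """Create (if needed) the hypothetical table mapping audio names to results."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS history ("
        "audio_name TEXT PRIMARY KEY, result TEXT NOT NULL)"
    )
    return conn


def lookup_result(conn: sqlite3.Connection, audio_name: str) -> Optional[str]:
    """Return the stored result for audio_name, or None if it was never recognized."""
    row = conn.execute(
        "SELECT result FROM history WHERE audio_name = ?", (audio_name,)
    ).fetchone()
    return row[0] if row else None


def store_result(conn: sqlite3.Connection, audio_name: str, result: str) -> None:
    """Record a result so later requests with the same name can skip recognition."""
    conn.execute(
        "INSERT OR REPLACE INTO history (audio_name, result) VALUES (?, ?)",
        (audio_name, result),
    )
    conn.commit()
```

A caller would try `lookup_result` first and run the remaining steps only when it returns `None`. Keying on the audio name alone, as the text describes, means two different recordings that share a name would also share a stored result.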
203. The speech recognition device judges through the http interface whether the audio to be recognized is uncompressed audio. If not, step 204 is executed; if so, step 205 is executed.
Because the specification that the speech recognition engine behind the speech recognition device imposes on the audio to be recognized requires uncompressed audio, the device judges through the http interface, after receiving the audio, whether it is uncompressed. If so, the next judging step can be executed; if not, the audio to be recognized must first be decompressed.
204. The speech recognition device decompresses the audio to be recognized using the audio decompression method to obtain uncompressed audio to be recognized.
Because the audio decompression method is encapsulated in the http interface, when the audio to be recognized is judged to be compressed audio, it can be decompressed by that method to obtain uncompressed audio to be recognized.
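Steps 203 and 204 can be sketched as follows. The text does not say which compression scheme is meant, so this sketch assumes, purely for illustration, that "compressed audio" means a gzip-wrapped file; `is_compressed` and `ensure_uncompressed` are hypothetical names.

```python
import gzip

# Every gzip stream starts with these two magic bytes.
GZIP_MAGIC = b"\x1f\x8b"


def is_compressed(data: bytes) -> bool:
    """Step 203 (assumed form): treat gzip-wrapped data as compressed audio."""
    return data[:2] == GZIP_MAGIC


def ensure_uncompressed(data: bytes) -> bytes:
    """Step 204: decompress only when the check says the audio is compressed."""
    return gzip.decompress(data) if is_compressed(data) else data
```

Audio that is already uncompressed passes through unchanged, so the function can be applied unconditionally before the format check.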
205. The speech recognition device judges through the http interface whether the audio format of the audio to be recognized is the preset audio format. If not, step 206 is executed; if so, step 207 is executed.
Because the specification imposed by the speech recognition engine requires a preset audio format, for example MP3, the device judges through the http interface whether the audio format of the audio to be recognized is the preset audio format, either directly after receiving the audio or after decompressing it.
Specifically, after the audio to be recognized has been decompressed (if decompression was required) and uncompressed audio has been obtained, the speech recognition device judges through the http interface whether the audio format of that audio is the preset audio format. If so, the next judging step can be executed; if not, the audio to be recognized (the decompressed audio, if decompression was performed) must also be transcoded.
206. The speech recognition device transcodes the audio to be recognized using the audio transcoding method to obtain audio to be recognized in the preset audio format.
Because the audio transcoding method is encapsulated in the http interface, when the audio format of the audio to be recognized is judged not to be the preset audio format (for example, not MP3), the audio is transcoded by that method to obtain audio to be recognized in the preset audio format. If the audio was decompressed earlier, it is the decompressed audio that is transcoded.
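A minimal sketch of steps 205 and 206 follows. The application does not name a detection or transcoding technique. Checking leading magic bytes and delegating to the `ffmpeg` command-line tool are assumptions, as are the names `detect_format` and `transcode_command`. The command is only built, not executed.

```python
from pathlib import Path
from typing import List

PRESET_FORMAT = "mp3"  # the text's own example of a preset audio format


def detect_format(data: bytes) -> str:
    """Guess the format from leading magic bytes; 'unknown' if unrecognized."""
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    # MP3 files start with an ID3 tag or an 11-bit frame sync (0xFFE...).
    if data[:3] == b"ID3" or (
        len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0
    ):
        return "mp3"
    if data[:4] == b"fLaC":
        return "flac"
    if data[:4] == b"OggS":
        return "ogg"
    return "unknown"


def transcode_command(src: Path, preset: str = PRESET_FORMAT) -> List[str]:
    """Build (but do not run) an ffmpeg command writing src in the preset format."""
    dst = src.with_suffix("." + preset)
    # -y overwrites an existing output file; -i names the input file.
    return ["ffmpeg", "-y", "-i", str(src), str(dst)]
```

In this sketch the caller would run the returned command (for example with `subprocess.run`) only when `detect_format` reports something other than the preset format.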
207. The speech recognition device judges through the http interface whether the audio size of the audio to be recognized is no larger than the preset audio size. If not, step 208 is executed; if so, step 209 is executed.
Because the specification imposed by the speech recognition engine requires that the audio size not exceed the preset audio size, the device judges through the http interface whether the audio size of the audio to be recognized is no larger than the preset audio size, either directly after receiving the audio or after decompressing and/or transcoding it.
208. The speech recognition device segments the audio to be recognized using the audio segmentation method to obtain a plurality of sub-audios to be recognized, none of which is larger than the preset audio size.
Because the audio segmentation method is encapsulated in the http interface, when the size (in bytes) of the audio to be recognized is judged to be larger than the preset audio size, the audio is segmented by that method into a plurality of sub-audios to be recognized, none larger than the preset audio size. The sub-audios are then sent in sequence to the third-party interface that encapsulates the speech recognition engine, so that the engine performs speech recognition on each of them.
If the audio to be recognized was decompressed and/or transcoded earlier, it is the decompressed and/or transcoded audio that is segmented.
The specific code for segmenting the audio to be recognized is given in the original publication as two figures (Figure BDA0002053387900000121 and Figure BDA0002053387900000131).
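As an illustration of the segmentation in step 208, and not a reconstruction of the listing in the original figures, the following sketch splits a PCM WAV file into smaller WAV files that each stay within a byte limit. The choice of WAV input, the function name `split_wav`, and the 44-byte header constant (the size of the header that Python's `wave` module writes) are assumptions.

```python
import io
import wave
from typing import List

WAV_HEADER_BYTES = 44  # header size written by the wave module for PCM data


def split_wav(data: bytes, max_bytes: int) -> List[bytes]:
    """Split a PCM WAV file into WAV files of at most max_bytes each."""
    with wave.open(io.BytesIO(data), "rb") as src:
        params = src.getparams()
        frame_size = params.nchannels * params.sampwidth
        # Cut on whole frames so that no sample is split between chunks.
        frames_per_chunk = (max_bytes - WAV_HEADER_BYTES) // frame_size
        if frames_per_chunk < 1:
            raise ValueError("max_bytes is too small to hold a single frame")
        chunks: List[bytes] = []
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            buf = io.BytesIO()
            with wave.open(buf, "wb") as dst:
                dst.setnchannels(params.nchannels)
                dst.setsampwidth(params.sampwidth)
                dst.setframerate(params.framerate)
                dst.writeframes(frames)
            chunks.append(buf.getvalue())
    return chunks
```

Each returned chunk is a complete WAV file with its own header, so it can be sent to the recognition interface on its own. Cutting a compressed format such as MP3 at arbitrary byte offsets would break its frames, which is why this sketch works on PCM frames instead.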
209. The speech recognition device calls the WebApi that encapsulates the speech recognition engine to perform speech recognition on the audio meeting the preset specification and obtain a speech recognition result.
After the speech recognition device has converted the audio to be recognized into audio meeting the preset specification through the http interface, it calls the WebApi that encapsulates the speech recognition engine to perform speech recognition on that audio and obtain a speech recognition result.
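The sequential submission of sub-audios (step 208) followed by the WebApi call (step 209) can be sketched as below. The WebApi's actual request format is not given in the text, so the engine call is represented by a caller-supplied function `recognize`. That parameter, the name `recognize_all`, and the idea of joining partial transcripts with a separator are assumptions.

```python
from typing import Callable, Iterable, List


def recognize_all(
    sub_audios: Iterable[bytes],
    recognize: Callable[[bytes], str],
    separator: str = "",
) -> str:
    """Send each sub-audio to the engine in order and join the partial results."""
    parts: List[str] = []
    for chunk in sub_audios:
        # One call per sub-audio, in the original order, so that the joined
        # transcript follows the order of the source audio.
        parts.append(recognize(chunk))
    return separator.join(parts)
```

Audio that needed no segmentation is simply passed as a one-element list. Injecting `recognize` keeps the sketch independent of any particular engine and makes it easy to substitute a stub in tests.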
Specifically, in some embodiments the speech recognition engine is deployed locally, that is, on a server corresponding to the speech recognition device. For example, when chat software on a mobile phone performs speech recognition (speech to text) on a voice message sent by a user, the voice message is sent to the corresponding server, and the speech recognition engine deployed on that server recognizes it to obtain the corresponding text.
In some embodiments, after the speech recognition result is obtained from the speech recognition engine, the result is returned to the display interface of the speech recognition device.
210. The speech recognition device stores the speech recognition result in a database.
In some embodiments, after the speech recognition result is obtained from the speech recognition engine, the result is also stored in the database; the audio to be recognized may be stored in the database as well.
211. The speech recognition device extracts the speech recognition result corresponding to the audio name from the database.
Because the database stores the speech recognition result of each historically recognized audio together with its audio name, the speech recognition device can, after receiving the audio to be recognized, judge from the audio name whether it is historically recognized audio, which improves recognition speed and saves recognition resources. If it is, the speech recognition result corresponding to the audio name is extracted directly from the database.
The present application can address a series of problems encountered when developing speech transcription functions on the speech open platforms currently on the market, such as complex configuration, unsupported formats, oversized files, and various other limitations, thereby improving development efficiency and saving development cost.
In this embodiment of the application, the speech recognition device acquires the audio to be recognized and then judges, through a conversion interface, whether the audio to be recognized meets a preset specification. If it does not, the device converts the audio to be recognized into audio that meets the preset specification through the conversion interface, and then calls a third-party interface that encapsulates a speech recognition engine to perform speech recognition on the conforming audio and obtain a speech recognition result. Because the conversion interface converts non-conforming audio into conforming audio, the development cost of the speech recognition engine can be reduced.
To better implement the speech recognition method provided by the embodiments of the present application, an embodiment of the present application further provides a speech recognition apparatus, which may be integrated in a network device such as a server or a terminal. The terms used have the same meanings as in the speech recognition method above, and implementation details can be found in the description of the method embodiments.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application. The speech recognition apparatus includes an obtaining unit 301, a determining unit 302, a conversion unit 303, and a recognition unit 304, as follows:
an obtaining unit 301, configured to obtain an audio to be identified;
a determining unit 302, configured to determine whether the audio to be identified meets a preset specification through a conversion interface;
the conversion unit 303 is configured to convert the audio to be recognized into an audio meeting the preset specification through the conversion interface when the audio to be recognized does not meet the preset specification;
and the recognition unit 304 is configured to call a third party interface encapsulated with a speech recognition engine to perform speech recognition processing on the audio meeting the preset specification, so as to obtain a speech recognition result.
In some embodiments, the determining unit 302 is specifically configured to:
judging whether the audio to be identified is uncompressed audio through the conversion interface; and/or
judging whether the audio format of the audio to be identified is a preset audio format through the conversion interface; and/or
judging whether the audio size of the audio to be identified is not larger than a preset audio size through the conversion interface.
Referring to fig. 4, in some embodiments, the apparatus further includes:
a first encapsulating unit 305, configured to encapsulate an audio decompression method, an audio transcoding method and/or an audio segmentation method in the conversion interface.
In some embodiments, the conversion unit 303 is specifically configured to:
if the audio to be identified is not uncompressed audio, performing audio decompression processing on the audio to be identified by the audio decompression method to obtain uncompressed audio to be identified; and/or
if the audio format is not the preset audio format, performing audio transcoding processing on the audio to be identified by the audio transcoding method to obtain audio to be identified in the preset audio format; and/or
if the audio size is larger than the preset audio size, performing audio segmentation processing on the audio to be identified by the audio segmentation method to obtain a plurality of sub-audios to be identified, none of which is larger than the preset audio size.
In some embodiments, the obtaining unit 301 is specifically configured to:
judging whether the audio to be identified is historically recognized audio according to the audio name of the audio to be identified;
if the audio is historically recognized audio, extracting the speech recognition result corresponding to the audio name from a database;
if the audio is not historically recognized audio, judging whether the audio to be identified meets the preset specification through the conversion interface.
In some embodiments, the apparatus further comprises:
a storage unit 306, configured to store the speech recognition result in the database.
In some embodiments, the apparatus further comprises:
a deployment unit 307, configured to perform localized deployment on the speech recognition engine;
a second packaging unit 308, configured to package the locally deployed speech recognition engine into the third-party interface;
a third packaging unit 309, configured to package the third party interface into the conversion interface.
In this embodiment of the application, the obtaining unit 301 obtains the audio to be recognized; the determining unit 302 then determines, through the conversion interface, whether the audio to be recognized meets the preset specification; if it does not, the conversion unit 303 converts the audio to be recognized into audio meeting the preset specification through the conversion interface; finally, the recognition unit 304 calls a third-party interface that encapsulates a speech recognition engine to perform speech recognition on the conforming audio and obtain a speech recognition result. Because the conversion interface converts non-conforming audio into conforming audio, the development cost of the speech recognition engine can be reduced.
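How the determining, conversion, and recognition units cooperate can be sketched as a small pipeline. This is one possible arrangement under stated assumptions, not the structure of the apparatus itself: the `Check` dataclass, the `run_pipeline` function, and the injected `recognize` callable are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    """One rule of the preset specification, with the conversion used on failure."""
    conforms: Callable[[bytes], bool]
    convert: Callable[[bytes], bytes]


def run_pipeline(
    audio: bytes,
    checks: List[Check],
    recognize: Callable[[bytes], str],
) -> str:
    """Apply each check in order, convert where needed, then recognize."""
    for check in checks:
        if not check.conforms(audio):      # role of the determining unit
            audio = check.convert(audio)   # role of the conversion unit
    return recognize(audio)                # role of the recognition unit
```

Ordering the checks as decompression, then format, then size mirrors the order of the method steps. Segmentation yields several sub-audios rather than one, so a fuller version would need a list-valued final stage; this sketch keeps every conversion as bytes in, bytes out for brevity.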
Referring to fig. 5, the present application provides a network device 500, which may include a processor 501 having one or more processing cores, a memory 502 comprising one or more computer-readable storage media, a Radio Frequency (RF) circuit 503, a power supply 504, an input unit 505, and a display unit 506. Those skilled in the art will appreciate that the network device architecture shown in fig. 5 does not limit the network device, which may include more or fewer components than shown, combine some components, or arrange the components differently. Wherein:
the processor 501 is a control center of the network device, connects various parts of the entire network device by using various interfaces and lines, and performs various functions of the network device and processes data by running or executing software programs and/or modules stored in the memory 502 and calling data stored in the memory 502, thereby performing overall monitoring of the network device. Optionally, processor 501 may include one or more processing cores; preferably, the processor 501 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 501.
The memory 502 may be used to store software programs and modules, and the processor 501 executes various functional applications and data processing by operating the software programs and modules stored in the memory 502.
The RF circuit 503 may be used for receiving and transmitting signals during the process of transmitting and receiving information.
The network device also includes a power supply 504 (e.g., a battery) for powering the various components. Preferably, the power supply may be logically coupled to the processor 501 via a power management system, so that charging, discharging, and power consumption are managed through the power management system.
The network device may further include an input unit 505, and the input unit 505 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
The network device may also include a display unit 506, and the display unit 506 may be used to display information input by or provided to the user as well as various graphical user interfaces of the network device, which may be made up of graphics, text, icons, video, and any combination thereof. Specifically, in this embodiment, the processor 501 in the network device loads the executable file corresponding to the process of one or more application programs into the memory 502 according to the following instructions, and the processor 501 runs the application program stored in the memory 502, so as to implement various functions as follows:
acquiring audio to be identified;
judging whether the audio to be identified meets a preset specification or not through a conversion interface;
if the audio to be identified does not meet the preset specification, converting the audio to be identified into audio that meets the preset specification through the conversion interface;
and calling a third-party interface packaged with a voice recognition engine to perform voice recognition processing on the audio meeting the preset specification to obtain a voice recognition result.
As can be seen from the above, in this embodiment of the application the speech recognition device acquires the audio to be recognized and then judges, through a conversion interface, whether the audio to be recognized meets a preset specification. If it does not, the device converts the audio to be recognized into audio that meets the preset specification through the conversion interface, and then calls a third-party interface that encapsulates a speech recognition engine to perform speech recognition on the conforming audio and obtain a speech recognition result. Because the conversion interface converts non-conforming audio into conforming audio, the development cost of the speech recognition engine can be reduced.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present application provide a storage medium, in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps in any one of the speech recognition processing methods provided by the embodiments of the present application. For example, the instructions may perform the steps of:
acquiring audio to be identified;
judging whether the audio to be identified meets a preset specification or not through a conversion interface;
if the audio to be identified does not meet the preset specification, converting the audio to be identified into audio that meets the preset specification through the conversion interface;
and calling a third-party interface packaged with a voice recognition engine to perform voice recognition processing on the audio meeting the preset specification to obtain a voice recognition result.
For the specific implementation of the above operations, refer to the foregoing embodiments; details are not repeated here.
The storage medium may include: Read Only Memory (ROM), Random Access Memory (RAM), magnetic disks, optical disks, and the like.
Since the instructions stored in the storage medium can execute the steps in any of the speech recognition methods provided in the embodiments of the present application, the beneficial effects that can be achieved by any of the speech recognition methods provided in the embodiments of the present application can be achieved, which are detailed in the foregoing embodiments and will not be described again here.
The foregoing has described in detail a speech recognition method, apparatus, computer program product and storage medium provided by embodiments of the present application, and specific embodiments are applied in the present application to explain the principles and implementations of the present application, and the description of the foregoing embodiments is only used to help understand the method and core ideas of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A speech recognition method, comprising:
acquiring audio to be identified;
judging whether the audio to be identified meets a preset specification or not through a conversion interface;
if the audio to be identified does not meet the preset specification, converting the audio to be identified into audio that meets the preset specification through the conversion interface;
and calling a third-party interface packaged with a voice recognition engine to perform voice recognition processing on the audio meeting the preset specification to obtain a voice recognition result.
2. The method according to claim 1, wherein the determining whether the audio to be recognized conforms to a preset specification through the conversion interface comprises:
judging whether the audio to be identified is uncompressed audio through the conversion interface; and/or
judging whether the audio format of the audio to be identified is a preset audio format through the conversion interface; and/or
and judging whether the audio size of the audio to be identified is not larger than a preset audio size through the conversion interface.
3. The method of claim 2, wherein prior to obtaining the audio to be identified, the method further comprises:
and encapsulating an audio decompression method, an audio transcoding method and/or an audio segmentation method in the conversion interface.
4. The method according to claim 3, wherein if the audio to be recognized does not meet the predetermined specification, converting the audio to be recognized into the audio meeting the predetermined specification through the conversion interface comprises:
if the audio to be identified is not uncompressed audio, performing audio decompression processing on the audio to be identified by the audio decompression method to obtain uncompressed audio to be identified; and/or
if the audio format is not the preset audio format, performing audio transcoding processing on the audio to be identified by the audio transcoding method to obtain audio to be identified in the preset audio format; and/or
if the audio size is larger than the preset audio size, performing audio segmentation processing on the audio to be identified by the audio segmentation method to obtain a plurality of sub-audios to be identified, none of which is larger than the preset audio size.
5. The method according to claim 1, wherein the determining whether the audio to be recognized conforms to a preset specification through the conversion interface comprises:
judging whether the audio to be identified is a history identification audio according to the audio name of the audio to be identified;
if the audio is the historical recognition audio, extracting a voice recognition result corresponding to the audio name from a database;
if the audio is not the historical identification audio, judging whether the audio to be identified meets the preset specification or not through the conversion interface.
6. The method according to claim 5, wherein after the calling of the third-party interface encapsulated with the speech recognition engine performs speech recognition processing on the audio meeting the preset specification to obtain a speech recognition result, the method further comprises:
and storing the voice recognition result into the database.
7. The method according to any one of claims 1 to 6, wherein before the obtaining the audio to be identified, the method further comprises:
carrying out localized deployment on the voice recognition engine;
packaging the locally deployed speech recognition engine into the third-party interface;
and packaging the third party interface into the conversion interface.
8. A speech recognition apparatus, comprising:
the acquisition unit is used for acquiring the audio to be identified;
the judging unit is used for judging whether the audio to be identified meets the preset specification through a conversion interface;
the conversion unit is used for converting the audio to be recognized into the audio meeting the preset specification through the conversion interface when the audio to be recognized does not meet the preset specification;
and the recognition unit is used for calling a third-party interface packaged with a voice recognition engine to perform voice recognition processing on the audio meeting the preset specification to obtain a voice recognition result.
9. A computer program product comprising instructions which, when run on a computer, cause the computer to carry out the steps in the speech recognition method according to any one of claims 1 to 7.
10. A storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the speech recognition method according to any one of claims 1 to 7.
CN201910381065.8A 2019-05-08 2019-05-08 Voice recognition method, device, computer program product and storage medium Pending CN112002312A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910381065.8A CN112002312A (en) 2019-05-08 2019-05-08 Voice recognition method, device, computer program product and storage medium

Publications (1)

Publication Number Publication Date
CN112002312A true CN112002312A (en) 2020-11-27

Family

ID=73461215

Country Status (1)

Country Link
CN (1) CN112002312A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080071528A1 (en) * 2006-09-14 2008-03-20 Portalplayer, Inc. Method and system for efficient transcoding of audio data
EP2003640A2 (en) * 2007-06-15 2008-12-17 LG Electronics Inc. Method and system for generating and processing digital content based on text-to-speech conversion
CN105808651A (en) * 2016-02-29 2016-07-27 四川秘无痕信息安全技术有限责任公司 Android WeChat based silk_v3 voice file format decoding method
CN106375778A (en) * 2016-08-12 2017-02-01 南京青衿信息科技有限公司 Method for transmitting three-dimensional audio program code stream satisfying digital cinema standard
CN107104994A (en) * 2016-02-22 2017-08-29 华硕电脑股份有限公司 Audio recognition method, electronic installation and speech recognition system
US20170270916A1 (en) * 2016-03-17 2017-09-21 Toyota Motor Engineering & Manufacturing North America, Inc. Voice interface for a vehicle
CN107342088A (en) * 2017-06-19 2017-11-10 联想(北京)有限公司 A kind of conversion method of acoustic information, device and equipment
CN107516534A (en) * 2017-08-31 2017-12-26 广东小天才科技有限公司 Voice information comparison method and device and terminal equipment
CN108257590A (en) * 2018-01-05 2018-07-06 携程旅游信息技术(上海)有限公司 Voice interactive method, device, electronic equipment, storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
任萍萍 (Ren Pingping): "智能客服机器人" [Intelligent Customer Service Robots], 31 August 2017, 成都时代出版社 (Chengdu Times Press), pages: 45 - 49 *

CN113724711A (en) Method, device, system, medium and equipment for realizing voice recognition service

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination