CN115565528A

CN115565528A - Chinese voice awakening method, system, electronic equipment and storage medium

Info

Publication number: CN115565528A
Application number: CN202211152031.XA
Authority: CN
Inventors: 王宇
Original assignee: Sipic Technology Co Ltd
Current assignee: Sipic Technology Co Ltd
Priority date: 2022-09-21
Filing date: 2022-09-21
Publication date: 2023-01-03

Abstract

The embodiment of the invention provides a Chinese voice wake-up method, a Chinese voice wake-up system, an electronic device and a storage medium. The method comprises the following steps: dividing the characters in the Chinese speech into categories of different pronunciation clarity by using their finals; determining a category search order for the wake-up core words in the Chinese speech based on the pronunciation clarity category of each character in the wake word of the smart device; determining at least one wake-up core word in the Chinese speech through the category search order; and determining a wake word search interval in the Chinese speech according to the at least one wake-up core word, and performing voice wake-up when the score of each word of the wake word in the search interval reaches a preset threshold. Starting from the composition of Chinese pinyin, the embodiment takes the pronunciation characteristics of initials and finals into account, selects the characters whose finals are pronounced most loudly, and then locates the other characters by taking them as reference points. This reduces the influence of speech rate on wake-up accuracy, ensures wake-up accuracy, and improves the user's experience of waking up a device with Chinese speech.

Description

Chinese voice awakening method, system, electronic equipment and storage medium

Technical Field

The present invention relates to the field of intelligent voice, and in particular to a Chinese voice wake-up method, system, electronic device, and storage medium.

Background

At present, voice interaction technology is maturing, and many smart hardware manufacturers add voice functions to their products as a selling point. However, these smart devices must first be "woken up" before they enter the "voice interaction" mode. Although wake-up can also be triggered by keys or a touch screen, some smart devices have no screen and only a few buttons, which limits their use; voice wake-up is therefore adopted to reduce the learning cost and the complexity of operation.

To achieve voice wake-up, a wake-up point is usually taken as a reference point, the position of each word is searched for within an interval of fixed length, and the score of each word in the wake word is then checked against a threshold to determine whether to wake up.

In the process of implementing the invention, the inventor found that at least the following problem exists in the related art:

The above wake-up method achieves good wake-up accuracy at a normal speech rate, but if the speaker talks faster or slower, the wake-up accuracy deteriorates.

Disclosure of Invention

The invention aims to at least solve the problem in the prior art that the speaker's speech rate affects the wake-up accuracy. In a first aspect, an embodiment of the present invention provides a Chinese voice wake-up method, applied to a smart device, including:

dividing the characters in the Chinese speech into different pronunciation clarity categories by using their finals;

determining a category search order for the wake-up core words in the Chinese speech based on the pronunciation clarity category of each character in the wake word of the smart device;

determining at least one wake-up core word in the Chinese speech through the category search order;

and determining a wake word search interval in the Chinese speech according to the at least one wake-up core word, and performing voice wake-up when the score of each word of the wake word in the search interval reaches a preset threshold.

In a second aspect, an embodiment of the present invention provides a Chinese voice wake-up system, including:

a category dividing program module, configured to divide the characters in the Chinese speech into different pronunciation clarity categories by using their finals;

an order determination program module, configured to determine a category search order for the wake-up core words in the Chinese speech based on the pronunciation clarity category of each character in the wake word of the smart device;

a core word determining program module, configured to determine at least one wake-up core word in the Chinese speech through the category search order;

and a voice wake-up program module, configured to determine a wake word search interval in the Chinese speech according to the at least one wake-up core word, and to perform voice wake-up when the score of each word of the wake word in the search interval reaches a preset threshold.

In a third aspect, an electronic device is provided, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the Chinese voice wake-up method of any of the embodiments of the present invention.

In a fourth aspect, an embodiment of the present invention provides a storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the Chinese voice wake-up method according to any embodiment of the present invention.

The beneficial effects of the embodiment of the invention are as follows: starting from the composition of Chinese pinyin, the pronunciation characteristics of initials and finals are taken into account, the finals that are pronounced most loudly are selected, the clearest character in the wake word is found, and the other characters are then located by taking the clearest character as a reference point. This avoids the influence of speech rate on wake-up accuracy, ensures wake-up accuracy, further avoids false wake-ups, and improves the user's experience of waking up a device with Chinese speech.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

Fig. 1 is a flowchart of a Chinese voice wake-up method according to an embodiment of the present invention;

Fig. 2 is a flowchart of the core word search of a Chinese voice wake-up method according to an embodiment of the present invention;

Fig. 3 is a flowchart of the false wake-up determination of a Chinese voice wake-up method according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a Chinese voice wake-up system according to an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of an embodiment of an electronic device for Chinese voice wake-up according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a flowchart of a Chinese voice wake-up method according to an embodiment of the present invention, which includes the following steps:

S11: dividing the characters in the Chinese speech into different pronunciation clarity categories by using their finals;

S12: determining a category search order for the wake-up core words in the Chinese speech based on the pronunciation clarity category of each character in the wake word of the smart device;

S13: determining at least one wake-up core word in the Chinese speech through the category search order;

S14: determining a wake word search interval in the Chinese speech according to the at least one wake-up core word, and performing voice wake-up when the score of each word of the wake word in the search interval reaches a preset threshold.

In this embodiment, considering that most domestic users perform voice wake-up in Chinese (such as Xiao Ai, Xiao Bu and Xiao), the method starts from the composition of Chinese pinyin, takes the pronunciation characteristics of initials and finals into account, selects the finals that are pronounced most loudly, finds the clearest character in the wake word, and then locates the other characters by taking that character as the reference point. In this way, the influence of speech rate on wake-up accuracy in domestic voice wake-up scenarios is taken into account.

For step S11, Chinese is analyzed: the audio of a Chinese syllable is composed of an initial and a final. The final comprises a medial (the head of the final), a nucleus (the body of the final) and a coda (the tail of the final). The nucleus is the key to the pronunciation of the final and is the loudest part of the pronunciation process. This loudest part is usually the important factor for Chinese voice wake-up. For example, the user says "ni hao xiao bu", and its characters are divided into different pronunciation clarity categories by the nucleus of each final.

As an embodiment, dividing the characters in the Chinese speech into different pronunciation clarity categories by using the finals includes:

dividing the characters in the Chinese speech whose nucleus is a into a first pronunciation clarity category,

dividing the characters in the Chinese speech whose nucleus is o or e into a second pronunciation clarity category,

dividing the characters in the Chinese speech whose nucleus is i or u into a third pronunciation clarity category, wherein the pronunciation clarity categories, from high to low, are the first pronunciation clarity category, the second pronunciation clarity category and the third pronunciation clarity category.

In brief, the characters are classified by the nucleus of their finals into three classes:

Class A: the nucleus is a.

Class B: the nucleus is o or e.

Class C: the nucleus is i, u or ü.

If the wake word contains finals of all three classes, class A is pronounced most loudly and clearly, so the positions of the class A characters are located first. When the comparison threshold of each character in the wake word is set, class A > class B > class C.
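The classification above can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function name `clarity_class` is hypothetical, the input is assumed to be toneless written pinyin, and the nucleus is approximated as the most sonorous vowel letter present (a, then o/e, then i/u/ü).

```python
# Hypothetical helper: map a toneless pinyin syllable to clarity class A, B or C.
# Assumption (not from the source): the nucleus is approximated by the most
# sonorous vowel letter that appears in the written syllable.

def clarity_class(syllable: str) -> str:
    s = syllable.lower().replace("v", "ü")  # "v" is a common keyboard stand-in for ü
    if "a" in s:
        return "A"          # nucleus a: loudest, clearest
    if "o" in s or "e" in s:
        return "B"          # nucleus o or e
    return "C"              # remaining nuclei: i, u, ü


wake_word = ["ni", "hao", "xiao", "bu"]
print([clarity_class(s) for s in wake_word])  # ['C', 'A', 'A', 'C']
```

Because it works on the written form, this sketch places contracted finals such as "iu" or "ui" in class C; a real system would more likely classify on a phoneme-level lexicon.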

For step S12, take the wake word "ni hao xiao bu" as an example: "hao" and "xiao" belong to class A, and their positions are the easiest to determine in a segment of audio containing "ni hao xiao bu"; the category search order for the wake-up core words is class A, class B and class C in sequence.

If the wake word is "hei shi tou", it contains no class A final, so the category search order for the wake-up core words is class B and then class C (because the wake word has no class A final, the class B finals are located first).
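One way to read the two examples above is that the search order starts at the clearest class that actually occurs in the wake word and continues downward. The sketch below encodes that reading; `clarity_class` and `category_search_order` are hypothetical names, and the letter-based classification is a simplifying assumption.

```python
# Hypothetical sketch of step S12: derive the category search order from the
# wake word. Assumption: classes are inferred from toneless written pinyin.

ORDER = ["A", "B", "C"]  # clarity from high to low


def clarity_class(syllable: str) -> str:
    s = syllable.lower()
    if "a" in s:
        return "A"
    if "o" in s or "e" in s:
        return "B"
    return "C"


def category_search_order(wake_word: list[str]) -> list[str]:
    present = {clarity_class(s) for s in wake_word}
    for i, cls in enumerate(ORDER):
        if cls in present:
            return ORDER[i:]  # start at the clearest class that occurs
    return []                 # empty wake word


print(category_search_order(["ni", "hao", "xiao", "bu"]))  # ['A', 'B', 'C']
print(category_search_order(["hei", "shi", "tou"]))        # ['B', 'C']
```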

For step S13, Fig. 2 shows the flow of searching for the core words in a wake word, which usually contains 4 to 6 characters (for example, the wake word is set to "ni hao xiao bu", and the voice input by the user is also "ni hao xiao bu"). The search proceeds as follows:

(1) Set the class of characters to search for, initially X = class A.

(2) Count all characters in the wake word that belong to class X, and record the number as count.

(3) If count = 0, step down one class (for example, from class A to class B) and continue the search.

(4) If count = 1 or count = 2, mark the corresponding characters as core words.

(5) If count > 2, take the first and the last class X characters and mark them as core words.

For example, for "ni hao xiao bu", "hao" and "xiao" are found at the initial class A, so count = 2 and these two characters are marked as core words.

If the wake word is "hei shi tou", no class A character is found, so the search continues with X = class B; the remaining steps, which evaluate the value of count for the voice input by the user, are the same and are not repeated here.
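Steps (1) to (5) can be sketched as a short loop over the classes. The names `clarity_class` and `find_core_words` are hypothetical, and the letter-based classification of toneless pinyin is a simplifying assumption rather than something the source specifies.

```python
# Hypothetical sketch of the core word search in steps (1)-(5).

def clarity_class(syllable: str) -> str:
    s = syllable.lower()
    if "a" in s:
        return "A"
    if "o" in s or "e" in s:
        return "B"
    return "C"


def find_core_words(wake_word: list[str]) -> list[int]:
    """Return the positions (0-based) of the core words in the wake word."""
    for cls in ("A", "B", "C"):              # (1) start at class A, (3) step down when empty
        hits = [i for i, s in enumerate(wake_word) if clarity_class(s) == cls]  # (2) count
        if not hits:
            continue                          # (3) count == 0: try the next class
        if len(hits) <= 2:
            return hits                       # (4) one or two hits: all are core words
        return [hits[0], hits[-1]]            # (5) more than two: first and last only
    return []


print(find_core_words(["ni", "hao", "xiao", "bu"]))  # [1, 2]  -> "hao", "xiao"
print(find_core_words(["hei", "shi", "tou"]))        # [0, 2]  -> "hei", "tou"
```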

As an embodiment, determining the wake word search interval in the Chinese speech according to the at least one wake-up core word includes:

when there is only one wake-up core word, taking the wake-up core word as a reference point, and determining the search interval of the other words in the wake word through the reference point;

when there are two wake-up core words, namely a first wake-up core word and a second wake-up core word, selecting one of them as the reference point for each other word based on the word distance, and determining the search interval of the other words in the wake word through the reference point.

In this embodiment, if there is only one core word, the search interval of the other words is determined with that word as the reference point, for example within a distance of 3 words to the left and right of the reference point.

If there are two core words, the nearer core word is selected as the reference point for determining the search interval. Namely:

A word to the left of the first core word takes the first core word as its reference point for determining the search interval.

A word to the right of the second core word takes the second core word as its reference point for determining the search interval.

A word between the two core words takes the nearer core word as its reference point; if the distances are the same, the core word with the higher score is taken as the reference point for determining the search interval.
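The reference point rules above can be sketched as a small function. This is an illustrative reading only: `pick_reference` is a hypothetical name, positions are 0-based indices into the wake word, `scores` is an assumed mapping from core word position to its acoustic score, and the behaviour when both distance and score tie (the first core word wins) is an assumption the source does not cover.

```python
# Hypothetical sketch: choose the reference core word for a non-core word.

def pick_reference(word_idx: int, core: list[int], scores: dict[int, float]) -> int:
    """core holds one or two core word positions in ascending order."""
    if len(core) == 1:
        return core[0]                      # single core word: always the reference
    first, second = core
    if word_idx < first:
        return first                        # left of the first core word
    if word_idx > second:
        return second                       # right of the second core word
    d_first, d_second = word_idx - first, second - word_idx
    if d_first != d_second:
        return first if d_first < d_second else second   # nearer core word
    # Equal distance: the higher-scoring core word (tie -> first, an assumption).
    return first if scores[first] >= scores[second] else second


# "ni hao xiao bu": core words "hao" (1) and "xiao" (2); scores are made-up values.
core, scores = [1, 2], {1: 90.0, 2: 85.0}
print(pick_reference(0, core, scores))  # 1: "ni" is located relative to "hao"
print(pick_reference(3, core, scores))  # 2: "bu" is located relative to "xiao"
```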

For example, the user says a sentence such as "listen to the song for a moment, ni hao xiao bu, help me play the night song". The prior art generally starts from the first word of the sentence ("one ...") and searches for the wake word from there. With the core words limiting the search interval, the left side of "hao" and the right side of "xiao" can be determined quickly, so the search interval of "ni hao xiao bu" is located directly. If the wake word is "xiao bu" and "xiao" is found at the initial class A, then "xiao" is taken as the reference point to determine a search interval with a fixed range to its left and right, and this interval is searched first.

After the position of the reference point in the Chinese sentence is determined, the search interval of the other words is determined as follows:

(1) The minimum length of a word is set to L1, and the maximum length of a word is set to L2.

(2) The distance, counted in words, between the word to be determined and the reference point (a certain core word) is D. For example, in "ni hao xiao bu", the distance between "ni" and "hao" is 1, and the distance between "ni" and "xiao" is 2.

(3) The position of the reference point is X.

Then:

When the word to be determined is to the right of the reference point, its search interval is [X + D × L1, X + D × L2].

When the word to be determined is to the left of the reference point, its search interval is [X - D × L2, X - D × L1].
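The two intervals can be sketched as follows, with the additive interval applying to a word on the right of the reference point and the subtractive interval to a word on its left. The function name `search_interval`, the `side` parameter, and the frame-based numbers in the example are all hypothetical; the source does not state units or concrete values for X, L1 or L2.

```python
# Hypothetical sketch of the search interval formulas.
# x: position of the reference point, d: word distance to the reference point,
# l1 / l2: minimum / maximum length of one word (same unit as x, e.g. frames).

def search_interval(x: int, d: int, l1: int, l2: int, side: str) -> tuple[int, int]:
    if side == "right":
        return (x + d * l1, x + d * l2)
    if side == "left":
        return (x - d * l2, x - d * l1)
    raise ValueError("side must be 'left' or 'right'")


# Made-up values: "hao" found at frame 100, "xiao" at frame 130, L1 = 10, L2 = 40.
print(search_interval(100, 1, 10, 40, "left"))   # (60, 90): where to look for "ni"
print(search_interval(130, 1, 10, 40, "right"))  # (140, 170): where to look for "bu"
```

Because the interval scales with D and is bounded by the shortest and longest plausible word length, it widens for words farther from the reference point and tolerates both fast and slow speech.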

For step S14, the wake word search interval can be determined through the above steps, and whether to perform voice wake-up is decided from the scores of the words of the wake word within that interval. For example, as shown in Fig. 3, the score of each word in the wake word is calculated through an acoustic model, and wake-up is performed only if the scores of all four words of "ni hao xiao bu" reach the preset threshold.

As an implementation, when the score of any word in the wake word search interval does not reach the preset threshold, a false wake-up is determined and voice wake-up is not performed. For example, in "ni hao xiao bu", the score of "bu" is only 20 points and does not reach the preset 50 points. Even though the scores of the other words exceed the preset score, this is judged to be a false wake-up and no wake-up is performed.
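The decision rule can be sketched as an all-words check. The function name `should_wake` is hypothetical, a single shared threshold is assumed for simplicity, and apart from the 20-point score and the 50-point threshold taken from the example, the scores below are made-up values.

```python
# Hypothetical sketch of the wake / false wake-up decision in step S14.

def should_wake(scores: dict[str, float], threshold: float = 50.0) -> bool:
    """Wake only if every word of the wake word reaches the threshold."""
    return all(score >= threshold for score in scores.values())


ok = {"ni": 80, "hao": 90, "xiao": 85, "bu": 70}
weak_bu = {"ni": 80, "hao": 90, "xiao": 85, "bu": 20}

print(should_wake(ok))       # True: all four words reach 50
print(should_wake(weak_bu))  # False: "bu" scores 20, treated as a false wake-up
```

A per-class threshold (higher for class A than for class B or C) could replace the single `threshold` argument, but the source gives no concrete values for it.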

As can be seen from this embodiment, starting from the composition of Chinese pinyin, the pronunciation characteristics of initials and finals are taken into account, the finals that are pronounced most loudly are selected, the clearest character in the wake word is found, and the other characters are then located by taking the clearest character as the reference point. This avoids the influence of speech rate on wake-up accuracy, ensures wake-up accuracy, further avoids false wake-ups, and improves the user's experience of waking up a device with Chinese speech.

Fig. 4 is a schematic structural diagram of a Chinese voice wake-up system according to an embodiment of the present invention, which can execute the Chinese voice wake-up method according to any of the above embodiments and is configured in a terminal.

The Chinese voice wake-up system 10 provided in this embodiment includes: a category dividing program module 11, an order determination program module 12, a core word determining program module 13, and a voice wake-up program module 14.

The category dividing program module 11 is configured to divide the characters in the Chinese speech into different pronunciation clarity categories by using their finals; the order determination program module 12 is configured to determine a category search order for the wake-up core words in the Chinese speech based on the pronunciation clarity category of each character in the wake word of the smart device; the core word determining program module 13 is configured to determine at least one wake-up core word in the Chinese speech through the category search order; the voice wake-up program module 14 is configured to determine a wake word search interval in the Chinese speech according to the at least one wake-up core word, and to perform voice wake-up when the score of each word of the wake word in the search interval reaches a preset threshold.

Further, the category classification program module is for:

Further, the core word determining program module is for:

when there are two wake-up core words, namely a first wake-up core word and a second wake-up core word, selecting one of them as the reference point for each other word based on the word distance, and determining the search interval of the other words in the wake word through the reference point.

Further, the voice wake-up program module is configured to: when the score of any word in the wake word search interval does not reach the preset threshold, determine a false wake-up and not perform voice wake-up.

The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions which can execute the Chinese voice awakening method in any method embodiment;

as one embodiment, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:

As a non-transitory computer-readable storage medium, it may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in the non-transitory computer-readable storage medium and, when executed by a processor, perform the Chinese voice wake-up method in any of the method embodiments described above.

Fig. 5 is a schematic diagram of the hardware structure of an electronic device for the Chinese voice wake-up method according to another embodiment of the present application. As shown in Fig. 5, the device includes:

one or more processors 510 and a memory 520, with one processor 510 taken as an example in Fig. 5. The device for the Chinese voice wake-up method may further include an input device 530 and an output device 540.

The processor 510, the memory 520, the input device 530, and the output device 540 may be connected by a bus or other means, and the bus connection is exemplified in fig. 5.

The memory 520, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the Chinese voice wake-up method in the embodiments of the present application. The processor 510 executes various functional applications of the server and data processing by running the non-volatile software programs, instructions and modules stored in the memory 520, thereby implementing the Chinese voice wake-up method of the above-described method embodiments.

The memory 520 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data and the like. Further, the memory 520 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 520 may optionally include memory located remotely from processor 510, which may be connected to a mobile device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 530 may receive input numeric or character information. The output device 540 may include a display device such as a display screen.

The one or more modules are stored in the memory 520 and, when executed by the one or more processors 510, perform the Chinese voice wake-up method of any of the method embodiments described above.

The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.

The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

An embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the Chinese voice wake-up method of any of the embodiments of the present invention.

The electronic device of the embodiments of the present application exists in various forms, including but not limited to:

(1) Mobile communication devices, which are characterized by mobile communication functions and are primarily aimed at providing voice and data communications. Such terminals include smart phones, multimedia phones, feature phones, and low-end phones, among others.

(2) Ultra-mobile personal computer devices, which belong to the category of personal computers, have computing and processing functions, and generally also have mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as tablet computers.

(3) Portable entertainment devices, which can display and play multimedia content. Such devices include audio and video players, handheld game consoles, e-book readers, as well as smart toys and portable vehicle navigation devices.

(4) Other electronic devices with data processing functions.

In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A Chinese voice wake-up method, applied to a smart device, comprising the following steps:

dividing the characters in the Chinese speech into different pronunciation clarity categories by using their finals;

determining at least one wake-up core word in the Chinese speech through the category search order;

2. The method of claim 1, wherein dividing the characters in the Chinese speech into different pronunciation clarity categories by using their finals comprises:

3. The method of claim 1, wherein determining a wake word search interval in the Chinese speech according to the at least one wake-up core word comprises:

4. The method of claim 1, wherein when the score of any word in the wake word search interval does not reach a preset threshold, a false wake-up is determined and voice wake-up is not performed.

5. A Chinese voice wake-up system comprising:

6. The system of claim 5, wherein the category-partitioning program module is to:

7. The system of claim 5, wherein the core word determining program module is to:

when there is only one wake-up core word, taking the wake-up core word as a reference point, and determining the search interval of the other words in the wake word through the reference point;

8. The system of claim 5, wherein the voice wake-up program module is configured to: when the score of any word in the wake word search interval does not reach a preset threshold, determine a false wake-up and not perform voice wake-up.

9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any of claims 1-4.

10. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 4.