CN114974252A - Voice processing method, device, electronic equipment and medium


Info

Publication number
CN114974252A
Authority
CN
China
Prior art keywords
target
information
determining
text
participle
Prior art date
Legal status
Pending
Application number
CN202210416742.7A
Other languages
Chinese (zh)
Inventor
马宏
王敏
殷腾龙
Current Assignee
Hisense Visual Technology Co Ltd
Original Assignee
Hisense Visual Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hisense Visual Technology Co Ltd
Priority to CN202210416742.7A
Publication of CN114974252A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/194: Calculation of difference between files
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/06: Decision making techniques; Pattern matching strategies
    • G10L17/14: Use of phonemic categorisation or speech recognition prior to speaker recognition or verification

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure relates to a voice processing method, apparatus, electronic device and medium, and in particular to the field of voice processing technology. The method comprises: recognizing voice data to obtain a corresponding target recognition text and target voiceprint features; determining a target user according to the target voiceprint features; determining target information corresponding to the target recognition text based on an error correction map corresponding to the target user, wherein the error correction map comprises the correspondence between the target recognition text and the target information; and acquiring the similarity between the target recognition text and the target information, and modifying the target recognition text into the target information if the similarity exceeds a preset threshold. Embodiments of the disclosure can correct the voice data of the target user, which helps to improve error correction speed and user experience.

Description

Voice processing method, device, electronic equipment and medium
Technical Field
The present disclosure relates to the field of speech processing technologies, and in particular, to a speech processing method and apparatus, an electronic device, and a medium.
Background
With the popularization of voice assistants, more and more electronic devices, such as household appliances and terminal devices, have a voice recognition function, allowing users to conveniently search and control them by voice. Errors in the voice data directly affect the search and control results, which makes voice processing particularly important.
Voice processing may target different groups of users. In recent years, for example, preschool children have made up a growing share of the users who search and control electronic devices by voice, and children's content accounts for an increasing portion of search data. Preschool children exhibit the following problems in voice search and control: first, they speak highly subjectively, describe the desired content with simple words, and confuse titles with character descriptions; second, they easily mix names up and like to coin new words. Traditional voice processing methods such as edit-distance error correction and pinyin-similarity error correction achieve low accuracy on preschool children's speech, making it difficult to retrieve the correct result or determine the true control intention.
Disclosure of Invention
In order to solve, or at least partially solve, the above technical problem, the present disclosure provides a voice processing method, apparatus, electronic device, and medium, which can correct the voice data of a target user, helping to improve error correction speed and user experience.
In order to achieve the above purpose, the technical solutions provided by the embodiments of the present disclosure are as follows:
in a first aspect, the present disclosure provides a speech processing method, including:
recognizing voice data to obtain a corresponding target recognition text and target voiceprint features;
determining a target user according to the target voiceprint features;
determining target information corresponding to the target recognition text based on an error correction map corresponding to the target user, wherein the error correction map comprises: the correspondence between the target recognition text and the target information;
and acquiring the similarity between the target recognition text and the target information, and modifying the target recognition text into the target information if the similarity exceeds a preset threshold.
As an optional implementation manner of the embodiment of the present disclosure, the obtaining of the similarity between the target recognition text and the target information includes:
determining path information corresponding to the target recognition text based on the error correction map;
and determining the similarity between the target recognition text and the target information according to the path information.
As an optional implementation manner of the embodiment of the present disclosure, the path information includes: the target identification text respectively corresponds to first path information and second path information under different path types;
the determining the similarity between the target recognition text and the target information according to the path information includes:
determining a first probability value corresponding to a first path type based on the first path information, and determining a second probability value corresponding to a second path type based on the second path information;
and determining the similarity of the target recognition text and the target information according to the first probability value and the second probability value.
As an optional implementation manner of the embodiment of the present disclosure, the path information includes: the target recognition text is in a corresponding path, the related probability of the participles represented by each child node and the target recognition text and the weight factor corresponding to the participles represented by each child node are obtained;
the determining the similarity between the target recognition text and the target information according to the path information includes:
and determining the similarity between the target recognition text and the target information according to the relevant probability and the corresponding weight factor.
As an optional implementation manner of the embodiment of the present disclosure, the method further includes:
acquiring a first participle contained in corpus information related to the target information and a second participle contained in label information corresponding to the target information;
and establishing the error correction map by taking the target information as a central node, taking the first participle, the second participle and the generation information as child nodes, taking a first association relation between the central node and different child nodes as edges between the central node and the different child nodes, and taking a second association relation between the child nodes as edges between the child nodes.
As an optional implementation manner of the embodiment of the present disclosure, the method further includes:
determining the probability corresponding to the first participle according to the frequency corresponding to the syntactic dependency relationship of the first participle or the part-of-speech frequency of the target participle corresponding to the core relationship in the syntactic dependency relationship;
determining the probability corresponding to the second participle according to the initial weight and the weight factor corresponding to the second participle;
and respectively determining the weight values of the corresponding edges in the error correction map based on the probability corresponding to the first participle and the probability corresponding to the second participle.
As an optional implementation manner of the embodiment of the present disclosure, the method further includes:
if the user is determined to be a non-target user according to the target voiceprint features, or the similarity does not exceed the preset threshold, processing the target recognition text to obtain a corresponding processing result;
and modifying the processing result through a preset voice error correction method to obtain a modified text.
In a second aspect, the present disclosure provides a speech processing apparatus, comprising:
the recognition module is used for recognizing the voice data to obtain a corresponding target recognition text and a target voiceprint characteristic;
the first determining module is used for determining a target user according to the target voiceprint characteristics;
a second determining module, configured to determine, based on an error correction map corresponding to the target user, target information corresponding to the target recognition text, where the error correction map comprises: the correspondence between the target recognition text and the target information;
and the modification module is used for acquiring the similarity between the target recognition text and the target information, and modifying the target recognition text into the target information if the similarity exceeds a preset threshold.
As an optional implementation manner of the embodiment of the present disclosure, the modification module includes:
a path information determining unit, configured to determine, based on the error correction map, path information corresponding to the target recognition text;
the similarity determining unit is used for determining the similarity between the target recognition text and the target information according to the path information;
and the modifying unit is used for modifying the target identification text into the target information if the similarity exceeds a preset threshold value.
As an optional implementation manner of the embodiment of the present disclosure, the path information includes: first path information and second path information corresponding to the target recognition text under different path types;
the similarity determination unit is configured to:
determining a first probability value corresponding to a first path type based on the first path information, and determining a second probability value corresponding to a second path type based on the second path information;
and determining the similarity of the target recognition text and the target information according to the first probability value and the second probability value.
As an optional implementation manner of the embodiment of the present disclosure, the path information includes: in the path corresponding to the target recognition text, the correlation probability between the participle represented by each child node and the target recognition text, and the weight factor corresponding to the participle represented by each child node;
the similarity determination unit is further configured to: determine the similarity between the target recognition text and the target information according to the correlation probabilities and the corresponding weight factors.
As an optional implementation manner of the embodiment of the present disclosure, the apparatus further includes: an error correction map establishing module, configured to:
acquiring a first participle contained in corpus information related to the target information and a second participle contained in label information corresponding to the target information;
and establishing the error correction map by taking the target information as a central node, taking the first participle, the second participle and the generation information as child nodes, taking a first association relation between the central node and different child nodes as edges between the central node and the different child nodes, and taking a second association relation between the child nodes as edges between the child nodes.
As an optional implementation manner of the embodiment of the present disclosure, the apparatus further includes: a weight value determination module to:
determining the probability corresponding to the first participle according to the frequency corresponding to the syntactic dependency relationship of the first participle or the part-of-speech frequency of a target participle corresponding to the core relationship in the syntactic dependency relationship;
determining the probability corresponding to the second participle according to the initial weight and the weight factor corresponding to the second participle;
and respectively determining the weight values of the corresponding edges in the error correction map based on the probability corresponding to the first participle and the probability corresponding to the second participle.
As an optional implementation manner of the embodiment of the present disclosure, the apparatus further includes:
the processing module is used for processing the target recognition text to obtain a corresponding processing result if the user is determined to be a non-target user according to the target voiceprint features or the similarity does not exceed the preset threshold;
and the text determining module is used for modifying the processing result through a preset voice error correction method to obtain a modified text.
In a third aspect, the present disclosure also provides an electronic device, including:
one or more processors;
a storage device to store one or more programs,
when executed by the one or more processors, the one or more programs cause the one or more processors to implement the speech processing method of any of the embodiments of the present disclosure.
In a fourth aspect, the present disclosure also provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor, implements the speech processing method described in any of the embodiments of the present disclosure.
Compared with the prior art, the technical solution provided by the embodiments of the present disclosure has the following advantages: voice data is first recognized to obtain a corresponding target recognition text and target voiceprint features; a target user is then determined according to the target voiceprint features; target information corresponding to the target recognition text is then determined based on an error correction map corresponding to the target user, wherein the error correction map comprises the correspondence between the target recognition text and the target information; finally, the similarity between the target recognition text and the target information is acquired, and the target recognition text is modified into the target information if the similarity exceeds a preset threshold. In this way, the voice data of the target user can be corrected, which helps to improve error correction speed and user experience.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below; it will be apparent to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic diagram of an application scenario of a speech processing procedure in an embodiment of the present disclosure;
fig. 2A is a block diagram of a hardware configuration of an electronic device according to one or more embodiments of the present disclosure;
fig. 2B is a software configuration diagram of an electronic device according to one or more embodiments of the present disclosure;
FIG. 2C is a schematic illustration of an icon control interface display of an application included in a smart device in accordance with one or more embodiments of the present disclosure;
fig. 3A is a schematic flow chart of a speech processing method according to an embodiment of the present disclosure;
FIG. 3B is a schematic diagram illustrating a speech processing method according to an embodiment of the disclosure;
FIG. 4A is a schematic flow chart of another speech processing method according to the embodiment of the present disclosure;
FIG. 4B is a schematic diagram of another speech processing method provided by the disclosed embodiment;
fig. 5A is a schematic flowchart of a method for creating an error correction map according to an embodiment of the present disclosure;
fig. 5B is a schematic diagram illustrating a principle of creating an error correction map according to an embodiment of the disclosure;
FIG. 5C is a diagram illustrating a piece of knowledge in a process of creating an error correction map according to an embodiment of the disclosure;
fig. 5D is a schematic diagram of an error correction map provided in the embodiment of the present disclosure;
fig. 6A is a schematic diagram illustrating a principle of determining weight values of corresponding edges in an error correction map according to an embodiment of the present disclosure;
fig. 6B is a schematic diagram illustrating a principle of determining a probability corresponding to a first participle according to an embodiment of the present disclosure;
fig. 6C is a schematic diagram of a similarity determination method based on an error correction map according to an embodiment of the present disclosure;
fig. 6D is a schematic diagram of another similarity determination method based on an error correction map according to an embodiment of the present disclosure;
FIG. 7 is a flowchart illustrating another speech processing method according to an embodiment of the present disclosure;
fig. 8A is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present disclosure;
FIG. 8B is a block diagram of a modification module in the speech processing apparatus according to the embodiment of the disclosure;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
The terms "first" and "second," etc. in this disclosure are used to distinguish between different objects, rather than to describe a particular order of objects. For example, the first path information, the second path information, and the like are for distinguishing different path information, not for describing a specific order of the path information.
With the continuous development of science and technology, various electronic products such as home equipment tend to be intelligent continuously, and great convenience is brought to the life of people. More and more home equipment has voice search and voice control functions, and a user can conveniently and quickly search and quickly control the home equipment. The home equipment identifies voice data of the user, determines the real intention of the user, and executes a subsequent control process based on the real intention. With the development of technologies related to speech and natural language processing, speech recognition and processing are widely applied to various electronic products as a common human-computer interaction technology, and are popular with users in a natural and convenient interaction mode, so that the speech recognition and processing gradually become a mainstream interaction control mode in the era of intelligent products.
Illustratively, user-group analysis on a big data platform shows that preschool children (i.e., young children) mostly search for children's media assets (i.e., children's programs such as cartoons and children's entertainment shows) when using the voice search function, and that their voice searches suffer from strong subjectivity and frequently mistaken program names. For these problems, the traditional edit-distance or pinyin-similarity error correction methods produce text of low accuracy from the voice data and cannot accurately correct scenarios in which children mistake program names or make other errors, so the hit rate of the search service is low.
As can be seen from the above, the conventional speech processing method is not highly accurate, and therefore a speech processing method with higher accuracy is required.
Fig. 1 is a schematic view of an application scenario of a speech processing procedure in an embodiment of the present disclosure. For example, as shown in fig. 1, assuming that the smart devices in the smart home scenario include a smart device 100 (i.e., a smart refrigerator), a smart device 101 (i.e., a smart washing machine), and a smart device 102 (i.e., a smart display device), when a user wants to perform voice search or voice control through the smart device in the home scenario, voice data may be obtained by recording through a recording application in a terminal device 104, where the voice data may be a search intention or a control intention of the user. The terminal device 104 transmits the voice data of the user to the server 103 to cause the server 103 to execute a corresponding voice processing method, that is: the voice data are identified to obtain a corresponding target identification text and a target voiceprint characteristic, a target user is determined according to the target voiceprint characteristic, target information corresponding to the target identification text is determined based on an error correction map corresponding to the target user, the similarity between the target identification text and the target information is obtained, and if the similarity exceeds a preset threshold, the target identification text is modified into the target information. After obtaining the target information, the server 103 sends the target information to the corresponding intelligent device, so that the intelligent device executes the corresponding function. Or the user may enter a sound through the local control device 105, such as a recording module in the internet of things terminal, where the sound may be the user's search intention or control intention. The local control device 105 executes the above-described voice processing method and transmits the obtained target information to the smart device so that the smart device performs a search based on the target information. The voice processing device can be configured in each intelligent device, and the voice processing device executes the voice processing method, so that the aim of voice search or voice control is fulfilled.
It should be noted that: the smart device 101 may also be a tablet, a digital cinema system, or a video server, etc., which are only illustrated in fig. 1 by way of example, and the type and number of the smart devices are not specifically limited.
The speech processing method provided by the embodiment of the disclosure can be implemented based on electronic equipment, or a functional module or a functional entity in the electronic equipment.
The electronic device may be a Personal Computer (PC), a server, a mobile phone, a tablet computer, a notebook computer, a mainframe computer, and the like, which is not specifically limited in this disclosure.
Fig. 2A is a block diagram of a hardware configuration of an electronic device according to one or more embodiments of the present disclosure. As shown in fig. 2A, the electronic apparatus includes: at least one of a tuner demodulator 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, and a user interface 280. The controller 250 includes a central processing unit, a video processor, an audio processor, a graphic processor, a RAM, a ROM, a first interface to an nth interface for input/output, among others. The display 260 may be at least one of a liquid crystal display, an OLED display, a touch display, and a projection display, and may also be a projection device and a projection screen. The tuner demodulator 210 receives a broadcast television signal through a wired or wireless reception manner, and demodulates an audio/video signal, such as an EPG audio/video data signal, from a plurality of wireless or wired broadcast television signals. The communicator 220 is a component for communicating with an external device or a server according to various communication protocol types. For example: the communicator may include at least one of a Wifi module, a bluetooth module, a wired ethernet module, and other network communication protocol chips or near field communication protocol chips, and an infrared receiver. The electronic device may establish transmission and reception of control signals and data signals with the server 203 or the local control device 205 through the communicator 220. The detector 230 is used to collect signals of the external environment or interaction with the outside. The controller 250 and the modem 210 may be located in different separate devices, that is, the modem 210 may also be located in an external device of the main device where the controller 250 is located, such as an external set-top box. The user interface 280 may be used to receive control signals for controlling devices, such as infrared remote controls, etc.
In some embodiments, controller 250 controls the operation of the electronic device and responds to user actions through various software control programs stored in memory. The controller 250 controls the overall operation of the electronic device. A user may input a user command on a Graphical User Interface (GUI) displayed on the display 260, and the user input interface receives the user input command through the Graphical User Interface (GUI). Alternatively, the user may input the user command by inputting a specific sound or gesture, and the user input interface receives the user input command by recognizing the sound or gesture through the sensor.
In some embodiments, a "user interface" is a media interface for interaction and information exchange between an application or operating system and a user that enables conversion between an internal form of information and a form that is acceptable to the user. A commonly used presentation form of the User Interface is a Graphical User Interface (GUI), which refers to a User Interface related to computer operation and displayed in a graphical manner. The interface element may be an icon, a window, a control, or the like, displayed in a display screen of the electronic device, where the control may include at least one of an icon, a button, a menu, a tab, a text box, a dialog box, a status bar, a navigation bar, a Widget, or the like, as a visual interface element.
Fig. 2B is a schematic software configuration diagram of an electronic device according to one or more embodiments of the present disclosure, and as shown in fig. 2B, the system is divided into four layers, which are, from top to bottom, an Application (Applications) layer (referred to as an "Application layer"), an Application Framework (Application Framework) layer (referred to as a "Framework layer"), an Android runtime (Android runtime) and system library layer (referred to as a "system runtime library layer"), and a kernel layer.
In some embodiments, at least one application program runs in the application program layer, and the application programs may be windows (windows) programs carried by an operating system, system setting programs, clock programs or the like; or an application developed by a third party developer. In particular implementations, applications in the application layer include, but are not limited to, the above examples.
In some embodiments, the system runtime layer provides support for the upper layer, i.e., the framework layer, and when the framework layer is used, the android operating system runs the C/C + + library included in the system runtime layer to implement the functions to be implemented by the framework layer.
In some embodiments, the kernel layer is a layer between hardware and software, including at least one of the following drivers: audio drive, display driver, bluetooth drive, camera drive, WIFI drive, USB drive, HDMI drive, sensor drive (like fingerprint sensor, temperature sensor, pressure sensor etc.) and power drive etc..
Fig. 2C is a schematic diagram illustrating an icon control interface display of an application program included in an intelligent device (mainly, an intelligent playback device, such as an intelligent television, a digital cinema system, or a video server), according to one or more embodiments of the present disclosure, as shown in fig. 2C, an application layer includes at least one application program that can display a corresponding icon control in a display, for example: the system comprises a live television application icon control, a video on demand VOD application icon control, a media center application icon control, an application center icon control, a game application icon control and the like. The live television application program can provide live television through different signal sources. A video on demand VOD application may provide video from different storage sources. Unlike live television applications, video on demand provides a video display from some storage source. The media center application program can provide various played application programs. The application program center can provide and store various application programs.
The voice processing method provided by the embodiment of the application can be realized based on the electronic equipment.
The voice processing method provided by the embodiments of the present disclosure obtains a corresponding target recognition text and target voiceprint features by recognizing voice data, then determines a target user according to the target voiceprint features, and then determines target information corresponding to the target recognition text based on an error correction map corresponding to the target user, wherein the error correction map comprises the correspondence between the target recognition text and the target information; finally, the similarity between the target recognition text and the target information is obtained, and the target recognition text is modified into the target information if the similarity exceeds a preset threshold.
For more detailed description of the present solution, the following description is made in conjunction with fig. 3A, and it is understood that the steps involved in fig. 3A may include more steps or fewer steps in actual implementation, and the order between the steps may also be different, so as to enable the speech processing method provided in the embodiment of the present application.
Fig. 3A is a schematic flowchart of a speech processing method according to an embodiment of the disclosure, and fig. 3B is a schematic diagram of a principle of the speech processing method according to the embodiment of the disclosure. The embodiment can be applied to the situations of recognizing the voice data and modifying the target recognition text. The method of the embodiment may be performed by a speech processing apparatus, which may be implemented by hardware and/or software and may be configured in an electronic device.
As shown in fig. 3A, the method specifically includes the following steps:
s310, voice data are identified, and corresponding target identification texts and target voiceprint features are obtained.
The target recognition text can be obtained by recognizing the voice data through a speech recognition technology, and the target voiceprint features can be obtained by recognizing the voice data through a voiceprint recognition technology. Speech recognition technology converts a sound signal into text content; it generally comprises an acoustic model and a language model, converting the sound into phonemes and then obtaining the optimal text content through the language model. The present disclosure employs a speech recognition module to convert the voice data into the target recognition text. Voiceprint recognition technology judges the identity of a user through sound: it converts the short-time spectrum of the sound signal into Mel-Frequency Cepstral Coefficient (MFCC) features, and then outputs the age-group characteristics of the voiceprint through a classification algorithm such as a Support Vector Machine (SVM). The present disclosure employs a voiceprint recognition module to convert the voice data into the target voiceprint features. In this embodiment, the voiceprint age groups are mainly divided into four categories: children, young people, middle-aged people, and the elderly. The text modification process is described below taking a child user as an example.
Specifically, after the voice data of the user is received, it can be recognized by a speech recognition module and a voiceprint recognition module respectively to obtain the corresponding target recognition text and target voiceprint features; alternatively, the voice data can be recognized by a single module having both speech recognition and voiceprint recognition functions to obtain the corresponding target recognition text and target voiceprint features. A minimal sketch of this step is given below.
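As a concrete illustration of S310, the following is a minimal sketch, assuming librosa for MFCC extraction and a pre-trained scikit-learn SVM as the age-group classifier mentioned above; the `asr_engine` object and its `transcribe` method are hypothetical stand-ins for any speech-to-text component, not an API defined by the patent.

```python
# Minimal sketch of S310 (assumptions: librosa for MFCC, a pre-trained
# sklearn SVM for the voiceprint age-group classifier; `asr_engine` and its
# `transcribe` method are hypothetical placeholders for any ASR component).
import librosa
import numpy as np

def extract_voiceprint_features(wav_path: str) -> np.ndarray:
    """Short-time spectrum -> MFCC features, averaged into one vector."""
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)  # shape (13, frames)
    return mfcc.mean(axis=1)

def recognize(wav_path: str, asr_engine, age_classifier):
    target_text = asr_engine.transcribe(wav_path)        # speech -> text
    voiceprint = extract_voiceprint_features(wav_path)
    age_group = age_classifier.predict([voiceprint])[0]  # e.g. "child"
    return target_text, age_group
```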
And S320, determining a target user according to the target voiceprint characteristics.
The target user may be a preset type of user whose voice data is prone to errors and has a high error rate, such as a child user (a preschool child) or an elderly user; the disclosure is not limited in this respect.
After the target voiceprint feature is obtained, because the target voiceprint feature contains the age characteristics, whether the user corresponding to the voice data is the target user or not can be determined according to the target voiceprint feature.
S330, determining target information corresponding to the target recognition text based on the error correction map corresponding to the target user.
The error correction map may be understood as a map pre-established for the target user, corresponding to the identification information of recognition texts recognized from the target user's voice data. The identification information may be title information, picture information, control instruction information, and the like. If the voice data of the target user contains the title of a television program, the error correction map may be a multimedia content map; if the voice data contains control instruction information (e.g., turning a device on or off), the error correction map may be a control information map. Multimedia content may include television dramas, animations, variety shows, and the like. The error correction map comprises the correspondence between the target recognition text and the target information. The target information can be understood as the standard information corresponding to the target recognition text, the target recognition text being the user's erroneous expression of the target information. For example, if the target recognition text is "XX drama" (an incorrect drama name), the target information is the standard name of that drama.
After the target user is determined, an error correction map corresponding to the target user can be obtained, and target information corresponding to the target recognition text can be determined by querying the error correction map, wherein the target information is content represented by a central node in the error correction map.
And S340, acquiring the similarity between the target recognition text and the target information, and modifying the target recognition text into the target information if the similarity exceeds a preset threshold.
The preset threshold may be a preset value, and may also be determined according to specific situations, and the disclosure is not limited.
After the target information is determined, the similarity between the target recognition text and the target information can be calculated by a similarity calculation method. The obtained similarity is then compared with the preset threshold; if the similarity exceeds the preset threshold, the target recognition text is highly similar to the target information, and the target recognition text is modified into the target information. A sketch of this decision follows.
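The following sketch pictures steps S330-S340 together. The `correction_map.lookup` interface, the `similarity_fn` argument, and the 0.8 threshold value are illustrative assumptions only; the patent merely requires "a preset threshold" and leaves the lookup and similarity mechanisms to the later embodiments.

```python
# Sketch of S330-S340. `correction_map.lookup`, `similarity_fn`, and the
# threshold value are illustrative assumptions, not the patent's definitions.
PRESET_THRESHOLD = 0.8  # assumed value

def correct_recognition_text(target_text, correction_map, similarity_fn):
    # error correction map: correspondence between recognition text and target info
    target_info = correction_map.lookup(target_text)
    if target_info is None:
        return target_text  # no correspondence found; keep the text as-is
    sim = similarity_fn(target_text, target_info)
    if sim > PRESET_THRESHOLD:
        return target_info  # modify the recognition text into the target info
    return target_text
```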
In some embodiments, after acquiring the similarity between the target recognition text and the target information and modifying the target recognition text into the target information if the similarity exceeds the preset threshold, the method further includes:
and taking the target information as a keyword so as to enable the corresponding equipment to execute corresponding operation according to the keyword.
Exemplarily, if the voice data corresponding to the target user is a search intention, the target information is used as a search keyword so that the corresponding device executes a corresponding search operation according to the search keyword; and if the voice data corresponding to the target user is the control intention, the target information is used as a control keyword so that the corresponding equipment executes corresponding control operation according to the control keyword.
In this embodiment, the method ensures that the subsequent device executes the corresponding operation, thereby improving the accuracy of the search result or the control process and increasing user satisfaction.
Fig. 4A is a schematic flowchart of another speech processing method provided by an embodiment of the present disclosure, and fig. 4B is a schematic diagram of its principle. This embodiment mainly describes the process of determining the similarity between the target recognition text and the target information.
As shown in fig. 4A, the method specifically includes the following steps:
and S410, identifying the voice data to obtain a corresponding target identification text and a target voiceprint feature.
And S420, determining a target user according to the target voiceprint characteristics.
And S430, determining target information corresponding to the target recognition text based on the error correction map corresponding to the target user.
And S440, determining path information corresponding to the target recognition text based on the error correction map.
The error correction map includes a center node, each child node, and a corresponding edge (i.e., a connection relationship).
After the target information corresponding to the target recognition text is determined, the connection relationships between the central node and each child node and between different child nodes in the error correction map are queried, so that the target child nodes that generate the target recognition text can be determined; the path information corresponding to the target recognition text can then be determined according to the connection relationships between these target child nodes, or between the central node and these target child nodes.
S450, according to the path information, determining the similarity between the target recognition text and the target information, and if the similarity exceeds a preset threshold, modifying the target recognition text into the target information.
After the path information corresponding to the target recognition text is determined, the similarity between the target recognition text and the target information can be calculated based on the weight value of the corresponding edge in the path information, and if the similarity exceeds a preset threshold, the target recognition text is modified into the target information.
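A sketch of how S440-S450 might combine edge weights is given below, assuming (consistent with Table 6 later in this description) that each path is a list of weighted edges. Multiplying the two path-type probabilities at the end is our assumption for illustration; the patent leaves the exact combination function open.

```python
# Sketch of S440-S450 (assumption: a path is a list of edge weight values,
# and the two path-type probabilities are combined by a simple product).
def path_probability(edge_weights):
    prob = 1.0
    for w in edge_weights:  # multiply the weight values along one path
        prob *= w
    return prob

def path_similarity(split_path_weights, generation_path_weights):
    p_first = path_probability(split_path_weights)        # first path type
    p_second = path_probability(generation_path_weights)  # second path type
    return p_first * p_second  # assumed combination of the two probabilities
```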
In the embodiment, the path information corresponding to the target recognition text is determined based on the error correction map, and the similarity between the target recognition text and the target information is determined according to the path information.
Fig. 5A is a schematic flowchart of a method for creating an error correction map according to an embodiment of the present disclosure; fig. 5B is a schematic diagram illustrating a principle of establishing an error correction map according to an embodiment of the present disclosure. The present embodiment mainly describes a process of establishing an error correction map.
As shown in fig. 5A, the method specifically includes the following steps:
s510, acquiring a first participle contained in the corpus information related to the target information and a second participle contained in the label information corresponding to the target information.
The corpus information includes target information and text information corresponding to an error expression mode related to the target information. The tag information is expanded according to the target information, and if the target information is the name of the multimedia content, the tag information can be a role object, a role name, a role type, a role attribute and the like in the multimedia content; if the target information is control information for the smart device, the tag information may be a name of the smart device, a function of the smart device, and the like, which is not limited in this disclosure.
Specifically, all first participles contained in the text content can be obtained by performing participle tagging on all text content contained in the text information, and a second participle contained in the label information can be obtained by performing participle tagging on the label information corresponding to the target information.
For example, assuming that the corpus information is "ABCD" and "CBAD", the first participle after participle tagging may be: "A", "B", "CD", "CB", and "D"; assuming that the label information is a Role Name (Role Name) and a Role Object (Role Object) corresponding to "ABCD" (some animation Name), the second participle after participle tagging may be: "a", "b", "c", "d", "e", and "f", etc.
S520, establishing the error correction map by taking the target information as the central node, taking the first participle, the second participle and the generation information as child nodes, taking a first association relation between the central node and different child nodes as edges between the central node and the different child nodes, and taking a second association relation between the child nodes as edges between the child nodes.
A map is a semantic network that reveals the relations between entities. The map is composed of pieces of knowledge, and each piece of knowledge can be represented as a subject-predicate-object triple: the subject and object are represented by nodes, which describe entity names, while the predicate is represented by an edge, which describes the relation between the entities. The generation information may be understood as the text information corresponding to an erroneous expression related to the target information in the corpus information; it may be generated based on the first participle, the second participle, or both, with the specific generation mode determined by the specific situation. The first association relation is a splitting relation, and the second association relation is a generation relation.
It should be noted that: each edge is provided with directional information.
For example, when the target information is a program name, the tag information and generation information obtained by expansion may be as shown in Table 1 below.
TABLE 1
[Table 1 is reproduced only as an image in the original publication.]
After the first participle and the second participle are obtained, the error correction map can be established by taking the target information as the central node, taking the first participle, the second participle and the generation information as child nodes, taking the splitting relation between the central node and different child nodes as the edges between them, and taking the generation relation between the child nodes as the edges between the child nodes, as sketched below.
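The following is a minimal sketch of this construction, assuming networkx as the graph backend; the node roles and the "split"/"generate" relation names mirror Fig. 5D and Table 6, and the dictionary shape of `generation_info` is our assumption for how erroneous expressions map to the participles they are built from.

```python
# Sketch of S510-S520 (assumption: networkx as the graph backend; "split"
# edges run from the central node, "generate" edges run between child nodes).
import networkx as nx

def build_error_correction_map(target_info, first_participles,
                               second_participles, generation_info):
    """generation_info: {erroneous expression: [participles it is built from]}"""
    g = nx.DiGraph()  # every edge carries direction information
    g.add_node(target_info, role="center")
    for node in [*first_participles, *second_participles, *generation_info]:
        g.add_node(node, role="child")
    for word in [*first_participles, *second_participles]:
        g.add_edge(target_info, word, rel="split")        # first association relation
    for wrong_text, parts in generation_info.items():
        for word in parts:
            g.add_edge(word, wrong_text, rel="generate")  # second association relation
    return g
```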
In the embodiment of the disclosure, because the corpus information contains the text information corresponding to the error expression mode related to the target information, the error correction map established by the method is more comprehensive, has stronger practicability, and is beneficial to subsequently modifying the target identification text into the target information.
For example, fig. 5C is a schematic diagram of a piece of knowledge in a process of establishing an error correction map according to an embodiment of the present disclosure. Wherein, the node 1 and the node 2 are entity names, and the edge pointing from the node 1 to the node 2 represents the node relationship of the node 1 and the node 2 (relationship, which is later referred to as rel for substitution).
Exemplarily, fig. 5D is a schematic diagram of an error correction map provided in an embodiment of the present disclosure. As shown in fig. 5D, taking "ABCD" as the target information and a child user as the target user, the corresponding error correction map is established through the above process, with "ABCD" as the central node and the first participles, second participles, and generation information related to "ABCD" as child nodes connected by edges carrying direction information.
In some embodiments, the method further comprises:
determining the probability corresponding to the first participle according to the frequency corresponding to the syntactic dependency relationship of the first participle or the part-of-speech frequency of a target participle corresponding to the core relationship in the syntactic dependency relationship;
determining the probability corresponding to the second participle according to the initial weight and the weight factor corresponding to the second participle;
and respectively determining the weight values of the corresponding edges in the error correction map based on the probability corresponding to the first participle and the probability corresponding to the second participle.
The syntactic dependency relationship may include a predicate relationship (Subject-Verb, SBV), a move-object relationship (Verb-Bbject, VOB), a mediate-object relationship (POV), a centering relationship (atttribute, ATT), an Argument (ADV), a move-Complement relationship (CMP), a parallel relationship (Coordinate, COO), a core relationship (Head, HED), and the like. The initial weight and the weight factor may be set by a user, or may be determined according to a specific situation, and the disclosure is not limited. For example, the initial weight may be set to a value of 0.7 or 0.8, etc. The weight factor can be set according to the priority of the second participle, if the priority of the second participle is high, the corresponding weight factor is set to be a larger numerical value, such as 0.9; if the priority of the second participle is low, the corresponding weight factor is set to a smaller value, such as 0.2.
Specifically, the syntactic dependency relationship of each first participle can be obtained by performing syntactic dependency analysis on all the first participles.
For example, the information table obtained by performing participle tagging on target information (for example, the program names "ABCD" and "EFGH") and performing syntactic dependency analysis on the resulting participles may be as shown in Table 2 below.
TABLE 2
[Table 2 is reproduced only as an image in the original publication.]
In Table 2, text represents the text corresponding to the target information; items represents the number of items; deprel represents the dependency relation; postag represents the part-of-speech tag; id represents the position of the current participle; and word represents the current participle.
Similarly, the information table obtained by performing participle tagging on text information corresponding to an erroneous expression related to the target information (again a program name) and performing syntactic dependency analysis on the resulting participles may be as shown in Table 3 below.
TABLE 3
[Table 3 is reproduced only as an image in the original publication.]
The frequency corresponding to each syntactic dependency may be determined as follows: count the frequency of occurrence of each syntactic dependency among the syntactic dependencies of all first participles, and normalize the counts to construct a syntactic dependency frequency mapping table, as shown in Table 4 below. Specifically, the frequency of occurrence (freq_1) of each syntactic dependency across all syntactic dependency analyses is determined; the frequency of each syntactic dependency is then divided by the sum of the maximum (freq_1max) and minimum (freq_1min) of all the occurrence frequencies, and the resulting normalized frequency is used as the frequency P_dep corresponding to that syntactic dependency, as shown in Equation 1:

P_dep = freq_1 / (freq_1max + freq_1min)    (1)
TABLE 4
Syntactic dependency Frequency of occurrence Normalized frequency
SBV 0.154 0.440
VOB 0.125 0.357
POV 0.046 0.131
ATT 0.348 0.994
ADV 0.038 0.109
CMP 0.035 0.100
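As a numerical check of Equation 1 against Table 4, the sketch below reproduces the table's normalized values. Note that doing so requires freq_1max + freq_1min = 0.35, i.e. a minimum frequency of 0.002 from a relation not listed in Table 4 (e.g. HED); that minimum is an assumption on our part.

```python
# Numerical check of Equation 1 against Table 4 (assumption: the minimum
# occurrence frequency 0.002 comes from a relation omitted from the table).
raw_freq = {"SBV": 0.154, "VOB": 0.125, "POV": 0.046,
            "ATT": 0.348, "ADV": 0.038, "CMP": 0.035}
freq_max, freq_min = 0.348, 0.002  # assumed minimum, see note above
for dep, f in raw_freq.items():
    print(dep, round(f / (freq_max + freq_min), 3))
# Output matches Table 4: SBV 0.44, VOB 0.357, POV 0.131, ATT 0.994, ...
```

The same normalization scheme applies to Equation 2 and Table 5 below.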
The part-of-speech frequency of the target participle corresponding to the core relation may be determined by counting the frequency of occurrence of the part-of-speech tag of the target participle corresponding to the core relation in the syntactic dependencies of all first participles, and normalizing to construct a core part-of-speech probability table, as shown in Table 5 below. Specifically, the frequency of occurrence (freq_2) of the part of speech corresponding to each target participle is counted; each part-of-speech frequency is then divided by the sum of the maximum (freq_2max) and minimum (freq_2min) of the occurrence frequencies, and the resulting normalized probability is used as the core-word part-of-speech probability P_core, as shown in Equation 2:

P_core = freq_2 / (freq_2max + freq_2min)    (2)
TABLE 5
Tag Meaning Frequency of occurrence Normalized probability
n Common noun 0.386 0.990
a Adjectives 0.065 0.167
v Common verb 0.142 0.364
m Number word 0.004 0.010
c Conjunction word 0.006 0.015
After the syntactic dependency frequency mapping table (Table 4) and the core part-of-speech probability table (Table 5) are constructed, the probability corresponding to a first participle is obtained by looking up Table 4 or Table 5, and the probability corresponding to a second participle is determined by multiplying its initial weight by its weight factor. Once the probabilities corresponding to the first and second participles are obtained, the weight values of the edges between the central node and the different child nodes in the error correction map can be determined based on the probabilities corresponding to the first participles, and the weight values of the edges between child nodes can be determined based on the probabilities corresponding to the second participles. A sketch of this assignment is shown below.
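The sketch below illustrates the edge-weight assignment just described; the lookup dictionaries are abbreviated excerpts of Tables 4 and 5 for illustration only.

```python
# Sketch of the edge-weight assignment (lookup tables are abbreviated
# excerpts of Tables 4 and 5; not the full tables).
P_DEP = {"SBV": 0.440, "VOB": 0.357, "ATT": 0.994}   # from Table 4
P_CORE = {"n": 0.990, "v": 0.364, "a": 0.167}        # from Table 5

def first_participle_probability(dep_rel, is_core_word, core_pos=None):
    if is_core_word:          # core (HED) word: query the part-of-speech table
        return P_CORE[core_pos]
    return P_DEP[dep_rel]     # non-core word: query the dependency table

def second_participle_probability(initial_weight, weight_factor):
    return initial_weight * weight_factor
```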
For example, taking the error correction map shown in fig. 5D as an example, the weight value corresponding to each edge in the error correction map is determined by the above method, so as to obtain a path conversion look-up table, as shown in the following table 6:
TABLE 6
Starting point Path type Edge End point Weight value
ABCD Split path ATT A 0.994
ABCD Split path ATT B 0.994
ABCD Split path HED CD 0.99
ABCD Split path Role Name a 0.64
ABCD Split path Role Name b 0.72
ABCD Split path Role Object c 0.8
ABCD Split path Role Object d 0.8
ABCD Split path Role Object e 0.8
ABCD Split path Role Object f 0.7
A Generation path ATT B 0.994
B Generation path ATT c 0.994
c Generation path HED ABc 0.99
b Generation path ATT CD 0.994
CD Generation path HED bCD 0.99
d Generation path ATT A 0.994
A Generation path ATT f 0.994
f Generation path HED dAf 0.99
The split paths in Table 6 indicate the paths connecting the central node to different child nodes (which may also be referred to as paths corresponding to the first association relationship), and the generation paths indicate the connection paths between child nodes (which may also be referred to as paths corresponding to the second association relationship). The end points of the split-path edges labeled Role Name and Role Object are second participles, and their weight values are determined by multiplying the corresponding initial weight by the weight factor.
Specifically, in the split paths of Table 6: the second participle with end point "a" has an initial weight of 0.8 and a weight factor of 0.8; end point "b": initial weight 0.8, weight factor 0.9; end point "c": initial weight 0.8, weight factor 1; end point "d": initial weight 0.8, weight factor 1; end point "e": initial weight 0.8, weight factor 1; end point "f": initial weight 0.7, weight factor 1. Multiplying each pair yields the weight values 0.64, 0.72, 0.8, 0.8, 0.8 and 0.7 listed in Table 6.
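The arithmetic can be checked directly; the sketch below (illustrative only, not from the patent text) reproduces the split-path weight values in Table 6 from the initial weights and weight factors just listed:

```python
# (initial weight, weight factor) per second participle, as listed above
second_participles = {
    "a": (0.8, 0.8), "b": (0.8, 0.9), "c": (0.8, 1.0),
    "d": (0.8, 1.0), "e": (0.8, 1.0), "f": (0.7, 1.0),
}
weights = {node: round(w0 * factor, 2)
           for node, (w0, factor) in second_participles.items()}
# -> {'a': 0.64, 'b': 0.72, 'c': 0.8, 'd': 0.8, 'e': 0.8, 'f': 0.7}, as in Table 6
```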
For example, fig. 6A is a schematic diagram illustrating a principle of determining a weight value of a corresponding edge in an error correction map according to an embodiment of the present disclosure, and steps corresponding to fig. 6A have been described in the foregoing embodiment, and are not described herein again.
In some embodiments, determining the probability corresponding to the first participle comprises:
determining the probability corresponding to the core word based on the part-of-speech frequency of the core word when the first participle is determined to be the core word according to the syntactic dependency relationship of the first participle;
and when the first participle is determined to be a non-core word according to the syntactic dependency relationship of the first participle, determining the probability corresponding to the non-core word based on the frequency corresponding to the syntactic dependency relationship.
Specifically, when the first participle is determined to be the core word according to its syntactic dependency relationship, Table 5 may be queried with the core word's part of speech to determine the probability corresponding to the core word; when the first participle is determined to be a non-core word according to its syntactic dependency relationship, Table 4 may be queried with that syntactic dependency relationship to determine the probability corresponding to the non-core word.
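A minimal sketch of this two-case lookup follows, using the normalized values from Tables 4 and 5. Treating "HED" as the label of the core relationship (as the HED edges in Table 6 suggest) and the 0.0 default for unknown labels are assumptions, not statements from the patent:

```python
DEP_FREQ = {"SBV": 0.440, "VOB": 0.357, "POV": 0.131,
            "ATT": 0.994, "ADV": 0.109, "CMP": 0.100}   # Table 4, normalized
CORE_POS_PROB = {"n": 0.990, "a": 0.167, "v": 0.364,
                 "m": 0.010, "c": 0.015}                # Table 5, normalized

def first_participle_prob(dep: str, pos: str) -> float:
    """Core words are scored by part of speech (Table 5); all other
    words by their syntactic dependency type (Table 4)."""
    if dep == "HED":                        # assumed label of the core relation
        return CORE_POS_PROB.get(pos, 0.0)
    return DEP_FREQ.get(dep, 0.0)
```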
In this embodiment, since the core word is important, the probability of a first participle that is the core word is determined from the core word's part-of-speech frequency; determining the probabilities of first participles separately for these two cases helps improve accuracy and better fits the actual situation.
For example, fig. 6B is a schematic diagram illustrating a principle of determining a probability corresponding to a first word segmentation provided in the embodiment of the present disclosure, and steps corresponding to fig. 6B have been described in the above embodiment, and are not described again here.
In some embodiments, the path information comprises: the target identification text respectively corresponds to first path information and second path information under different path types;
the determining the similarity between the target recognition text and the target information according to the path information may specifically include:
determining a first probability value corresponding to a first path type based on the first path information, and determining a second probability value corresponding to a second path type based on the second path information;
and determining the similarity of the target identification text and the target information according to the first probability value and the second probability value.
The path types may include a split path and a generation path. The split path may be understood as a path connecting the central node to different child nodes (which may also be referred to as a path corresponding to the first association relationship), and the generation path may be understood as a connection path between the child nodes (which may also be referred to as a path corresponding to the second association relationship). When the first path type is the split path, the second path type is the generation path; when the first path type is the generation path, the second path type is the split path. The first path information may include the child nodes included under the first path type, and the second path information may include the child nodes included under the second path type.
Specifically, based on the error correction map, the first path information and the second path information respectively corresponding to the target recognition text under different path types can be determined. Based on the first path information, by querying a path conversion table (table 6) corresponding to the error correction map, first probability values respectively corresponding to child nodes (i.e., participles) included under the first path type can be determined, and based on the second path information, by querying a path conversion table (table 6) corresponding to the error correction map, second probability values respectively corresponding to child nodes included under the second path type can be determined. After the first probability value and the second probability value are obtained, the similarity between the target recognition text and the target information can be determined according to the first probability value and the second probability value through the following formula.
$$S_{NT} = \sum_{i \in split} p_{i} q_{i} + \sum_{j \in generate} p_{j} q_{j}$$
where $S_{NT}$ denotes the similarity between the target recognition text and the target information. The first part is the first probability value of the target recognition text under the split path, namely the weighted sum of the weight values of the child nodes obtained by splitting the target recognition text: $p_{i}$ is the weight value of the $i$-th participle, obtained by looking it up in Table 6, and $q_{i}$ is the ratio of the length of the $i$-th participle to the total length of the target recognition text. The second part is the second probability value of the target recognition text under the generation path, namely the weighted sum of the weight values of the child nodes used to generate the target recognition text: $p_{j}$ is the weight value of the $j$-th participle, obtained by looking it up in Table 6, and $q_{j}$ is the ratio of the length of the $j$-th participle to the total length of the target recognition text. The subscripts split and generate denote the split path and the generation path; $i$ and $j$ range over the split path and the generation path of the target recognition text, respectively.
Illustratively, taking "dAf" as the target recognition text, its split path is "d-A-f" and its generation path is "d-A-f".
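The following sketch evaluates the formula above for this example. The generation-path weights are the Table 6 entries for "dAf"; the split-path weights are hypothetical, since Table 6 does not list split-path rows for "dAf":

```python
def path_score(parts, weights, total_len):
    # One summand of the formula: sum of p * q, where q is the participle's
    # share of the target recognition text's total length.
    return sum(w * (len(p) / total_len) for p, w in zip(parts, weights))

parts = ["d", "A", "f"]                  # both paths of "dAf" are d-A-f
generate_weights = [0.994, 0.994, 0.99]  # d->A, A->f, f->dAf (Table 6)
split_weights = [0.8, 0.994, 0.7]        # hypothetical values for illustration

total_len = len("dAf")
s_nt = (path_score(parts, split_weights, total_len)
        + path_score(parts, generate_weights, total_len))
```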
For example, fig. 6C is a schematic diagram of a similarity determining method provided in the embodiment of the present disclosure, and steps corresponding to fig. 6C have been described in the above embodiment, and are not repeated here.
Illustratively, table 7 is an example of determining the similarity by the above method, as shown in table 7 below:
TABLE 7
[Table 7 is rendered as an image in the original publication; its contents are not reproduced here.]
In this embodiment, determining the similarity by the above method achieves high accuracy.
In some embodiments, the path information comprises: for the path corresponding to the target recognition text, the probability that the participle represented by each child node is related to the target recognition text, and the weight factor corresponding to the participle represented by each child node;
the determining the similarity between the target recognition text and the target information according to the path information includes:
and determining the similarity between the target recognition text and the target information according to the related probability and the corresponding weight factor.
Specifically, based on the error correction map, the probability that the participle represented by each child node on the generation path (or split path) of the target recognition text is related to the target recognition text (i.e., the corresponding weight value), and the weight factor corresponding to each such participle (which may be user-defined), can be determined; the similarity between the target recognition text and the target information is then obtained by a weighted sum of the related probabilities with their corresponding weight factors.
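A minimal sketch of this simpler variant follows; the probabilities and user-defined factors below are illustrative assumptions, not values from the patent:

```python
related_probs  = [0.994, 0.994, 0.99]  # weight values of child nodes on one path
weight_factors = [1.0, 1.0, 0.9]       # user-defined factors per participle

similarity = sum(p * f for p, f in zip(related_probs, weight_factors))
```

Depending on how the weight factors are chosen, the resulting sum may additionally be scaled into [0, 1] before being compared with the preset threshold.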
For example, fig. 6D is a schematic diagram of another similarity determining method provided in the embodiment of the present disclosure, and steps corresponding to fig. 6D have been described in the above embodiment, and are not repeated here.
In this embodiment, determining the similarity by this method is simple and fast.
In some embodiments, the method further comprises: if the user is determined to be a non-target user according to the target voiceprint features or the similarity does not exceed the preset threshold, processing the target identification text to obtain a corresponding processing result;
and modifying the processing result by a preset voice error correction method based on the processing result to obtain a modified text.
The preset speech error correction method may include an edit distance error correction method, a pinyin similarity error correction method, and the like, which is not limited in this embodiment.
Specifically, if the user is determined to be a non-target user according to the target voiceprint feature, or the similarity does not exceed the preset threshold, word segmentation and labeling are performed on the target recognition text to obtain a corresponding processing result; the processing result is then modified by a preset speech error correction method to obtain the modified text.
Optionally, if the target information corresponding to the target recognition text cannot be determined, the target recognition text may also be processed to obtain a corresponding processing result, and the processing result is modified by a preset voice error correction method based on the processing result to obtain a modified text.
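As one concrete instance of this fallback, the sketch below implements edit-distance-based correction against a list of candidate phrases; the candidate list and the distance threshold are illustrative assumptions, not specified by the patent:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a one-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,           # deletion
                                     dp[j - 1] + 1,       # insertion
                                     prev + (ca != cb))   # substitution
    return dp[-1]

def fallback_correct(text: str, candidates: list[str], max_dist: int = 2) -> str:
    """Return the closest candidate if it is within max_dist edits."""
    best = min(candidates, key=lambda c: edit_distance(text, c))
    return best if edit_distance(text, best) <= max_dist else text
```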
In this embodiment, such fallback modification further improves the coverage of the speech processing method and ensures that speech can be processed under various conditions.
Fig. 7 is a flowchart illustrating another speech processing method according to an embodiment of the disclosure. This embodiment explains the entire speech processing procedure.
As shown in fig. 7, the method specifically includes the following steps:
and S7001, recognizing the voice data to obtain a corresponding target recognition text and a target voiceprint feature.
And S7002, determining whether the user is the target user according to the target voiceprint characteristics.
If yes, executing S7003; if not, S7007-S7008 are executed.
And S7003, determining target information corresponding to the target recognition text based on the error correction map corresponding to the target user.
And S7004, acquiring the similarity between the target recognition text and the target information.
And S7005, determining whether the similarity exceeds a preset threshold value.
If yes, executing S7006; if not, S7007-S7008 are executed.
And S7006, modifying the target identification text into target information.
And S7007, processing the target recognition text to obtain a corresponding processing result.
And S7008, modifying the processing result through a preset voice error correction method based on the processing result to obtain a modified text.
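The sketch below mirrors the control flow of fig. 7. Every helper is injected as a parameter because the patent does not name concrete APIs; each one is an assumption standing in for the corresponding component described above, and the 0.8 threshold is likewise illustrative:

```python
def process_speech(audio, recognize, match_user, error_maps,
                   compute_similarity, segment_and_tag, fallback_correct,
                   threshold=0.8):
    text, voiceprint = recognize(audio)            # S7001
    user = match_user(voiceprint)                  # S7002
    if user is not None and user in error_maps:    # target-user branch
        emap = error_maps[user]
        target = emap.get(text)                    # S7003 (dict-like map assumed)
        if target is not None:
            sim = compute_similarity(emap, text, target)  # S7004
            if sim > threshold:                    # S7005
                return target                      # S7006
    result = segment_and_tag(text)                 # S7007
    return fallback_correct(result)                # S7008
```

Passing the helpers in as parameters keeps the flow self-contained and testable; any of the similarity sketches earlier in this section could be supplied as compute_similarity.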
Fig. 8A is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present disclosure; the apparatus is configured in an electronic device and can implement the speech processing method of any embodiment of the application. The apparatus specifically comprises the following modules:
the recognition module 801 is configured to recognize the voice data to obtain a corresponding target recognition text and a corresponding target voiceprint feature;
a first determining module 802, configured to determine a target user according to the target voiceprint feature;
a second determining module 803, configured to determine, based on an error correction map corresponding to the target user, target information corresponding to the target identification text, where the error correction map includes: the corresponding relation between the target identification text and the target information;
a modifying module 804, configured to obtain a similarity between the target identification text and the target information, and modify the target identification text into the target information if the similarity exceeds a preset threshold.
Fig. 8B is a schematic structural diagram of a modification module in the speech processing apparatus according to the embodiment of the disclosure, and as shown in fig. 8B, the modification module 804 includes:
a path information determining unit 8041, configured to determine, based on the error correction map, path information corresponding to the target identification text;
a similarity determining unit 8042, configured to determine, according to the path information, a similarity between the target identification text and the target information;
a modifying unit 8043, configured to modify the target identification text into the target information if the similarity exceeds a preset threshold.
As an optional implementation manner of the embodiment of the present disclosure, the path information includes: the target identification text respectively corresponds to first path information and second path information under different path types;
the similarity determination unit is configured to:
determining a first probability value corresponding to a first path type based on the first path information, and determining a second probability value corresponding to a second path type based on the second path information;
and determining the similarity of the target identification text and the target information according to the first probability value and the second probability value.
As an optional implementation manner of the embodiment of the present disclosure, the path information includes: for the path corresponding to the target recognition text, the probability that the participle represented by each child node is related to the target recognition text, and the weight factor corresponding to the participle represented by each child node;
the similarity determination unit is further configured to: and determining the similarity between the target recognition text and the target information according to the related probability and the corresponding weight factor.
As an optional implementation manner of the embodiment of the present disclosure, the apparatus further includes: an error correction map establishing module, configured to:
acquiring a first participle contained in corpus information related to the target information and a second participle contained in label information corresponding to the target information;
and establishing the error correction map by taking the target information as a central node, taking the first participle, the second participle and the generated information as child nodes, taking the first association relationships between the central node and the different child nodes as the edges between the central node and the different child nodes, and taking the second association relationships between the child nodes as the edges between the child nodes.
As an optional implementation manner of the embodiment of the present disclosure, the apparatus further includes: a weight value determination module to:
determining the probability corresponding to the first participle according to the frequency corresponding to the syntactic dependency relationship of the first participle or the part-of-speech frequency of a target participle corresponding to the core relationship in the syntactic dependency relationship;
determining the probability corresponding to the second participle according to the initial weight and the weight factor corresponding to the second participle;
and respectively determining the weight values of corresponding edges in the error correction map based on the probability corresponding to the first participle and the probability corresponding to the second participle.
As an optional implementation manner of the embodiment of the present disclosure, the apparatus further includes:
the processing module is used for processing the target recognition text to obtain a corresponding processing result if the user is determined to be a non-target user according to the target voiceprint feature or the similarity does not exceed the preset threshold;
and the text determining module is used for modifying the processing result through a preset voice error correction method based on the processing result to obtain a modified text.
The speech processing apparatus provided in the embodiment of the present disclosure can execute the speech processing method provided in any embodiment of the present disclosure, and has functional modules and beneficial effects corresponding to the execution method, and in order to avoid repetition, details are not repeated here.
An embodiment of the present disclosure provides an electronic device, including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the speech processing method of any of the embodiments of the present disclosure.
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 9, the electronic device includes a processor 910 and a storage 920; the number of the processors 910 in the electronic device may be one or more, and one processor 910 is taken as an example in fig. 9; the processor 910 and the storage 920 in the electronic device may be connected by a bus or other means, and fig. 9 illustrates the connection by the bus as an example.
The storage 920 is a computer-readable storage medium for storing software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the voice processing method in the embodiments of the present disclosure. The processor 910 executes various functional applications and data processing of the electronic device by executing software programs, instructions and modules stored in the storage 920, that is, implements the voice processing method provided by the embodiment of the present disclosure.
The storage 920 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Additionally, the storage 920 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the storage 920 may further include memory located remotely from the processor 910, which may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device provided by this embodiment can be used to execute the voice processing method provided by any of the above embodiments, and has corresponding functions and advantages.
The embodiment of the present disclosure provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process executed by the foregoing speech processing method, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the foregoing discussion in some embodiments is not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A method of speech processing, the method comprising:
identifying voice data to obtain a corresponding target identification text and a target voiceprint characteristic;
determining a target user according to the target voiceprint characteristics;
determining target information corresponding to the target recognition text based on an error correction map corresponding to the target user, wherein the error correction map comprises: the corresponding relation between the target identification text and the target information;
and acquiring the similarity between the target identification text and the target information, and modifying the target identification text into the target information if the similarity exceeds a preset threshold value.
2. The method according to claim 1, wherein the obtaining of the similarity between the target recognition text and the target information comprises:
determining path information corresponding to the target recognition text based on the error correction map;
and determining the similarity between the target recognition text and the target information according to the path information.
3. The method of claim 2, wherein the path information comprises: the target identification text respectively corresponds to first path information and second path information under different path types;
the determining the similarity between the target recognition text and the target information according to the path information includes:
determining a first probability value corresponding to a first path type based on the first path information, and determining a second probability value corresponding to a second path type based on the second path information;
and determining the similarity of the target identification text and the target information according to the first probability value and the second probability value.
4. The method of claim 2, wherein the path information comprises: for the path corresponding to the target recognition text, the probability that the participle represented by each child node is related to the target recognition text, and the weight factor corresponding to the participle represented by each child node;
the determining the similarity between the target recognition text and the target information according to the path information includes:
and determining the similarity between the target recognition text and the target information according to the related probability and the corresponding weight factor.
5. The method of claim 1, further comprising:
acquiring a first participle contained in corpus information related to the target information and a second participle contained in label information corresponding to the target information;
and establishing the error correction map by taking the target information as a central node, taking the first participle, the second participle and the generated information as child nodes, taking the first association relationships between the central node and the different child nodes as the edges between the central node and the different child nodes, and taking the second association relationships between the child nodes as the edges between the child nodes.
6. The method of claim 5, further comprising:
determining the probability corresponding to the first participle according to the frequency corresponding to the syntactic dependency relationship of the first participle or the part-of-speech frequency of the target participle corresponding to the core relationship in the syntactic dependency relationship;
determining the probability corresponding to the second participle according to the initial weight and the weight factor corresponding to the second participle;
and respectively determining the weight values of the corresponding edges in the error correction map based on the probability corresponding to the first participle and the probability corresponding to the second participle.
7. The method of any one of claims 1-6, further comprising:
if the user is determined to be a non-target user according to the target voiceprint features or the similarity does not exceed the preset threshold, processing the target identification text to obtain a corresponding processing result;
and modifying the processing result by a preset voice error correction method based on the processing result to obtain a modified text.
8. A speech processing apparatus, characterized in that the apparatus comprises:
the recognition module is used for recognizing the voice data to obtain a corresponding target recognition text and a target voiceprint characteristic;
the first determining module is used for determining a target user according to the target voiceprint characteristics;
a second determining module, configured to determine, based on an error correction map corresponding to the target user, target information corresponding to the target identification text, where the error correction map includes: the corresponding relation between the target identification text and the target information;
and the modification module is used for acquiring the similarity between the target identification text and the target information, and modifying the target identification text into the target information if the similarity exceeds a preset threshold value.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.