CN117672184A - Data processing method, device, equipment, storage medium and program product - Google Patents

Data processing method, device, equipment, storage medium and program product Download PDF

Info

Publication number
CN117672184A
Authority
CN
China
Prior art keywords
data
music
sample
phoneme
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211027484.XA
Other languages
Chinese (zh)
Inventor
谭维
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202211027484.XA priority Critical patent/CN117672184A/en
Publication of CN117672184A publication Critical patent/CN117672184A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/26 Speech to text systems
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The embodiments of the present application disclose a data processing method, apparatus, device, storage medium and program product, which can be applied to artificial intelligence scenarios. The method includes: acquiring music dry sound data in music to be identified, and extracting music rhythm data and music audio frame data from the music dry sound data; obtaining a phoneme start-stop time set associated with N phonemes based on the music rhythm data and a phoneme state parameter; determining a musical acoustic feature probability of a music phoneme sequence associated with the N phonemes based on music acoustic features corresponding to the music audio frame data and the phoneme start-stop time set; and acquiring, based on dictionary data for the music to be identified, M candidate texts corresponding to the music phoneme sequence, and determining music text data corresponding to the music to be identified from the M candidate texts. By adopting the embodiments of the present application, the accuracy of music recognition can be improved.

Description

Data processing method, device, equipment, storage medium and program product
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a data processing method, apparatus, device, storage medium, and program product.
Background
In audio recognition scenarios, existing audio recognition methods typically perform recognition on the frame data of the audio data in order to predict the text data corresponding to the audio data. However, when audio recognition is applied to music to be identified, the music usually contains background sound and the singing style differs from everyday speech, so the text data obtained by recognizing the music with existing audio recognition methods may contain errors, which reduces the accuracy of audio recognition.
Disclosure of Invention
The embodiment of the application provides a data processing method, a device, equipment, a storage medium and a program product, which can improve the accuracy of music identification.
In one aspect, an embodiment of the present application provides a data processing method, including:
acquiring music dry sound data in music to be identified, and respectively extracting music rhythm data and music audio frame data in the music dry sound data;
based on the music rhythm data and the phoneme state parameters, carrying out state alignment processing on the music audio frame data to obtain a phoneme starting and ending time set associated with N phonemes; n is a positive integer;
Determining musical acoustic feature probabilities of musical phoneme sequences associated with the N phonemes based on musical acoustic features corresponding to the musical audio frame data and the phoneme start-stop time set;
based on dictionary data of music to be identified, M candidate texts corresponding to music phoneme sequences are obtained, and based on the acoustic feature probability of the music and text sequence probabilities corresponding to the M candidate texts, music text data corresponding to the music to be identified are determined from the M candidate texts; m is a positive integer.
An aspect of an embodiment of the present application further provides a data processing method, including:
when sample data comprising sample audio data and sample text data is obtained, respectively extracting sample rhythm data, sample audio frame data and sample pitch data from sample dry sound data in the sample audio data; sample audio data carries a sample tag; the sample tag is used for representing actual text data corresponding to the sample audio data;
dictionary data in the initial audio recognition model is obtained, and phoneme conversion processing is carried out on the basis of actual text data, dictionary data and sample pitch data to obtain a sample phoneme string;
based on the sample rhythm data and the phoneme state parameters, carrying out state alignment processing on the sample audio frame data to obtain a sample start-stop time set associated with the sample phoneme string;
Determining a sample acoustic feature probability of a sample phoneme sequence associated with the sample phoneme string based on the sample acoustic feature and the sample start-stop time set corresponding to the sample audio frame data;
based on sample text data, dictionary data and sample acoustic feature probability of the sample phoneme sequence, obtaining predicted text data corresponding to the sample phoneme sequence;
training the initial audio recognition model based on the sample text data, the actual text data and the predicted text data to obtain a music audio recognition model; the music audio recognition model is used for predicting music text data of music to be recognized.
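For ease of understanding, the dictionary-driven step of this training pipeline, in which the labeled actual text data is converted into a sample phoneme string, can be illustrated with a minimal Python sketch; the dictionary entries, phoneme symbols and pitch values below are invented for illustration and are not taken from the patent.

```python
# Minimal sketch of dictionary-based phoneme conversion; the dictionary entries,
# phoneme symbols and pitch values are invented for illustration only.
DICTIONARY = {
    "好": ["H", "AO"],   # hypothetical lexicon entry: character -> phoneme list
    "运": ["VN"],
}

def text_to_phoneme_string(actual_text, sample_pitches):
    """Pair every phoneme of each labeled character with the pitch extracted
    for that character from the sample pitch data."""
    phoneme_string = []
    for char, pitch in zip(actual_text, sample_pitches):
        for phoneme in DICTIONARY.get(char, ["<UNK>"]):
            phoneme_string.append((phoneme, pitch))  # the phoneme carries its pitch
    return phoneme_string

print(text_to_phoneme_string("好运", [67, 69]))
# -> [('H', 67), ('AO', 67), ('VN', 69)]
```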
An aspect of an embodiment of the present application provides a data processing apparatus, including:
the dry sound data acquisition module is used for acquiring music dry sound data in the music to be identified, and respectively extracting music rhythm data and music audio frame data from the music dry sound data;
the music state alignment module is used for carrying out state alignment processing on the music audio frame data based on the music rhythm data and the phoneme state parameters to obtain a phoneme starting and ending time set associated with the N phonemes; n is a positive integer;
a feature probability determining module, configured to determine a musical acoustic feature probability of a musical phoneme sequence associated with the N phonemes based on musical acoustic features corresponding to the musical audio frame data and the phoneme start-stop time set;
The text data determining module is used for acquiring M candidate texts corresponding to the music phoneme sequences based on dictionary data of the music to be identified, and determining music text data corresponding to the music to be identified from the M candidate texts based on the music acoustic feature probability and text sequence probabilities respectively corresponding to the M candidate texts; m is a positive integer.
Wherein the music rhythm data is composed of P pitches; p is a positive integer less than or equal to N; n is the total number of phonemes corresponding to P pitches;
the music state alignment module includes:
the initial alignment unit is used for carrying out initial alignment processing on the music audio frame data based on the P pitches and the phoneme state parameter to obtain first alignment data; the first alignment data is used for indicating a first start-stop time corresponding to each of the N phonemes; the music audio frame data includes an audio frame Vi; i is a positive integer less than or equal to Q; Q is the number of audio frames corresponding to the music audio frame data;
the comprehensive probability acquisition unit is used for acquiring, based on the first alignment data, the state comprehensive probability corresponding to the audio frame Vi; the state comprehensive probability is jointly determined by the state transition probability corresponding to the audio frame Vi and the state emission probability corresponding to the audio frame Vi;
the adjusting and aligning unit is used for adjusting and aligning the first alignment data to obtain second alignment data when the state comprehensive probability corresponding to each audio frame is obtained;
and the time set acquisition unit is used for acquiring a second start-stop time corresponding to each of the N phonemes from the second alignment data, and obtaining the phoneme start-stop time set associated with the N phonemes based on the N second start-stop times.
Wherein the initial alignment unit includes:
the frame number determination subunit is used for acquiring a pitch Yj from the P pitches and determining the pitch start-stop frame numbers of the pitch Yj in the music audio frame data; j is a positive integer less than or equal to P;
the time determination subunit is used for determining, based on the phoneme state parameter and the number of phonemes corresponding to the pitch Yj, the first start-stop time corresponding to each phoneme in the pitch Yj from the pitch start-stop frame numbers;
and the alignment data determination subunit is used for determining the first alignment data corresponding to the music audio frame data based on the first start-stop time corresponding to each of the N phonemes associated with the P pitches.
Wherein the musical acoustic feature probability is determined based on a business acoustic model in the music audio recognition model; the business acoustic model includes a first sub-model and a second sub-model;
the feature probability determining module includes:
the feature extraction unit is used for carrying out feature extraction processing on the music audio frame data based on the phoneme starting and ending time set to obtain music acoustic features;
the phoneme recognition unit is used for inputting the acoustic features of the music into the first sub-model, and performing phoneme recognition processing on the acoustic features of the music by the first sub-model to obtain a sequence probability corresponding to the initial phoneme sequence;
a transition probability determining unit, configured to determine a phoneme transition probability corresponding to the initial phoneme sequence based on the phoneme start-stop time set and the second sub-model;
a phoneme conversion unit, configured to perform phoneme conversion processing on the initial phoneme sequence based on the phoneme transition probability, to obtain the music phoneme sequence associated with the N phonemes;
and a feature determining unit for determining musical acoustic feature probabilities of the musical phoneme sequences based on the phoneme sequence probabilities and the phoneme transition probabilities of the musical phoneme sequences.
Wherein, the text data determining module includes:
a text acquisition unit for acquiring M candidate texts corresponding to the music phoneme sequence based on dictionary data when dictionary data for music to be identified is acquired from the music audio recognition model;
The text input unit is used for inputting M candidate texts into a business language model in the music audio recognition model, and outputting text sequence probabilities corresponding to the M candidate texts respectively by the business language model;
the matching probability acquisition unit is used for acquiring text matching probabilities corresponding to the M candidate texts respectively based on the music acoustic feature probabilities and the text sequence probabilities corresponding to the M candidate texts respectively;
the text determining unit is used for acquiring the highest text matching probability from the M text matching probabilities, and taking the candidate text corresponding to the highest text matching probability as music text data corresponding to the music to be identified.
An aspect of an embodiment of the present application provides a data processing apparatus, including:
a sample audio acquisition module, configured to, when sample data including sample audio data and sample text data is acquired, respectively extract sample rhythm data, sample audio frame data, and sample pitch data from sample dry sound data in the sample audio data; sample audio data carries a sample tag; the sample tag is used for representing actual text data corresponding to the sample audio data;
the phoneme string acquisition module is used for acquiring dictionary data in the initial audio recognition model, and performing phoneme conversion processing based on the actual text data, the dictionary data and the sample pitch data to obtain a sample phoneme string;
The sample state alignment module is used for carrying out state alignment processing on the sample audio frame data based on the sample rhythm data and the phoneme state parameters to obtain a sample start-stop time set associated with the sample phoneme string;
a sample probability determining module, configured to determine a sample acoustic feature probability of a sample phoneme sequence associated with the sample phoneme string based on a sample acoustic feature corresponding to the sample audio frame data and a sample start-stop time set;
the predicted text acquisition module is used for acquiring predicted text data corresponding to the sample phoneme sequence based on the sample text data, the dictionary data and the sample acoustic feature probability of the sample phoneme sequence;
the model training module is used for training the initial audio recognition model based on the sample text data, the actual text data and the predicted text data to obtain a music audio recognition model; the music audio recognition model is used for predicting music text data of music to be recognized.
Wherein, the phoneme string acquisition module includes:
the phoneme string determining unit is used for acquiring the dictionary data in the initial audio recognition model, performing phoneme conversion processing on the actual text data based on the dictionary data, and determining an initial phoneme string corresponding to the sample pitch data; the initial phoneme string carries a first pitch;
A parameter determining unit, configured to obtain a pitch modification rule that matches an audio type of the sample audio data, and determine a pitch modification parameter corresponding to the initial phoneme string in the pitch modification rule based on a pitch frequency interval to which the initial phoneme string belongs;
and the pitch changing unit is configured to change the first pitch to a second pitch based on the pitch modification parameter, and determine the initial phoneme string having the second pitch as the sample phoneme string.
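The pitch modification rule described above can be illustrated with a hypothetical sketch; the audio type, frequency intervals and semitone shifts below are assumptions made purely for illustration, since the patent does not specify concrete values or a concrete rule format.

```python
# Hypothetical pitch modification rule; the audio type, frequency intervals and
# semitone shifts are assumptions for illustration, not values from the patent.
PITCH_MODIFICATION_RULES = {
    "pop_song": [            # audio type -> (low_hz, high_hz, semitone_shift)
        (0.0, 200.0, 2),     # low-register samples are shifted up
        (200.0, 500.0, 0),   # mid-register samples are left unchanged
        (500.0, 2000.0, -2), # high-register samples are shifted down
    ],
}

def modify_pitch(initial_phoneme_string, audio_type, mean_pitch_hz):
    """Change the first pitch carried by the phoneme string to a second pitch,
    using the rule matched to the audio type and the pitch frequency interval."""
    rules = PITCH_MODIFICATION_RULES[audio_type]
    shift = next(s for lo, hi, s in rules if lo <= mean_pitch_hz < hi)
    # A MIDI-style pitch number moves by one per semitone.
    return [(phoneme, pitch + shift) for phoneme, pitch in initial_phoneme_string]

sample_phoneme_string = modify_pitch(
    [("H", 67), ("AO", 67), ("VN", 69)], "pop_song", 150.0)
print(sample_phoneme_string)  # -> [('H', 69), ('AO', 69), ('VN', 71)]
```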
Wherein the sample text data comprises original text data and lyric text data;
the model training module includes:
the language model acquisition unit is used for determining a first model loss of an initial language model in the initial audio recognition model based on the original text data and the lyric text data, and training the initial language model based on the first model loss to obtain a business language model;
the acoustic model acquisition unit is used for determining second model loss of the initial acoustic model in the initial audio recognition model based on the actual text data and the predicted text data, and training the initial acoustic model based on the second model loss to obtain a business acoustic model;
and the music model determining unit is used for taking the initial audio recognition model including the business language model and the business acoustic model as the music audio recognition model.
In one aspect, the present application provides a computer device comprising: a processor, a memory, a network interface;
the processor is connected to the memory and the network interface, where the network interface is used to provide a data communication function, the memory is used to store a computer program, and the processor is used to call the computer program to make the computer device execute the method in the embodiment of the present application.
In one aspect, embodiments of the present application provide a computer readable storage medium having a computer program stored therein, the computer program being adapted to be loaded by a processor and to perform a method according to embodiments of the present application.
In one aspect, embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium; the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the methods in the embodiments of the present application.
In the embodiments of the present application, when a computer device having a music recognition function acquires music to be identified, it may extract, from the music dry sound data of the music to be identified, the music rhythm data and the music audio frame data used for state alignment processing, so as to obtain a more accurate phoneme start-stop time set. Further, in subsequent steps, a more accurate musical acoustic feature probability corresponding to the music phoneme sequence can be obtained from the phoneme start-stop time set and the music acoustic features corresponding to the music audio frame data. When the M candidate texts corresponding to the music phoneme sequence are subsequently acquired (M is a positive integer) and audio recognition is performed through the text sequence probabilities corresponding to the M candidate texts, the accuracy of audio recognition can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a system architecture according to an embodiment of the present application;
fig. 2 is a schematic view of a scenario for music recognition according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a data processing method according to an embodiment of the present application;
fig. 4 is a schematic view of a scenario for performing a state alignment process according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 6 is a schematic view of a scenario for model training provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a data processing system according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
The embodiment of the application provides a music recognition method based on a music audio recognition model, and the method relates to the field of artificial intelligence. Among these, artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology (voice technology), a natural language processing technology, machine learning/deep learning, automatic driving, intelligent traffic and other directions.
The key technologies of speech technology (Speech Technology) include automatic speech recognition technology, speech synthesis technology and voiceprint recognition technology. Enabling computers to listen, see, speak and feel is the future direction of human-computer interaction, and voice is expected to become one of the best modes of human-computer interaction in the future.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behavior in order to acquire new knowledge or skills, and how it can reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks (e.g., the trained music audio recognition model), belief networks, reinforcement learning, transfer learning, inductive learning and teaching learning.
Referring to fig. 1, fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present application. As shown in fig. 1, the system may include a server 10F and a terminal cluster, which may include one or more terminal devices, and the number of terminal devices is not limited in this application. As shown in fig. 1, specifically may include: the terminal devices 100a, 100b, 100c, …, 100n may respectively perform network connection with the server 10F as shown in fig. 1, so that each terminal device may perform data interaction with the server 10F through the network connection. The network connection is not limited to a connection manner, and may be directly or indirectly connected through a wired communication manner, may be directly or indirectly connected through a wireless communication manner, or may be other manners, which is not limited herein.
Each terminal device in the terminal cluster may include smart terminals with a music recognition function, such as smart phones, tablet computers, notebook computers, desktop computers, smart speakers, smart watches, vehicle-mounted terminals and smart televisions. It should be understood that each terminal device in the terminal cluster shown in fig. 1 may be provided with an application client, and when the application client runs in a terminal device, data interaction may be performed between the application client and the server 10F shown in fig. 1. The application client may include application clients having a music recognition function, such as a social client, a multimedia client (e.g., a music client), an entertainment client (e.g., a game client), an education client and a live-streaming client. The application client may be an independent client, or may be an embedded sub-client integrated in a certain client (for example, a social client, an education client or a multimedia client), which is not limited here. In addition, there may be communication connections between the terminal devices in the terminal cluster, for example a communication connection (for data transmission and interaction) between the terminal device 100a and the terminal device 100b, and a communication connection (for data transmission and interaction) between the terminal device 100a and the terminal device 100c.
As shown in fig. 1, the server 10F in the embodiment of the present application may be a server corresponding to the application client, where the server 10F may be an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud computing services.
For ease of understanding, the embodiment of the present application may select one terminal device from the plurality of terminal devices shown in fig. 1 as the terminal device for music recognition. For example, the embodiment of the present application may select the terminal device 100a shown in fig. 1 as an object terminal device for performing music recognition processing, in which an application client may be integrated. At this time, the object terminal device may implement data interaction between the application client and the server 10F. The application client can run a trained music audio recognition model, which is a neural network model for predicting music text data of music to be recognized.
It should be understood that, in the embodiment of the present application, the music to be identified acquired by the computer device having the music identification function (for example, the server 10F or the object terminal device shown in fig. 1) may be the music that needs to be identified currently, for example, if the application client accessed by the service object (i.e., the user) on the object terminal device is a video client, the music to be identified may be the music audio data intercepted in the multimedia data (for example, the song data on a certain music program) being played in the video client. For another example, if the application client accessed by the service object on the object terminal device is a music client, the music to be identified may be music audio data played by another device (for example, a television) collected by the object terminal device. For another example, the music to be identified may also be music audio data sung by the service object directly collected by the object terminal device. Of course, the music to be identified may also be audio data in other scenes, and the music to be identified will not be exemplified here one by one.
The computer device may extract music rhythm data and music audio frame data from the music dry sound data in the music to be identified, so as to obtain more accurate music text data later. The music dry sound data may be pure vocal data obtained by the computer device by performing sound source separation on the music to be identified and stripping background music such as accompaniment from it. The music rhythm data may be data composed of P pitches extracted by the computer device from the music dry sound data through an industry-standard electronic communication protocol in the music field, where P is a positive integer. In the embodiments of the present application, one pitch may correspond to one character, for example one sung character or one word, and the music rhythm data may represent the various notes or playing codes defined by a playing device such as an electronic musical instrument, for example Musical Instrument Digital Interface (MIDI) data. In short, the music rhythm data may include the beat data of the music, understood in the time domain. The music audio frame data may be data obtained by performing framing processing on the music dry sound data.
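For ease of understanding, the following is a minimal sketch of one possible representation of such MIDI-like music rhythm data and of how a pitch event can be mapped to indices of the music audio frame data; the note numbers, times and the 10 ms frame hop are assumptions for illustration only.

```python
# A minimal, assumed representation of MIDI-like music rhythm data as pitch
# events; the note numbers, times and the 10 ms frame hop are illustrative only.
from dataclasses import dataclass

@dataclass
class PitchEvent:
    note: int        # MIDI-style note number (the pitch)
    start_s: float   # onset time within the dry vocal, in seconds
    end_s: float     # offset time, in seconds

music_rhythm_data = [
    PitchEvent(note=67, start_s=0.00, end_s=0.32),  # roughly one sung character
    PitchEvent(note=69, start_s=0.32, end_s=0.60),
]

def pitch_to_frame_span(event, hop_s=0.01):
    """Map a pitch event to start/end indices of the music audio frame data."""
    return round(event.start_s / hop_s), round(event.end_s / hop_s)

print([pitch_to_frame_span(e) for e in music_rhythm_data])  # -> [(0, 32), (32, 60)]
```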
For the convenience of subsequent understanding and explanation, please refer to fig. 2, which is a schematic view of a scene for music recognition according to an embodiment of the present application. As shown in fig. 2, the computer device in the embodiment of the present application may recognize the music to be identified (e.g., the music to be identified 2S1 shown in fig. 2). The computer device may be the server 10F in the embodiment corresponding to fig. 1, or may be any terminal device in the terminal cluster, for example the terminal device 100a, which is not limited here.
The music audio recognition model running on the computer device (e.g., the audio recognition model 2000A shown in fig. 2) may include a business acoustic model (e.g., the acoustic model 200B shown in fig. 2) and a business language model (e.g., the language model 200A shown in fig. 2). The acoustic model 200B is composed of a first sub-model 20B1 (e.g., a frame-to-monophone classification model) and a second sub-model 20B2 (e.g., a phoneme transition probability model), and the acoustic model 200B may be used to predict the musical acoustic feature probability of a certain phoneme sequence (e.g., the phoneme sequence 2S6 shown in fig. 2). The language model 200A is configured to obtain the text sequence probability corresponding to each candidate text in the candidate text set. The candidate text set may include M candidate texts (M is a positive integer); in fig. 2 the number of candidate texts may be two, i.e., the candidate text set may include the candidate text 2S71 and the candidate text 2S72.
When the computer device acquires the music to be identified 2S1, it may perform sound source separation processing on the music to be identified 2S1 so as to strip the background music from it, obtaining the dry sound data 2S2 (i.e., the music dry sound data), and may further extract, from the dry sound data 2S2, the rhythm data 2S3 (i.e., the music rhythm data) and the audio frame data 2S4 (i.e., the music audio frame data). The extraction process may be implemented with an existing neural network model and will not be described in detail here.
It should be appreciated that the computer device may acquire the phoneme state parameter associated with the music to be identified 2S1. Here, the phoneme state parameter is used to describe the number of states corresponding to each phoneme. For example, if the phoneme state parameter is 5, one phoneme may correspond to five states, which may specifically include a first state (e.g., a "creating state"), a second state (e.g., an "initial sound state"), a third state (e.g., a "continuous sound state"), a fourth state (e.g., an "ending sound state"), and a fifth state (e.g., an "ending state"). Alternatively, if the phoneme state parameter is 3, one phoneme may correspond to three states, which may specifically include a first state (e.g., a "start sound state"), a second state (e.g., a "continuous sound state"), and a third state (e.g., an "end sound state"). The phoneme state parameter may be dynamically selected according to the actual service requirement, which is not limited here.
Further, the computer device may perform state alignment processing on the audio frame data 2S4 based on the rhythm data 2S3 and the phoneme state parameter, so as to obtain the phoneme start-stop time set 2S5 associated with N phonemes, where N is a positive integer. A phoneme is the minimum phonetic unit divided according to the natural attributes of speech. For example, if the number of phonemes is taken to be 3, the phoneme start-stop time set 2S5 may include the start-stop time of phoneme 1, the start-stop time of phoneme 2, and the start-stop time of phoneme 3, where the start-stop time of phoneme 1 refers to, for example, the time from frame 1 to frame 3 in the audio frame data 2S4.
The computer device may then perform feature extraction processing on the audio frame data 2S4 to obtain the acoustic feature 2X1, and may further input the acoustic feature 2X1 and the phoneme start-stop time set 2S5 into the acoustic model 200B, which determines the acoustic feature probability 2G11 of the music phoneme sequence associated with the N phonemes (i.e., the musical acoustic feature probability). For example, the phoneme sequence 2S6 composed of phoneme 1, phoneme 2 and phoneme 3 acquired by the computer device may be "[H][AO][VN]".
Meanwhile, the computer device may acquire, based on the dictionary data for the music to be identified 2S1, the candidate text set corresponding to the phoneme sequence 2S6. For example, in fig. 2 the candidate text 2S71 in the candidate text set may be "good vignetting", and the candidate text 2S72 may be "good fortune".
The computer device may determine, from the candidate text 2S71 and the candidate text 2S72, the music text data corresponding to the music to be identified 2S1, based on the acoustic feature probability 2G11, the text sequence probability 2G21 corresponding to the candidate text 2S71, and the text sequence probability 2G22 corresponding to the candidate text 2S72. For example, the computer device may determine the text matching probability of the candidate text 2S71 based on the acoustic feature probability 2G11 and the text sequence probability 2G21 corresponding to the candidate text 2S71. Similarly, the computer device may determine the text matching probability of the candidate text 2S72 based on the acoustic feature probability 2G11 and the text sequence probability 2G22 corresponding to the candidate text 2S72. The candidate text with the highest of these two text matching probabilities (e.g., the candidate text 2S71) may then be selected as the music text data corresponding to the music to be identified 2S1.
In the embodiment of the present application, the computer device recognizes the music to be identified 2S1 through the audio recognition model 2000A, which includes the acoustic model 200B and the language model 200A, to obtain the phoneme start-stop time set 2S5; the computer device can then predict the phoneme sequence 2S6 with greater accuracy through the acoustic model 200B. Further, through the phoneme sequence 2S6, the computer device can obtain an acoustic feature probability 2G11 that better fits the music to be identified 2S1, and based on the phoneme sequence 2S6 it can acquire candidate texts (e.g., the candidate text 2S71) that are more strongly related to the music to be identified 2S1, after which the text sequence probability corresponding to each candidate text is obtained through the language model 200A. In summary, the computer device can recognize the music to be identified more accurately through the audio recognition model 2000A.
Further, referring to fig. 3, fig. 3 is a flow chart of a data processing method according to an embodiment of the present application. As shown in fig. 3, the method may be performed by a computer device, which may be any one of the terminal devices in the terminal cluster shown in fig. 1, for example, the terminal device 100a, or may be the server 10F shown in fig. 1, which is not limited herein. For ease of understanding, embodiments of the present application will be described with the method being performed by a computer device as an example, the data processing method may at least include the following steps S101-S104:
Step S101, acquiring music dry sound data in the music to be identified, and extracting music rhythm data and music audio frame data in the music dry sound data respectively.
Specifically, when the computer device obtains the music to be identified, it may perform sound source separation processing on the music to be identified so as to strip background music such as accompaniment from it, and may then use the stripped clean pure voice data or virtual voice data as the music dry sound data. Further, the computer device extracts the music rhythm data and the music audio frame data from the music dry sound data, respectively.
The music to be identified may be in any of a plurality of languages, which may specifically include a first language (for example, Chinese), a second language (for example, English), a third language (for example, French) and other languages; in addition, the service types of the music to be identified may include songs, dramas, comments and musical dramas.
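A minimal sketch of turning already separated dry vocal samples into music audio frame data and per-frame acoustic features is given below; the sample rate and window sizes are common defaults rather than values required by the patent, a synthetic signal stands in for the separated vocals, and the librosa calls are merely one convenient way to perform the framing.

```python
# Minimal sketch of framing separated dry vocals into music audio frame data and
# extracting per-frame features; values here are illustrative assumptions.
import numpy as np
import librosa

sr = 16000
dry_vocal = np.random.randn(sr * 2).astype(np.float32)  # stand-in for 2 s of dry vocals

# Frame the dry vocal into 25 ms windows with a 10 ms hop (the music audio frame data).
frames = librosa.util.frame(dry_vocal,
                            frame_length=int(0.025 * sr),
                            hop_length=int(0.010 * sr))
print(frames.shape)  # (frame_length, Q) where Q is the number of audio frames

# Per-frame MFCC features can later serve as the music acoustic features.
mfcc = librosa.feature.mfcc(y=dry_vocal, sr=sr, n_mfcc=13, hop_length=int(0.010 * sr))
print(mfcc.shape)    # (13, number of frames)
```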
Step S102, based on the music rhythm data and the phoneme state parameters, the state alignment processing is performed on the music audio frame data, so as to obtain a phoneme starting and ending time set associated with N phonemes.
Here N is a positive integer. The music rhythm data is composed of P pitches, P being a positive integer less than or equal to N, and N being the total number of phonemes corresponding to the P pitches. Specifically, the computer device may perform initial alignment processing on the music audio frame data through the P pitches and the phoneme state parameter, so as to obtain first alignment data. The first alignment data may be used to indicate the first start-stop time corresponding to each of the N phonemes. The music audio frame data may include an audio frame Vi, where i is a positive integer less than or equal to Q, and Q is the number of audio frames corresponding to the music audio frame data. The computer device may obtain, via the first alignment data, the state comprehensive probability corresponding to the audio frame Vi, where the state comprehensive probability may be jointly determined by the state transition probability corresponding to the audio frame Vi and the state emission probability corresponding to the audio frame Vi. Then, when the computer device has obtained the state comprehensive probability corresponding to each audio frame, it may perform adjustment alignment on the first alignment data to obtain second alignment data. The computer device may further obtain the second start-stop time corresponding to each of the N phonemes from the second alignment data, and obtain the phoneme start-stop time set associated with the N phonemes based on the N second start-stop times.
The computer device may obtain a pitch Yj from the P pitches and determine the pitch start-stop frame numbers of the pitch Yj in the music audio frame data, where j is a positive integer less than or equal to P. Then, based on the phoneme state parameter and the number of phonemes corresponding to the pitch Yj, the computer device may determine, from the pitch start-stop frame numbers, the first start-stop time corresponding to each phoneme in the pitch Yj. Still further, the computer device may determine the first alignment data corresponding to the music audio frame data based on the first start-stop time corresponding to each of the N phonemes associated with the P pitches.
Further, referring to fig. 4, fig. 4 is a schematic view of a scenario for performing a state alignment process according to an embodiment of the present application. As shown in fig. 4, the audio frame data 400R is music audio frame data obtained by extracting and processing musical dry sound data in music to be identified by the computer device. The computer device here is any one of the terminal devices in the terminal cluster shown in fig. 1, for example, the terminal device 100a, and may be the server 10F shown in fig. 1, and is not limited thereto.
As shown in fig. 4, the number of audio frames in the audio frame data 400R may be 15, specifically the audio frames V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12, V13, V14 and V15. It will be appreciated that the music rhythm data extracted by the computer device from the music dry sound data is composed of 2 pitches (e.g., pitch Y1 and pitch Y2), and the phoneme state parameter obtained by the computer device may be 3, which means that one phoneme corresponds to three states, specifically a first state, a second state and a third state.
The computer device may perform a state alignment process on the audio frame data 400R based on the music tempo data and the phoneme state parameters to obtain a more accurate set of phoneme start and stop times. The state alignment process herein may include an initial alignment process and an adjustment alignment process. The initial alignment processing method may include a first alignment method (e.g., a method of performing initial alignment processing based on direct division) and a second alignment method (e.g., a method of performing initial alignment processing based on pitch).
In the first alignment manner, since the audio frame data 400R includes 15 audio frames and the total number of phonemes here is taken to be 3 (i.e., phoneme 1, phoneme 2 and phoneme 3), the computer device may, when performing initial alignment processing by direct equal division, assign 5 audio frames to each phoneme according to the frame order of the audio frames. For example, the computer device may set the start-stop time of phoneme 1 to the time from the audio frame V1 to the audio frame V5 in the audio frame data 400R, the start-stop time of phoneme 2 to the time from the audio frame V6 to the audio frame V10, and the start-stop time of phoneme 3 to the time from the audio frame V11 to the audio frame V15. Further, the computer device may again roughly average the start-stop time of each phoneme according to the phoneme state parameter (e.g., 3). For example, for phoneme 1, the computer device may assign the first state of phoneme 1 (e.g., state W11) to the audio frames V1 and V2, the second state of phoneme 1 (e.g., state W12) to the audio frames V3 and V4, and the third state of phoneme 1 (e.g., state W13) to the audio frame V5. In the embodiments of the present application, the state of a phoneme may be denoted Wef, where the subscript e is the phoneme number of the phoneme (e.g., phoneme number 1 for phoneme 1) and f represents the state of the phoneme.
Optionally, in order to reduce the number of calculation steps required for the iterative alignment, i.e., to make the alignment data converge faster, the computer device may also perform the initial alignment processing in the second alignment manner. In the music composing process, one pitch usually corresponds to one character, so the number of characters in the music text data corresponding to the music to be identified can be obtained by acquiring the number of pitches. The computer device can therefore use the prior information of the pitches during music recognition for the initial alignment processing, i.e., the computer device can directly use the pitches in the music rhythm data to locate the start and stop frame numbers corresponding to a single character.
For example, for the pitch Y1 (which, e.g., includes 2 phonemes), the computer device may determine the pitch start-stop frame numbers of the pitch Y1 in the audio frame data 400R (e.g., from the audio frame V1 to the audio frame V8), and may determine, based on the phoneme state parameter and the number of phonemes corresponding to the pitch Y1, the start-stop time corresponding to each phoneme in the pitch Y1 from the pitch start-stop frame numbers. The computer device may determine that the start-stop time of one of the phonemes in the pitch Y1 (e.g., phoneme 1) is the time from the audio frame V1 to the audio frame V4, and may assign the first state of phoneme 1 (e.g., state W11) to the audio frame V1, the second state of phoneme 1 (e.g., state W12) to the audio frames V2 and V3, and the third state of phoneme 1 (e.g., state W13) to the audio frame V4. In addition, the computer device may determine that the start-stop time of the other phoneme in the pitch Y1 (e.g., phoneme 2) is the time from the audio frame V5 to the audio frame V8, and may assign the first state of phoneme 2 (e.g., state W21) to the audio frame V5, the second state of phoneme 2 (e.g., state W22) to the audio frames V6 and V7, and the third state of phoneme 2 (e.g., state W23) to the audio frame V8.
Similarly, for the pitch Y2 (which, e.g., includes 1 phoneme, namely phoneme 3), the computer device may determine the pitch start-stop frame numbers of the pitch Y2 in the audio frame data 400R (e.g., from the audio frame V9 to the audio frame V15), and may determine, based on the phoneme state parameter and the number of phonemes corresponding to the pitch Y2, the start-stop time corresponding to phoneme 3 from the pitch start-stop frame numbers. The computer device may determine that the start-stop time of phoneme 3 is the time from the audio frame V9 to the audio frame V15, and may assign the first state of phoneme 3 (e.g., state W31) to the audio frames V9, V10 and V11, the second state of phoneme 3 (e.g., state W32) to the audio frames V12, V13 and V14, and the third state of phoneme 3 (e.g., state W33) to the audio frame V15. Finally, the computer device may obtain the alignment data 401S (i.e., the first alignment data) corresponding to the audio frame data 400R based on the start-stop times corresponding to the three phonemes. In the embodiments of the present application, the start-stop time of a phoneme in the first alignment data may be referred to as a first start-stop time.
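The pitch-prior initial alignment described above can be sketched as follows; the frame spans, phoneme counts and the even splitting strategy are assumptions for illustration and only loosely mirror the FIG. 4 walk-through.

```python
# Minimal sketch of the pitch-prior initial alignment: the frames covered by one
# pitch are split evenly over its phonemes, and each phoneme's frames are split
# evenly over its states. Frame counts and splits are illustrative only.
def split_evenly(items, n_parts):
    """Split a list into n_parts contiguous chunks of (near-)equal length."""
    base = max(len(items) // n_parts, 1)
    chunks = [items[i * base:(i + 1) * base] for i in range(n_parts - 1)]
    chunks.append(items[(n_parts - 1) * base:])
    return chunks

def initial_alignment(pitch_spans, phonemes_per_pitch, num_states=3):
    """pitch_spans: (start_frame, end_frame_exclusive) per pitch, from the rhythm data.
    Returns, per phoneme, its per-state frame lists (the first alignment data)."""
    alignment = []
    for (start, end), n_phonemes in zip(pitch_spans, phonemes_per_pitch):
        for phoneme_frames in split_evenly(list(range(start, end)), n_phonemes):
            alignment.append(split_evenly(phoneme_frames, num_states))
    return alignment

# Pitch Y1 covers frames 0..7 with 2 phonemes; pitch Y2 covers frames 8..14 with
# 1 phoneme, loosely mirroring the FIG. 4 walk-through.
print(initial_alignment([(0, 8), (8, 15)], [2, 1]))
```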
Further, in order to obtain more accurate alignment data, the computer device may perform adjustment alignment processing on the alignment data 401S. The adjustment alignment processing may use a third alignment manner or a fourth alignment manner: for example, the third alignment manner may be a hard alignment (e.g., the Viterbi algorithm) and the fourth alignment manner may be a soft alignment (e.g., the forward-backward algorithm).
For example, the computer device may obtain the state comprehensive probability corresponding to each audio frame based on the alignment data 401S. The state comprehensive probability of an audio frame is jointly determined by the state transition probability of the audio frame and the state emission probability (e.g., mean and variance) of the audio frame. In the embodiment of the present application, since the audio frame V1 is the first frame of the audio frame data 400R, the computer device may keep the original state of the audio frame V1 (e.g., state W11).
For the audio frame V2, the computer device may acquire the state transition probability of transitioning from the state W11 of the audio frame V1 to the state W11, and the state transition probability of transitioning from the state W11 of the audio frame V1 to the state W12, and may then take the larger of these two state transition probabilities as the state transition probability of the audio frame V2. In addition, the computer device also needs to acquire the state emission probability of the audio frame V2. For example, the computer device may obtain a state emission probability for each state of phoneme 1. Taking the second state (e.g., state W12) as an example: since in the alignment data 401S the state W12 of phoneme 1 is assigned to the audio frames V2 and V3, the computer device performs feature extraction on the audio frame V2 to obtain the audio frame feature (4, 3) of the audio frame V2, and performs feature extraction on the audio frame V3 to obtain the audio frame feature (4, 7) of the audio frame V3. The audio frame feature here may be a Mel-Frequency Cepstral Coefficients (MFCC) feature. Based on this, the computer device can determine from the audio frame features of the two audio frames that the mean of the state W12 of phoneme 1 is (4, 5) and the variance of the state W12 of phoneme 1 is (0, 8). The other states of phoneme 1 are handled similarly and are not described again. Further, the computer device may obtain the state comprehensive probability of the audio frame V2 based on the state transition probability of the audio frame V2 and the state emission probabilities corresponding to the states of phoneme 1.
It will be appreciated that, with reference to the above description of determining the state comprehensive probability of the audio frame V2, the computer device may obtain the state comprehensive probability corresponding to each audio frame, and may then perform adjustment alignment processing on the alignment data 401S through the 15 state comprehensive probabilities, so as to obtain the alignment data 402S (i.e., the second alignment data) shown in fig. 4. At this time, the computer device may obtain the start-stop time corresponding to each of the three phonemes from the alignment data 402S; for example, in the alignment data 402S, the start-stop time of phoneme 1 is the time from the audio frame V1 to the audio frame V5, the start-stop time of phoneme 2 is the time from the audio frame V6 to the audio frame V8, and the start-stop time of phoneme 3 is the time from the audio frame V9 to the audio frame V15. The computer device may then obtain the phoneme start-stop time set associated with the three phonemes based on the start-stop times corresponding to the three phonemes. In the embodiments of the present application, the start-stop time of a phoneme in the second alignment data may be referred to as a second start-stop time.
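The adjustment alignment can be sketched in simplified form as follows: each frame's state comprehensive probability combines a state transition term with a Gaussian emission term built from the per-state mean and variance. A real system would run the Viterbi or forward-backward algorithm over all frames; this greedy per-frame pass, with invented feature values and probabilities, is only an illustration.

```python
# Simplified, illustrative stand-in for the adjustment alignment; the state
# means/variances and transition probabilities below are invented values.
import math

def log_gaussian(x, mean, var):
    var = max(var, 1e-3)  # avoid division by zero for degenerate states
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def adjust_alignment(frame_features, states, log_trans):
    """frame_features: one scalar feature per audio frame (e.g., an MFCC coefficient).
    states: {state_name: (mean, var)}; log_trans: {(prev_state, state): log prob}.
    Returns the state chosen for each frame (the second alignment data)."""
    names = list(states)
    path = [names[0]]                      # keep the original state of the first frame
    for x in frame_features[1:]:
        prev = path[-1]
        scores = {s: log_trans.get((prev, s), -1e9) + log_gaussian(x, *states[s])
                  for s in names}           # state comprehensive (composite) score
        path.append(max(scores, key=scores.get))
    return path

states = {"W11": (4.0, 1.0), "W12": (4.5, 0.8), "W13": (5.0, 1.0)}
log_trans = {("W11", "W11"): math.log(0.6), ("W11", "W12"): math.log(0.4),
             ("W12", "W12"): math.log(0.5), ("W12", "W13"): math.log(0.5),
             ("W13", "W13"): math.log(1.0)}
print(adjust_alignment([3.9, 4.1, 4.6, 5.2, 5.1], states, log_trans))
```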
Step S103, determining musical acoustic feature probabilities of musical phoneme sequences associated with the N phonemes based on musical acoustic features corresponding to the musical audio frame data and the phoneme start-stop time set.
Wherein the musical acoustic feature probabilities are determined based on a business acoustic model in the musical audio recognition model, the business acoustic model comprising a first sub-model (e.g., a frame-to-mono classification model) and a second sub-model (e.g., a phoneme transition probability model). The first sub-model is used for classifying the acoustic features of the music to identify phonemes corresponding to the acoustic features. The second sub-model is used for performing phoneme conversion processing on phonemes corresponding to the phoneme start-stop time set based on the phoneme start-stop time set. Specifically, the computer device may perform feature extraction processing on the music audio frame data based on the phoneme start-stop time set to obtain music acoustic features (also called frame-to-single-phoneme data). In addition, the computer device may input the musical acoustic feature to the first sub-model, and perform a phoneme recognition process on the musical acoustic feature by using the first sub-model to obtain a phoneme sequence probability corresponding to the initial phoneme sequence. Further, the computer device may determine a phoneme transition probability corresponding to the initial sequence of phonemes based on the set of phoneme start and stop times and the second sub-model. The computer device may then perform a phoneme conversion process on the initial phoneme sequence based on the phoneme conversion probability to obtain a musical phoneme sequence associated with the N phonemes. Finally, the computer device may determine musical acoustic feature probabilities for the musical phoneme sequence based on the phoneme sequence probabilities for the musical phoneme sequence and the phoneme transition probabilities.
It will be appreciated that when the computer device acquires the initial phoneme sequence, the musical acoustic feature needs to be input into the first submodel, so that O initial phoneme sequences and phoneme sequence probabilities corresponding to the O initial phoneme sequences respectively can be obtained, where O is a positive integer. Further, the computer device may input the O initial phoneme sequences to the second sub-model, respectively, and further determine a phoneme transition probability corresponding to each initial phoneme sequence through the second sub-model. At this time, the computer device may determine a sequence screening probability corresponding to each initial phoneme sequence based on a phoneme sequence probability corresponding to each of the O initial phoneme sequences and a phoneme transition probability corresponding to each of the O initial phoneme sequences. The computer device may then determine, from the O sequence screening probabilities, an initial phoneme sequence corresponding to a maximum sequence screening probability as a musical phoneme sequence and determine the maximum sequence screening probability as a musical acoustic feature probability of the musical phoneme sequence.
For example, O may be taken as 2 here, and may specifically include an initial phoneme sequence 1 (e.g., the phoneme sequence "[H][AO][VN]") and an initial phoneme sequence 2 (e.g., the phoneme sequence "[H][AI][VN]"). Through the first sub-model, the phoneme sequence probability corresponding to the initial phoneme sequence 1 obtained by the computer device may be 60%, and the phoneme sequence probability corresponding to the initial phoneme sequence 2 may be 40%. Further, through the second sub-model, the phoneme transition probability corresponding to the initial phoneme sequence 1 obtained by the computer device may be 80%, and the phoneme transition probability corresponding to the initial phoneme sequence 2 may be 20%. At this time, the computer device may determine the sequence screening probability of the initial phoneme sequence 1 based on the phoneme sequence probability and the phoneme transition probability corresponding to the initial phoneme sequence 1, and determine the sequence screening probability of the initial phoneme sequence 2 based on the phoneme sequence probability and the phoneme transition probability corresponding to the initial phoneme sequence 2. Further, the computer device may select, from the initial phoneme sequences, the initial phoneme sequence corresponding to the maximum sequence screening probability (e.g., the initial phoneme sequence 1) as the music phoneme sequence; that is, the sequence screening probability corresponding to the initial phoneme sequence 1 is the musical acoustic feature probability in this embodiment of the present application.
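The selection logic of this example can be summarized by the short sketch below; combining the two probabilities by multiplication is an assumption that is merely consistent with the illustrative figures above, and the numbers are those of the example.

# Combine the phoneme sequence probability (first sub-model) with the phoneme
# transition probability (second sub-model) and keep the sequence with the
# largest combined score as the music phoneme sequence.
candidates = {
    "[H][AO][VN]": {"sequence_prob": 0.60, "transition_prob": 0.80},  # initial phoneme sequence 1
    "[H][AI][VN]": {"sequence_prob": 0.40, "transition_prob": 0.20},  # initial phoneme sequence 2
}

def screening_probability(entry):
    # Assumed combination rule: product of the two probabilities.
    return entry["sequence_prob"] * entry["transition_prob"]

best = max(candidates, key=lambda k: screening_probability(candidates[k]))
music_acoustic_feature_prob = screening_probability(candidates[best])
print(best, music_acoustic_feature_prob)  # [H][AO][VN] 0.48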
Step S104, based on dictionary data of music to be identified, M candidate texts corresponding to the music phoneme sequences are obtained, and based on the acoustic feature probability of the music and the text sequence probability respectively corresponding to the M candidate texts, music text data corresponding to the music to be identified is determined from the M candidate texts.
Each candidate text is a character combination selected based on a topological path between characters in a character topological graph formed by characters in dictionary data, and corresponds to P characters, wherein P refers to the number of pitches included in music rhythm data, and P is a positive integer. Specifically, when the computer device obtains dictionary data for music to be identified from the music audio recognition model, M candidate texts corresponding to the music phoneme sequence may be obtained based on the dictionary data, where M is a positive integer. Further, the computer device inputs the M candidate texts into a business language model in the music audio recognition model respectively, and the business language model outputs text sequence probabilities corresponding to the M candidate texts respectively. Then, the computer device may obtain text matching probabilities corresponding to the M candidate texts based on the acoustic feature probabilities of the music and the text sequence probabilities corresponding to the M candidate texts, and further may obtain a highest text matching probability from the M text matching probabilities, and use the candidate text corresponding to the highest text matching probability as music text data corresponding to the music to be identified.
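As a hedged sketch of the candidate-text generation just described, the inverted dictionary below maps a per-pitch phoneme string to the characters it can spell, and every path through the resulting character choices yields one candidate text; the dictionary entries and character choices are hypothetical.

from itertools import product

# Hypothetical inverted dictionary: phoneme string -> candidate characters.
inverted_dictionary = {
    "HAO3": ["好"],
    "VN1": ["晕", "运"],
}

def enumerate_candidate_texts(phoneme_strings):
    """Each topological path through the per-pitch character choices is one candidate text."""
    choices = [inverted_dictionary.get(p, ["?"]) for p in phoneme_strings]
    return ["".join(path) for path in product(*choices)]

# Two pitches, so every candidate text has P = 2 characters.
print(enumerate_candidate_texts(["HAO3", "VN1"]))  # ['好晕', '好运']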
Specifically, the text matching probability (i.e., K) of the candidate text may be obtained as shown in formula (1):
K = arg max_W P(X|W)P(W)    (1)
where P(X|W) may be used to represent the musical acoustic feature probability, i.e., the probability of observing the music X to be identified given the candidate text W (in other words, the likelihood that this candidate sentence was actually sung), and P(W) may represent the text sequence probability of the candidate text W.
As shown in FIG. 2, the candidate texts determined by the computer device may include a candidate text 2S71 (e.g., "good vignetting") and a candidate text 2S72 (e.g., "good fortune"). Based on this, the computer device may determine the text matching probability of the candidate text 2S71 based on the acoustic feature probability 2G11 (i.e., the musical acoustic feature probability) and the text sequence probability 2G21 corresponding to the candidate text 2S71. Similarly, the computer device may determine the text matching probability of the candidate text 2S72 based on the acoustic feature probability 2G11 and the text sequence probability 2G22 corresponding to the candidate text 2S72. In turn, the candidate text with the highest text matching probability (e.g., the candidate text 2S71) may be selected from the two text matching probabilities as the music text data corresponding to the music 2S1 to be identified.
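The decision in formula (1) then reduces to the sketch below; the probability values stand in for 2G11, 2G21, and 2G22 and are illustrative assumptions only.

# Pick the candidate text W that maximizes P(X|W) * P(W), as in formula (1).
acoustic_feature_prob = 0.48                 # P(X|W): the musical acoustic feature probability
text_sequence_probs = {                      # P(W): business language model outputs (hypothetical)
    "good vignetting": 0.70,                 # candidate text 2S71
    "good fortune": 0.30,                    # candidate text 2S72
}

text_match_probs = {w: acoustic_feature_prob * p for w, p in text_sequence_probs.items()}
music_text_data = max(text_match_probs, key=text_match_probs.get)
print(music_text_data, text_match_probs[music_text_data])  # good vignetting 0.336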
In the embodiment of the present application, when performing music recognition, since the music to be recognized often includes background music, in order to improve accuracy of subsequent audio recognition, the computer device needs to strip the background music from the music to be recognized, so as to obtain music dry sound data. Furthermore, the computer equipment not only needs to acquire the music audio frame data from the music dry sound data, but also needs to acquire the music rhythm data from the music dry sound data, so that when the subsequent state alignment processing is carried out on the music audio frame data, a more accurate phoneme starting and ending time set can be obtained, and the accuracy of audio recognition is further improved. In addition, the computer equipment performs audio recognition processing through the music audio recognition model, and when the number of music to be recognized is large, the recognition time can be shortened, so that the efficiency of audio recognition is improved.
Further, referring to fig. 5, fig. 5 is a flow chart of a data processing method according to an embodiment of the present application. As shown in fig. 5, the method may be performed by a computer device, which may be any one of the terminal devices in the terminal cluster shown in fig. 1, for example, the terminal device 100a, or may be the server 10F shown in fig. 1, which is not limited herein. The data processing method may include at least the following steps S201 to S210:
step S201, acquiring music dry sound data in the music to be identified, and extracting music rhythm data and music audio frame data in the music dry sound data respectively.
Step S202, performing state alignment processing on the music audio frame data based on the music rhythm data and the phoneme state parameters, to obtain a phoneme start-stop time set associated with N phonemes.
Step S203, determining musical acoustic feature probabilities of musical phoneme sequences associated with the N phonemes based on musical acoustic features corresponding to the musical audio frame data and the phoneme start-stop time set.
Step S204, based on dictionary data for music to be identified, M candidate texts corresponding to the music phoneme sequences are acquired, and based on the acoustic feature probabilities of the music and the text sequence probabilities respectively corresponding to the M candidate texts, music text data corresponding to the music to be identified are determined from the M candidate texts.
The data processing method in the embodiment of the application can comprise a model training process and a model application process. It can be understood that the steps S201 to S204 illustrate a model application process, and the detailed implementation of the model application process can be referred to the description of the steps S101 to S104 in the embodiment corresponding to fig. 3, which will not be repeated here.
The model training process may be specifically described in the following steps S205 to S210.
Step S205, when sample data including sample audio data and sample text data is acquired, extracting sample rhythm data, sample audio frame data, and sample pitch data respectively from sample dry sound data in the sample audio data.
Specifically, the computer device may obtain sample data including sample audio data and sample text data, and further may perform a sound source separation process on the sample audio data based on the obtained initial audio recognition model, to obtain sample dry sound data of pure human voice. Further, the computer device may extract sample tempo data, sample audio frame data and sample pitch data, respectively, from the sample dry sound data. The sample rhythm data may be data of rhythm information for sample dry sound data; the sample audio frame data may be data obtained by performing framing processing on sample dry sound data; the sample pitch data may be a fundamental audio portion (e.g., F0 data) of sample dry sound data, where the sample pitch data refers to a sine wave having the lowest frequency obtained by frequency decomposing the sample dry sound data. In the embodiment of the present application, for the specific step of acquiring the sample rhythm data and the sample audio frame data, please refer to the step of acquiring the music rhythm data and the music audio frame data in fig. 3, and the detailed description thereof will not be repeated here.
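As a rough, hedged illustration of how sample audio frame data and sample pitch data (F0) could be extracted from dry vocal audio, the sketch below uses the open-source librosa library; the file name, sampling rate, frame sizes, and pitch range are assumptions, and this embodiment is not tied to this library or these settings.

import librosa
import numpy as np

# Hypothetical input file holding separated dry vocals (sample dry sound data).
y, sr = librosa.load("sample_dry_vocals.wav", sr=16000)

# Sample audio frame data: frame the waveform (assumed 25 ms windows, 10 ms hop).
frame_length, hop_length = int(0.025 * sr), int(0.010 * sr)
audio_frames = librosa.util.frame(y, frame_length=frame_length, hop_length=hop_length)

# Sample pitch data: the F0 track, i.e. the lowest-frequency sine component of the voice.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)
print(audio_frames.shape, np.nanmean(f0))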
It should be appreciated that the sample text data here may include original text data and lyric text data, and the sample audio data may include audio data of various service types (e.g., songs, operas, and other types of musical performances) and of various language types (e.g., Chinese, English, French, and other languages). The sample audio data carries a sample tag; the sample tag may be used to characterize the actual text data (i.e., the actual lyric data) corresponding to the sample audio data.
Step S206, dictionary data in the initial audio recognition model is obtained, and phoneme conversion processing is carried out based on actual text data, dictionary data and sample pitch data to obtain a sample phoneme string;
specifically, the computer device may obtain the dictionary data in the initial audio recognition model, perform phoneme conversion processing on the actual text data based on the dictionary data, and determine an initial phoneme string corresponding to the sample pitch data. The initial phoneme string may carry a first tone. The computer device may also obtain a pitch modification rule matching the audio type of the sample audio data, and determine, in the pitch modification rule, a pitch modification parameter corresponding to the initial phoneme string based on the pitch frequency interval to which the initial phoneme string belongs. Further, the computer device may change the first tone to a second tone based on the pitch modification parameter, and determine the initial phoneme string with the second tone as the sample phoneme string.
Further, referring to Table 1, Table 1 is an example table for performing phoneme conversion processing based on dictionary data provided in an embodiment of the present application. Table 1 may include a text data column, a text conversion string column, and an initial phoneme string column. Here, the text conversion string column indicates the character string into which the text data is converted, for example, the pinyin corresponding to a Chinese character or the phonetic symbols corresponding to an English word. It will be appreciated that one text may correspond to one text conversion string or to multiple text conversion strings (i.e., the text is a polyphone); for example, text 1 in Table 1 may correspond to text conversion strings including "hao3" and "hao4". One text conversion string here corresponds to one initial phoneme string. As shown in Table 1:
TABLE 1
The pitch modification rule obtained by the computer device is used for indicating the correspondence between pitch data in different pitch frequency intervals and pitch changes. The pitch modification rule may be obtained by the computer device through statistics over a large number of audio data sets, or may be pre-configured for the audio type of the audio data, which is not limited here. For ease of understanding, further referring to Table 2, Table 2 is a pitch modification rule table matching an audio type provided in an embodiment of the present application. The pitch modification rule table may include a plurality of pitch frequency intervals, and each pitch frequency interval may correspond to a different pitch modification parameter. As shown in Table 2:
TABLE 2
Pitch frequency interval | Pitch modification parameter
Interval 1: (T0, T1] | Lower by Z1 tones, down to the lowest tone at most
Interval 2: (T1, T2] | Keep the original tone
Interval 3: greater than T2 | Raise by Z2 tones, up to the highest tone at most
The pitch frequency intervals in the pitch modification rule table shown in Table 2 are exemplified by 3 pitch frequency intervals, where T0 may be a lower-limit frequency (e.g., 0 Hz), T1 may be a first boundary frequency (e.g., 300 Hz), and T2 may be a second boundary frequency (e.g., 600 Hz). Here, T1 and T2 may be dynamically adjusted by the computer device according to actual service requirements, which is not limited here. It will be appreciated that if the audio type matched by the pitch modification rule table shown in Table 2 includes H tones, Z1 and Z2 here may each be a positive integer less than H.
Further, for a specific embodiment in which the computer device performs phoneme conversion processing based on the actual text data, the dictionary data, and the sample pitch data to obtain the sample phoneme string, reference may be made to Table 3 below. Table 3 is an example table for determining a sample phoneme string provided in an embodiment of the present application. As shown in Table 3:
TABLE 3
In Table 3, the modification parameter corresponding to interval 1 may be lowering by 2 tones, down to the lowest tone at most; the modification parameter corresponding to interval 2 may be keeping the original tone; and the modification parameter corresponding to interval 3 may be raising by 3 tones, up to the highest tone at most. For example, the audio type corresponding to the sample audio data includes 4 tones, which may specifically include tone 1, tone 2, tone 3, and tone 4, where tone 1 is higher than tone 2, tone 2 is higher than tone 3, and tone 3 is higher than tone 4; that is, tone 1 is the highest tone and tone 4 is the lowest tone.
The embodiment of the present application may take the actual text data "halo" as an example to illustrate a specific implementation of determining the sample phoneme string. For example, the computer device may perform phoneme conversion processing on the text data "halo" based on the dictionary data to obtain the initial phoneme string shown in Table 1 (i.e., "VN1"). When the pitch frequency interval to which the initial phoneme string belongs is interval 1, the computer device may lower the tone of the initial phoneme string by 2 tones; since the original tone is tone 1, the computer device may directly change the tone of the initial phoneme string to tone 3 to obtain the sample phoneme string (i.e., "VN3"). Alternatively, when the pitch frequency interval to which the initial phoneme string belongs is interval 2, the computer device may keep the tone of the initial phoneme string unchanged, i.e., the computer device may directly determine the initial phoneme string as the sample phoneme string (i.e., "VN1"). Alternatively, when the pitch frequency interval to which the initial phoneme string belongs is interval 3, the computer device would raise the tone of the initial phoneme string by 3 tones; however, since the original tone is already tone 1, i.e., already the highest tone, the computer device may directly determine the initial phoneme string with the highest tone as the sample phoneme string (i.e., "VN1").
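The tone-modification rule of Tables 2 and 3 can be sketched as follows; the interval boundaries, the numbering of tones (tone 1 highest, tone 4 lowest), and the Z values follow the illustrative figures above and are assumptions, not normative values.

def modify_tone(initial_tone, mean_f0_hz, t1=300.0, t2=600.0, z1=2, z2=3,
                highest_tone=1, lowest_tone=4):
    """Apply the example pitch-change rule of Tables 2/3.

    Tones are numbered 1 (highest) to 4 (lowest), so lowering a tone increases
    its number and raising a tone decreases it.
    """
    if mean_f0_hz <= t1:                                   # interval 1: lower by z1 tones
        return min(initial_tone + z1, lowest_tone)
    elif mean_f0_hz <= t2:                                 # interval 2: keep the original tone
        return initial_tone
    else:                                                  # interval 3: raise by z2 tones
        return max(initial_tone - z2, highest_tone)

# "VN1" in interval 1 becomes "VN3"; in intervals 2 and 3 it stays "VN1".
print(modify_tone(1, 250.0), modify_tone(1, 450.0), modify_tone(1, 700.0))  # 3 1 1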
Step S207, based on the sample rhythm data and the phoneme state parameters, carrying out state alignment processing on the sample audio frame data to obtain a sample start-stop time set associated with the sample phoneme string;
specifically, the computer device may perform initial alignment processing on the sample audio frame data through the number of pitches indicated by the sample rhythm data and the phoneme state parameters, so as to obtain first sample alignment data. The first sample alignment data here may be used to indicate a first start time of each phoneme in the sample phoneme string. The computer device may acquire the state comprehensive probability corresponding to each audio frame in the sample audio frame data through the first sample alignment data, and may then perform adjustment alignment processing on the first sample alignment data to obtain second sample alignment data, until the second sample alignment data after the adjustment alignment processing converges. The computer device may further obtain a second start-stop time corresponding to each phoneme in the sample phoneme string from the converged second sample alignment data, so as to obtain a sample start-stop time set associated with the sample phoneme string. For a specific implementation of determining the sample start-stop time set in this embodiment of the present application, reference may be made to the description of the phoneme start-stop time set in step S102 in the embodiment corresponding to fig. 3, which will not be further described herein.
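The embodiment does not disclose the exact realignment algorithm, so the following is only an illustrative sketch of one common choice, a Viterbi-style left-to-right forced alignment, which turns per-frame probabilities into an adjusted frame-to-phoneme assignment; the transition probabilities and emission scores are assumed values.

import numpy as np

def viterbi_forced_align(emission_log_probs, n_phonemes):
    """Left-to-right forced alignment of Q frames over n_phonemes phoneme states.

    emission_log_probs: array of shape (Q, n_phonemes) holding log P(frame | phoneme).
    Returns one phoneme index per frame.
    """
    Q = emission_log_probs.shape[0]
    stay, advance = np.log(0.7), np.log(0.3)               # assumed transition log-probabilities
    score = np.full((Q, n_phonemes), -np.inf)
    back = np.zeros((Q, n_phonemes), dtype=int)
    score[0, 0] = emission_log_probs[0, 0]                 # alignment must start in the first phoneme
    for t in range(1, Q):
        for s in range(n_phonemes):
            options = [(score[t - 1, s] + stay, s)]
            if s > 0:
                options.append((score[t - 1, s - 1] + advance, s - 1))
            best_score, back[t, s] = max(options)
            score[t, s] = best_score + emission_log_probs[t, s]
    path = [n_phonemes - 1]                                # alignment must end in the last phoneme
    for t in range(Q - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

emissions = np.log(np.random.dirichlet(np.ones(3), size=15))  # hypothetical 15 frames, 3 phonemes
print(viterbi_forced_align(emissions, 3))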
It can be appreciated that, in this embodiment of the present application, the phoneme partition included in the sample phoneme string may be dynamically selected according to the recognition accuracy. For example, for the sample phoneme string "HAO3", when the recognition accuracy is a first accuracy (e.g., the highest accuracy), the computer device may divide it into 3 phonemes, e.g., the phoneme "H", the phoneme "A", and the phoneme "O". For another example, when the recognition accuracy is a second accuracy (e.g., an intermediate accuracy), the computer device may divide this sample phoneme string into 2 phonemes, e.g., the phoneme "H" and the phoneme "AO". For another example, when the recognition accuracy is a third accuracy (e.g., the lowest accuracy), the computer device may divide this sample phoneme string into 1 phoneme, e.g., directly into the phoneme "HAO".
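The accuracy-dependent phoneme division described here can be pictured with the toy lookup below; the split tables are hypothetical and merely echo the "HAO3" example.

# Hypothetical phoneme division tables keyed by recognition accuracy level.
PHONEME_SPLITS = {
    "high": {"HAO3": ["H", "A", "O"]},
    "medium": {"HAO3": ["H", "AO"]},
    "low": {"HAO3": ["HAO"]},
}

def split_phoneme_string(phoneme_string, accuracy):
    return PHONEME_SPLITS[accuracy].get(phoneme_string, [phoneme_string])

for level in ("high", "medium", "low"):
    print(level, split_phoneme_string("HAO3", level))
# high ['H', 'A', 'O'] / medium ['H', 'AO'] / low ['HAO']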
Step S208, determining sample acoustic feature probabilities of sample phoneme sequences associated with the sample phoneme strings based on the sample acoustic features corresponding to the sample audio frame data and the sample start-stop time set;
specifically, the computer device may perform feature extraction processing on the sample audio frame data based on the sample start-stop time set, so as to obtain the acoustic feature of the sample. In addition, the computer device may input the sample acoustic feature to an initial acoustic model in the initial audio recognition model, and perform phoneme recognition processing on the sample acoustic feature by using the initial acoustic model to obtain a sample phoneme sequence probability corresponding to the sample initial phoneme sequence. Further, the computer device may determine a sample phoneme transition probability corresponding to the sample initial phoneme sequence based on the sample start-stop time set and the initial acoustic model. The computer device may then perform a phoneme conversion process on the sample initial phoneme sequence based on the sample phoneme conversion probability to obtain a sample phoneme sequence associated with the sample phoneme string. Finally, the computer device may determine a sample acoustic feature probability for the sample phoneme sequence based on the sample phoneme sequence probability for the sample phoneme sequence and the sample phoneme transition probability. For a specific implementation of determining the probability of the acoustic feature of the sample in the embodiment of fig. 3, reference may be made to the description of the probability of the acoustic feature of music in step S103 in the embodiment corresponding to fig. 3, which will not be further described herein.
Step S209, obtaining predicted text data corresponding to the sample phoneme sequence based on the sample text data, the dictionary data and the sample acoustic feature probability of the sample phoneme sequence.
In particular, when the dictionary data is obtained from the initial audio recognition model, the computer device may obtain a plurality of candidate texts corresponding to the sample phoneme sequence based on the dictionary data. Further, the computer device inputs the plurality of candidate texts into an initial language model in the initial audio recognition model respectively, and the initial language model outputs sample text sequence probabilities corresponding to the plurality of candidate texts respectively. Then, the computer device may obtain text matching probabilities corresponding to the plurality of candidate texts based on the sample acoustic feature probabilities and sample text sequence probabilities corresponding to the plurality of candidate texts, respectively, and may further obtain a highest text matching probability from the plurality of text matching probabilities, and use the candidate text corresponding to the highest text matching probability as predicted text data corresponding to the sample phoneme sequence. For the specific implementation of determining the predicted text data in the embodiment of the present application, reference may be made to the description of the music text data in step S104 in the embodiment corresponding to fig. 3, and the description will not be repeated here.
Step S210, training the initial audio recognition model based on the sample text data, the actual text data and the predicted text data to obtain a music audio recognition model.
Specifically, the computer device may determine a first model loss of an initial language model in the initial audio recognition model based on the original text data and the lyric text data in the sample text data, and further train the initial language model based on the first model loss to obtain the business language model. Meanwhile, the computer equipment can also determine second model loss of the initial acoustic model in the initial audio recognition model based on the actual text data and the predicted text data, and further can train the initial acoustic model based on the second model loss to obtain a business acoustic model. Further, the computer device may take an initial audio recognition model including a business language model and a business acoustic model as the music audio recognition model. The music audio recognition model is used for predicting music text data of music to be recognized.
It will be appreciated that the computer device may also obtain a first model convergence condition associated with the initial language model and a second model convergence condition associated with the initial acoustic model. Here, each model convergence condition may be that the model loss does not continue to drop for N (e.g., 10) consecutive rounds, in which case model training is stopped. Alternatively, each model convergence condition may be that the model loss is less than a loss threshold in the model convergence condition, in which case model training is stopped. This is not limited here.
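Both convergence conditions mentioned here can be checked with a small helper such as the one below; the patience of 10 rounds and the loss threshold are example values only.

def has_converged(loss_history, patience=10, loss_threshold=None):
    """Return True once training should stop.

    Condition 1: the loss has not dropped for `patience` consecutive rounds.
    Condition 2 (optional): the latest loss is already below `loss_threshold`.
    """
    if loss_threshold is not None and loss_history and loss_history[-1] < loss_threshold:
        return True
    if len(loss_history) <= patience:
        return False
    best_before = min(loss_history[:-patience])
    return min(loss_history[-patience:]) >= best_before

losses = [2.1, 1.7, 1.4, 1.3] + [1.3] * 10
print(has_converged(losses, patience=10))  # True: no improvement over the last 10 rounds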
When the computer device trains the initial language model, the original text data and the lyric text data may be input into the initial language model together to train the initial language model. Optionally, in order to increase the convergence speed of the model and reduce the number of training rounds, when the amount of the original text data is far greater than that of the lyric text data, the computer device may first input the original text data into the initial language model for training, then input the lyric text data into the trained initial language model for fine-tuning training, and take the fine-tuned model as the business language model when it meets the first model convergence condition.
The computer device trains the initial acoustic model based on the second model loss, and a model training result can be obtained. And if the model training result indicates that the trained initial acoustic model meets the second model convergence condition, taking the initial acoustic model meeting the second model convergence condition as a business acoustic model. Optionally, if the model training result indicates that the initial acoustic model after the iterative training does not meet the second model convergence condition, the computer device may adjust model parameters of the initial acoustic model based on a model loss function that does not meet the second model convergence condition. Further, the computer device may train the transition acoustic model with the initial acoustic model after the model parameters are adjusted as the transition acoustic model, and use the transition acoustic model meeting the second model convergence condition as the business acoustic model until the trained transition acoustic model meets the second model convergence condition.
Further, referring to fig. 6, fig. 6 is a schematic view of a scenario for model training according to an embodiment of the present application. As shown in fig. 6, the computer device inputs the audio data 6S1 (i.e., the sample audio data) into the initial audio recognition model 6000A for model training to obtain predicted text data. When the computer device acquires sample data including the audio data 6S1 (i.e., the sample audio data) and the text data 6S7 (i.e., the sample text data), it may extract, from the dry sound data 6S2 (i.e., the sample dry sound data) of the audio data 6S1, the rhythm data 6S3 (i.e., the sample rhythm data), the audio frame data 6S4 (i.e., the sample audio frame data), and the pitch data 6S9 (i.e., the sample pitch data), respectively. The audio data 6S1 carries a sample tag, which may be used to characterize the actual text data corresponding to the audio data 6S1.
Further, the computer device may obtain the dictionary data 6S10 in the initial audio recognition model 6000A, and perform phoneme conversion processing based on the actual text data, the dictionary data 6S10, and the pitch data 6S9 to obtain a phoneme string 6S11 (i.e., the sample phoneme string). The computer device may also perform state alignment processing on the audio frame data 6S4 based on the rhythm data 6S3 and the phoneme state parameters to obtain a start-stop time set 6S5 (i.e., the sample start-stop time set) associated with the phoneme string 6S11. After this, the computer device may determine, based on the acoustic features 6X1 (i.e., the sample acoustic features) corresponding to the audio frame data 6S4 and the start-stop time set 6S5, the sample acoustic feature probability of the sample phoneme sequence associated with the phoneme string 6S11. Furthermore, the computer device may acquire the predicted text data corresponding to the sample phoneme sequence based on the text data 6S7, the dictionary data 6S10, and the sample acoustic feature probability.
Further, the computer device may train the initial audio recognition model 6000A based on the text data 6S7, the actual text data, and the predicted text data, so as to obtain a music audio recognition model (i.e., the audio recognition model 2000A in fig. 2) for predicting the music text data of music to be recognized. It will be appreciated that the process of model training of the initial audio recognition model 6000A by the computer device may be split into two parts, namely separate model training for the language model 600A (i.e., the initial language model) and the acoustic model 600B (i.e., the initial acoustic model).
For example, the text data 6S7 here includes not only original text data in terms of daily conversations and the like, but also lyric text data. The lyric text data may include the actual text data corresponding to the audio data 6S1, and may also include text data corresponding to other music data. Therefore, when training the language model 600A, the computer device may input the original text data and the lyric text data into the language model 600A together for training to obtain a model training result, and further adjust the model parameters of the language model 600A based on the model training result until a business language model satisfying the first model convergence condition is obtained.
At the same time, the computer device may determine the model loss (i.e., the second model loss) of the acoustic model 600B based on the actual text data indicated by the sample tag of the audio data 6S1 and the predicted text data shown in fig. 6, and may then train the acoustic model 600B based on the second model loss to obtain a model training result. If the model training result indicates that the trained acoustic model 600B meets the second model convergence condition, the acoustic model 600B meeting the second model convergence condition is used as the business acoustic model. Alternatively, if the model training result indicates that the iteratively trained acoustic model 600B does not meet the second model convergence condition, the computer device may adjust the model parameters of the acoustic model 600B based on the model loss function that does not meet the second model convergence condition. Further, the computer device may take the acoustic model 600B after the model parameter adjustment as a transition acoustic model and train the transition acoustic model until the trained transition acoustic model meets the second model convergence condition, and then take the transition acoustic model meeting the second model convergence condition as the business acoustic model.
Further, when the business acoustic model and the business language model are obtained through training, the computer device may take the initial audio recognition model 6000A including the business acoustic model and the business language model as the music audio recognition model. In this embodiment of the present application, the sample pitch data is introduced into the training process of the initial audio recognition model, and the range of the sample pitch data is divided into intervals, so that the pitch change for a phoneme can be obtained. In this way, the model training process better fits the pitch change habits in singing; that is, a music audio recognition model that predicts text data more accurately can be obtained through training.
Further, referring to fig. 7, fig. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. The data processing means may be a computer program (comprising program code) running in a computer device, for example the data processing means is an application software; the device can be used for executing corresponding steps in the method provided by the embodiment of the application. As shown in fig. 7, the data processing apparatus 1 may include: a dry sound data acquisition module 11, a music status alignment module 12, a feature probability determination module 13, and a text data determination module 14.
A dry sound data obtaining module 11, configured to obtain dry sound data of music in the music to be identified, and extract music rhythm data and music audio frame data in the dry sound data of music respectively;
a music state alignment module 12, configured to perform state alignment processing on the music audio frame data based on the music rhythm data and the phoneme state parameters, so as to obtain a phoneme start-stop time set associated with the N phonemes; n is a positive integer;
a feature probability determining module 13, configured to determine a musical acoustic feature probability of a musical phoneme sequence associated with the N phonemes based on musical acoustic features corresponding to the musical audio frame data and the phoneme start-stop time set;
the text data determining module 14 is configured to obtain M candidate texts corresponding to the music phoneme sequences based on dictionary data for the music to be identified, and determine music text data corresponding to the music to be identified from the M candidate texts based on the music acoustic feature probabilities and text sequence probabilities respectively corresponding to the M candidate texts; m is a positive integer.
The specific functional implementation manners of the dry sound data obtaining module 11, the music status alignment module 12, the feature probability determining module 13, and the text data determining module 14 may be referred to the steps S101-S104 in the corresponding embodiment of fig. 3, and will not be described herein.
Referring to fig. 7 again, wherein the music rhythm data is composed of P pitches; P is a positive integer less than or equal to N; N is the total number of phonemes corresponding to the P pitches;
the music status alignment module 12 includes:
an initial alignment unit 121, configured to perform initial alignment processing on the music audio frame data based on the P pitches and the phoneme state parameters to obtain first alignment data; the first alignment data is used for indicating a first start time corresponding to each of the N phonemes; the music audio frame data includes an audio frame Vi; i is a positive integer less than or equal to Q; Q is the number of audio frames corresponding to the music audio frame data;
a comprehensive probability acquisition unit 122, configured to acquire, based on the first alignment data, a state comprehensive probability corresponding to the audio frame Vi; the state comprehensive probability is determined by a state transition probability corresponding to the audio frame Vi and a state emission probability corresponding to the audio frame Vi;
an adjustment alignment unit 123, configured to perform adjustment alignment processing on the first alignment data to obtain second alignment data when the state comprehensive probability corresponding to each audio frame is obtained;
the time set obtaining unit 124 is configured to obtain second start and stop times corresponding to each of the N phones from the second alignment data, and obtain a phone start and stop time set associated with the N phones based on the N second start and stop times.
The specific functional implementation manners of the initial alignment unit 121, the comprehensive probability acquisition unit 122, the adjustment alignment unit 123, and the time set acquisition unit 124 may be referred to the step S102 in the corresponding embodiment of fig. 3, and will not be described herein.
Referring to fig. 7 again, the initial alignment unit 121 includes:
a frame number determination subunit 1211, configured to obtain a pitch Yj from the P pitches and determine a pitch start-stop frame number of the pitch Yj in the music audio frame data; j is a positive integer less than or equal to P;
a time determination subunit 1212, configured to determine, based on the phoneme state parameters and the number of phonemes corresponding to the pitch Yj, a first start time corresponding to each phoneme in the pitch Yj from the pitch start-stop frame number;
an alignment data determination subunit 1213, configured to determine first alignment data corresponding to the music audio frame data based on the first start time corresponding to each of the N phonemes associated with the P pitches.
The specific functional implementation manner of the frame number determining subunit 1211, the time determining subunit 1212 and the alignment data determining subunit 1213 may refer to step S102 in the corresponding embodiment of fig. 3, which is not described herein.
Referring again to fig. 7, wherein the musical acoustic feature probability is determined based on the business acoustic model in the music audio recognition model; the business acoustic model comprises a first sub-model and a second sub-model;
the feature probability determination module 13 includes:
a feature extraction unit 131, configured to perform feature extraction processing on the music audio frame data based on the phoneme start-stop time set, so as to obtain a music acoustic feature;
a phoneme recognition unit 132, configured to input the musical acoustic feature to a first sub-model, and perform phoneme recognition processing on the musical acoustic feature by using the first sub-model to obtain a phoneme sequence probability corresponding to the initial phoneme sequence;
a transition probability determining unit 133 for determining a phoneme transition probability corresponding to the initial phoneme sequence based on the phoneme start-stop time set and the second sub-model;
a phoneme conversion unit 134, configured to perform phoneme conversion processing on the initial phoneme sequence based on the phoneme transition probability to obtain a music phoneme sequence associated with the N phonemes;
the feature determining unit 135 is configured to determine a musical acoustic feature probability of the musical phoneme sequence based on the phoneme sequence probability and the phoneme transition probability of the musical phoneme sequence.
The specific functional implementation manners of the feature extraction unit 131, the phoneme recognition unit 132, the transition probability determination unit 133, the phoneme conversion unit 134, and the feature determination unit 135 may be referred to step S103 in the corresponding embodiment of fig. 3, and will not be described herein.
Referring again to fig. 7, the text data determining module 14 includes:
a text obtaining unit 141 for obtaining M candidate texts corresponding to the musical phoneme sequence based on dictionary data when dictionary data for music to be recognized is obtained from the music audio recognition model;
a text input unit 142, configured to input M candidate texts into a service language model in the music audio recognition model, and output text sequence probabilities corresponding to the M candidate texts respectively by the service language model;
a matching probability obtaining unit 143, configured to obtain text matching probabilities corresponding to the M candidate texts based on the musical acoustic feature probabilities and the text sequence probabilities corresponding to the M candidate texts, respectively;
the text determining unit 144 is configured to obtain a highest text matching probability from the M text matching probabilities, and use a candidate text corresponding to the highest text matching probability as music text data corresponding to the music to be identified.
The specific functional implementation manners of the text obtaining unit 141, the text input unit 142, the matching probability obtaining unit 143, and the text determining unit 144 may refer to step S104 in the corresponding embodiment of fig. 3, and are not described herein.
Further, referring to fig. 8, fig. 8 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. The data processing means may be a computer program (comprising program code) running in a computer device, for example the data processing means is an application software; the device can be used for executing corresponding steps in the method provided by the embodiment of the application. As shown in fig. 8, the data processing apparatus 2 may include: a sample audio acquisition module 21, a phoneme string acquisition module 22, a sample state alignment module 23, a sample probability determination module 24, a predicted text acquisition module 25, and a model training module 26.
A sample audio acquisition module 21 for, when sample data including sample audio data and sample text data is acquired, respectively extracting sample rhythm data, sample audio frame data, and sample pitch data from sample dry sound data in the sample audio data; sample audio data carries a sample tag; the sample tag is used for representing actual text data corresponding to the sample audio data;
a phoneme string obtaining module 22, configured to obtain dictionary data in the initial audio recognition model, and perform phoneme conversion processing based on the actual text data, the dictionary data and the sample pitch data to obtain a sample phoneme string;
A sample state alignment module 23, configured to perform state alignment processing on sample audio frame data based on sample rhythm data and phoneme state parameters, so as to obtain a sample start-stop time set associated with a sample phoneme string;
a sample probability determination module 24 for determining a sample acoustic feature probability of a sample phoneme sequence associated with the sample phoneme string based on the sample acoustic feature corresponding to the sample audio frame data and the sample start-stop time set;
a predicted text obtaining module 25, configured to obtain predicted text data corresponding to the sample phoneme sequence based on the sample text data, the dictionary data, and the sample acoustic feature probability of the sample phoneme sequence;
model training module 26 for training the initial audio recognition model based on the sample text data, the actual text data, and the predicted text data to obtain a music audio recognition model; the music audio recognition model is used for predicting music text data of music to be recognized.
The specific functional implementation manners of the sample audio obtaining module 21, the phoneme string obtaining module 22, the sample state alignment module 23, the sample probability determining module 24, the predicted text obtaining module 25, and the model training module 26 may be referred to the steps S205 to S210 in the corresponding embodiment of fig. 5, and will not be described herein again.
Wherein the phoneme string acquisition module 22 comprises:
a phoneme string determining unit 221 for acquiring dictionary data in the initial audio recognition model, performing phoneme conversion processing on the actual text data based on the dictionary data, and determining an initial phoneme string corresponding to the sample pitch data; the initial phoneme string carries a first tone;
a parameter determining unit 222, configured to obtain a pitch modification rule that matches an audio type of the sample audio data, and determine a pitch modification parameter corresponding to the initial phoneme string in the pitch modification rule based on a pitch frequency interval to which the initial phoneme string belongs;
the tone changing unit 223 is configured to change the first tone to the second tone based on the tone changing parameter, and determine the initial phoneme string having the second tone as the sample phoneme string.
The specific functional implementation manner of the phoneme string determining unit 221, the parameter determining unit 222 and the tone changing unit 223 may refer to step S206 in the corresponding embodiment of fig. 5, which is not described herein. In addition, the description of the beneficial effects of the same method is omitted.
Wherein the sample text data comprises original text data and lyric text data;
The model training module 26 includes:
the language model obtaining unit 261 is configured to determine a first model loss of an initial language model in the initial audio recognition model based on the original text data and the lyric text data, and train the initial language model based on the first model loss to obtain a business language model;
an acoustic model obtaining unit 262, configured to determine a second model loss of the initial acoustic model in the initial audio recognition model based on the actual text data and the predicted text data, and train the initial acoustic model based on the second model loss, so as to obtain a service acoustic model;
the music model determining unit 263 is configured to take an initial audio recognition model including a service language model and a service acoustic model as a music audio recognition model.
The specific functional implementation manner of the language model obtaining unit 261, the acoustic model obtaining unit 262 and the music model determining unit 263 may refer to step S210 in the corresponding embodiment of fig. 5, and will not be described herein. In addition, the description of the beneficial effects of the same method is omitted.
Further, referring to fig. 9, fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 9, the computer device 1000 may include: at least one processor 1001, such as a CPU, at least one network interface 1004, a user interface 1003, a memory 1005, at least one communication bus 1002. Wherein the communication bus 1002 is used to enable connected communication between these components. The user interface 1003 may include a Display (Display), a Keyboard (Keyboard), and the network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), among others. The memory 1005 may be a high-speed RAM memory or a non-volatile memory (non-volatile memory), such as at least one disk memory. The memory 1005 may also optionally be at least one storage device located remotely from the aforementioned processor 1001. As shown in fig. 9, the memory 1005, which is one type of computer storage medium, may include an operating system, a network communication module, a user interface module, and a device control application.
In the computer device 1000 shown in fig. 9, the network interface 1004 may provide network communication functions; while user interface 1003 is primarily used as an interface for providing input to a user; and the processor 1001 may be used to invoke a device control application stored in the memory 1005 to implement:
acquiring music dry sound data in music to be identified, and respectively extracting music rhythm data and music audio frame data in the music dry sound data; based on the music rhythm data and the phoneme state parameters, carrying out state alignment processing on the music audio frame data to obtain a phoneme starting and ending time set associated with N phonemes; n is a positive integer; determining musical acoustic feature probabilities of musical phoneme sequences associated with the N phonemes based on musical acoustic features corresponding to the musical audio frame data and the phoneme start-stop time set; based on dictionary data of music to be identified, M candidate texts corresponding to music phoneme sequences are obtained, and based on the acoustic feature probability of the music and text sequence probabilities corresponding to the M candidate texts, music text data corresponding to the music to be identified are determined from the M candidate texts; m is a positive integer.
The processor 1001 may also be used to invoke a device control application stored in the memory 1005 to implement:
When sample data comprising sample audio data and sample text data is obtained, respectively extracting sample rhythm data, sample audio frame data and sample pitch data from sample dry sound data in the sample audio data; sample audio data carries a sample tag; the sample tag is used for representing actual text data corresponding to the sample audio data; dictionary data in the initial audio recognition model is obtained, and phoneme conversion processing is carried out on the basis of actual text data, dictionary data and sample pitch data to obtain a sample phoneme string; based on the sample rhythm data and the phoneme state parameters, carrying out state alignment processing on the sample audio frame data to obtain a sample start-stop time set associated with the sample phoneme string; determining a sample acoustic feature probability of a sample phoneme sequence associated with the sample phoneme string based on the sample acoustic feature and the sample start-stop time set corresponding to the sample audio frame data; based on sample text data, dictionary data and sample acoustic feature probability of the sample phoneme sequence, obtaining predicted text data corresponding to the sample phoneme sequence; training the initial audio recognition model based on the sample text data, the actual text data and the predicted text data to obtain a music audio recognition model; the music audio recognition model is used for predicting music text data of music to be recognized.
It should be understood that the computer device 1000 described in the embodiments of the present application may perform the description of the data processing method in the embodiments corresponding to fig. 2, 3, 4, 5 and 6, the description of the data processing apparatus 1 in the embodiments corresponding to fig. 7, and the description of the data processing apparatus 2 in the embodiments corresponding to fig. 8, which are not repeated herein. In addition, the description of the beneficial effects of the same method is omitted.
The embodiment of the present application further provides a computer readable storage medium, where the computer readable storage medium stores a computer program, where the computer program includes program instructions, and when executed by a processor, implement a data processing method provided by each step in fig. 2, fig. 3, fig. 4, fig. 5, and fig. 6, and specifically refer to an implementation manner provided by each step in fig. 2, fig. 3, fig. 4, fig. 5, and fig. 6, which is not repeated herein. In addition, the description of the beneficial effects of the same method is omitted.
The computer readable storage medium may be the data processing apparatus provided in any one of the foregoing embodiments or an internal storage unit of the computer device, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card (flash card) or the like, which are provided on the computer device. Further, the computer-readable storage medium may also include both internal storage units and external storage devices of the computer device. The computer-readable storage medium is used to store the computer program and other programs and data required by the computer device. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device can execute the data processing method in the embodiments corresponding to fig. 2, 3, 4, 5 and 6, which are not described herein. In addition, the description of the beneficial effects of the same method is omitted.
Further, referring to fig. 10, fig. 10 is a schematic structural diagram of a data processing system according to an embodiment of the present application. The data processing system 3 may comprise data processing means 10a and data processing means 10b. The data processing apparatus 10a may be the data processing apparatus 1 in the embodiment corresponding to fig. 7, and therefore, a detailed description thereof will not be provided here. The data processing device 10b may be the data processing device 2 in the embodiment corresponding to fig. 8, and therefore, a detailed description thereof will not be provided here. In addition, the description of the beneficial effects of the same method is omitted. For technical details not disclosed in the embodiments of the data processing system according to the present application, please refer to the description of the method embodiments of the present application.
The term "comprising" and any variations thereof in the description of the embodiments of the present application and in the claims and drawings is intended to cover a non-exclusive inclusion. For example, a process, method, apparatus, article, or device that comprises a list of steps or elements is not limited to the list of steps or modules but may, in the alternative, include other steps or modules not listed or inherent to such process, method, apparatus, article, or device.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The methods and related devices provided in the embodiments of the present application are described with reference to the method flowcharts and/or structure diagrams provided in the embodiments of the present application, and each flowchart and/or block of the method flowcharts and/or structure diagrams may be implemented by computer program instructions, and combinations of flowcharts and/or blocks in the flowchart and/or block diagrams. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or structural diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or structures.
The foregoing disclosure is only illustrative of the preferred embodiments of the present application and is not intended to limit the scope of the claims herein, as the equivalent of the claims herein shall be construed to fall within the scope of the claims herein.

Claims (13)

1. A method of data processing, comprising:
acquiring music dry sound data in music to be identified, and respectively extracting music rhythm data and music audio frame data in the music dry sound data;
based on the music rhythm data and the phoneme state parameters, carrying out state alignment processing on the music audio frame data to obtain a phoneme starting and ending time set associated with N phonemes; n is a positive integer;
determining musical acoustic feature probabilities of musical phoneme sequences associated with the N phonemes based on musical acoustic features corresponding to the musical audio frame data and the phoneme start-stop time set;
acquiring M candidate texts corresponding to the music phoneme sequences based on dictionary data of the music to be identified, and determining music text data corresponding to the music to be identified from the M candidate texts based on the music acoustic feature probability and text sequence probabilities respectively corresponding to the M candidate texts; m is a positive integer.
2. The method according to claim 1, wherein the music rhythm data is composed of P pitches; P is a positive integer less than or equal to N; N is the total number of phonemes corresponding to the P pitches;
the step of performing state alignment processing on the music audio frame data based on the music rhythm data and the phoneme state parameters to obtain a phoneme starting and ending time set associated with N phonemes comprises the following steps:
based on the P pitches and the phoneme state parameters, performing initial alignment processing on the music audio frame data to obtain first alignment data; the first alignment data is used for indicating a first start time corresponding to each of the N phonemes; the music audio frame data includes an audio frame V_i; i is a positive integer less than or equal to Q; Q is the number of audio frames corresponding to the music audio frame data;
acquiring, based on the first alignment data, a state comprehensive probability corresponding to the audio frame V_i; the state comprehensive probability is determined by a state transition probability corresponding to the audio frame V_i and a state emission probability corresponding to the audio frame V_i;
when the state comprehensive probability corresponding to each audio frame is obtained, performing adjustment alignment processing on the first alignment data to obtain second alignment data;
and acquiring a second start-stop time corresponding to each of the N phonemes from the second alignment data, and obtaining the phoneme start-stop time set associated with the N phonemes based on the N second start-stop times.
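A hedged sketch of how claim 2's per-frame scores could be computed: for each frame V_i a state comprehensive score combines an assumed transition probability with an assumed emission probability (here in the log domain), and a Viterbi pass refines an initial alignment into the second alignment. The matrices below are invented toy values, not the patented parameters.

```python
# Viterbi-style refinement: the "comprehensive" score per frame/state is transition + emission
# in log space; the best path gives the adjusted (second) alignment of states to frames.
import numpy as np

def viterbi_align(log_emit, log_trans):
    """log_emit: (Q frames, S states); log_trans: (S, S). Returns the best state per frame."""
    Q, S = log_emit.shape
    dp = np.full((Q, S), -np.inf)
    back = np.zeros((Q, S), dtype=int)
    dp[0] = log_emit[0]
    for i in range(1, Q):
        for s in range(S):
            scores = dp[i - 1] + log_trans[:, s]              # transition into state s
            back[i, s] = int(np.argmax(scores))
            dp[i, s] = scores[back[i, s]] + log_emit[i, s]    # comprehensive = transition + emission
    path = [int(np.argmax(dp[-1]))]
    for i in range(Q - 1, 0, -1):
        path.append(back[i, path[-1]])
    return path[::-1]

Q, S = 6, 3
log_emit = np.log(np.random.dirichlet(np.ones(S), size=Q))    # toy emission probabilities
log_trans = np.log(np.array([[0.7, 0.3, 0.0],
                             [0.0, 0.7, 0.3],
                             [0.0, 0.0, 1.0]]) + 1e-9)        # toy left-to-right transitions
print(viterbi_align(log_emit, log_trans))                      # refined state per frame
```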
3. The method according to claim 2, wherein the performing initial alignment processing on the music audio frame data based on the P pitches and the phoneme state parameters to obtain first alignment data comprises:
acquiring a pitch Y_j from the P pitches, and determining a pitch start-stop frame number of the pitch Y_j in the music audio frame data; j is a positive integer less than or equal to P;
determining, based on the phoneme state parameter and the number of phonemes corresponding to the pitch Y_j, a first start time corresponding to each phoneme in the pitch Y_j within the pitch start-stop frame number;
and determining the first alignment data corresponding to the music audio frame data based on the first start times corresponding to the N phonemes associated with the P pitches.
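One plausible reading of claim 3's initial alignment, sketched below: each pitch Y_j contributes a start-stop frame span, and the first start times of its phonemes are spread evenly across that span. The frame length and the example spans are assumptions made only for illustration.

```python
# Toy initial alignment: distribute each pitch's phonemes evenly over that pitch's frame span.
def initial_alignment(pitch_spans, phonemes_per_pitch, frame_seconds=0.01):
    """pitch_spans: list of (start_frame, end_frame); returns a first start time per phoneme."""
    first_starts = []
    for (start, end), count in zip(pitch_spans, phonemes_per_pitch):
        step = (end - start) / count
        first_starts += [round((start + k * step) * frame_seconds, 3) for k in range(count)]
    return first_starts

pitch_spans = [(0, 30), (30, 90), (90, 120)]   # assumed pitch start-stop frame numbers
phonemes_per_pitch = [2, 3, 1]                 # N = 6 phonemes over P = 3 pitches
print(initial_alignment(pitch_spans, phonemes_per_pitch))
```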
4. The method of claim 1, wherein the music acoustic feature probability is determined based on a business acoustic model in a music audio recognition model; the business acoustic model comprises a first sub-model and a second sub-model;
the determining the music acoustic feature probability of the music phoneme sequence associated with the N phonemes based on music acoustic features corresponding to the music audio frame data and the phoneme start-stop time set comprises:
based on the phoneme starting and ending time set, carrying out feature extraction processing on the music audio frame data to obtain music acoustic features;
inputting the music acoustic features into the first sub-model, and performing phoneme recognition processing on the music acoustic features by the first sub-model to obtain phoneme sequence probabilities corresponding to initial phoneme sequences;
determining a phoneme transition probability corresponding to the initial phoneme sequence based on the phoneme start-stop time set and the second sub-model;
performing phoneme conversion processing on the initial phoneme sequence based on the phoneme transition probability to obtain the music phoneme sequence associated with the N phonemes;
and determining the music acoustic feature probability of the music phoneme sequence based on the phoneme sequence probability of the music phoneme sequence and the phoneme transition probability.
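As a rough illustration of how claim 4's two sub-model outputs might be combined, the snippet below adds a phoneme-sequence log probability and a phoneme-transition log probability with an optional weight; the weight and the numeric scores are assumptions, not values from the application.

```python
# Toy combination of the two sub-model scores into a single acoustic feature log probability.
import math

def acoustic_feature_log_prob(seq_log_prob, transition_log_prob, weight=1.0):
    """Sum the two sub-model scores in log space; the transition weight is an assumption."""
    return seq_log_prob + weight * transition_log_prob

seq_log_prob = math.log(0.42)          # stand-in output of the first sub-model
transition_log_prob = math.log(0.80)   # stand-in output of the second sub-model
print(acoustic_feature_log_prob(seq_log_prob, transition_log_prob))
```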
5. The method according to claim 1, wherein the obtaining M candidate texts corresponding to the music phoneme sequence based on dictionary data for the music to be identified, and determining music text data corresponding to the music to be identified from the M candidate texts based on the music acoustic feature probability and the text sequence probabilities respectively corresponding to the M candidate texts, comprises:
when the dictionary data for the music to be identified is acquired from a music audio recognition model, acquiring the M candidate texts corresponding to the music phoneme sequence based on the dictionary data;
inputting the M candidate texts into a business language model in the music audio recognition model, and outputting, by the business language model, the text sequence probabilities respectively corresponding to the M candidate texts;
based on the music acoustic feature probability and the text sequence probabilities respectively corresponding to the M candidate texts, obtaining text matching probabilities respectively corresponding to the M candidate texts;
and acquiring the highest text matching probability from the M text matching probabilities, and taking the candidate text corresponding to the highest text matching probability as the music text data corresponding to the music to be identified.
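A minimal sketch of claim 5's selection step, assuming log-domain scores: each candidate's text matching score adds the music acoustic feature probability to that candidate's text sequence probability, and the highest-scoring candidate is kept as the music text data. The candidates and numbers are invented.

```python
# Toy decoding step: combine acoustic and language-model log probabilities, keep the best text.
def choose_text(acoustic_logp, candidate_lm_logp):
    """candidate_lm_logp maps candidate text -> text-sequence log probability."""
    matching = {text: acoustic_logp + lm for text, lm in candidate_lm_logp.items()}
    best = max(matching, key=matching.get)
    return best, matching[best]

candidate_lm_logp = {"shining star tonight": -3.2, "shiny stars to night": -6.8}  # assumed scores
print(choose_text(acoustic_logp=-12.5, candidate_lm_logp=candidate_lm_logp))
```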
6. A method of data processing, comprising:
when sample data comprising sample audio data and sample text data is acquired, respectively extracting sample rhythm data, sample audio frame data and sample pitch data from sample dry sound data in the sample audio data; the sample audio data carries a sample tag; the sample tag is used for representing actual text data corresponding to the sample audio data;
acquiring dictionary data in an initial audio recognition model, and performing phoneme conversion processing based on the actual text data, the dictionary data and the sample pitch data to obtain a sample phoneme string;
based on the sample rhythm data and the phoneme state parameters, carrying out state alignment processing on the sample audio frame data to obtain a sample start-stop time set associated with the sample phoneme string;
determining a sample acoustic feature probability of a sample phoneme sequence associated with the sample phoneme string based on the sample acoustic feature corresponding to the sample audio frame data and the sample start-stop time set;
based on the sample text data, the dictionary data and the sample acoustic feature probability of the sample phoneme sequence, obtaining predicted text data corresponding to the sample phoneme sequence;
training the initial audio recognition model based on the sample text data, the actual text data and the predicted text data to obtain a music audio recognition model; the music audio recognition model is used for predicting music text data of music to be recognized.
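The training loop described in claim 6 can be pictured with the placeholder pipeline below; the dictionary, the alignment and decoding callables, and the 0/1 loss are stand-ins chosen only to make the data flow concrete, not the models trained in the application.

```python
# Toy training step: text -> phoneme string via a dictionary, align, decode, compare to the label.
def text_to_phoneme_string(text, dictionary):
    """Dictionary lookup stand-in for the phoneme conversion step."""
    return [p for word in text.split() for p in dictionary.get(word, ["<unk>"])]

def train_step(sample_audio_frames, actual_text, dictionary, model):
    phoneme_string = text_to_phoneme_string(actual_text, dictionary)
    alignment = model["align"](sample_audio_frames, phoneme_string)     # sample start-stop times
    predicted_text = model["decode"](sample_audio_frames, alignment, dictionary)
    loss = 0.0 if predicted_text == actual_text else 1.0                # toy 0/1 loss
    return predicted_text, loss

dictionary = {"la": ["l", "a"], "da": ["d", "a"]}                        # hypothetical dictionary
model = {
    "align": lambda frames, phones: [(k, k + 1) for k in range(len(phones))],
    "decode": lambda frames, alignment, d: "la da",
}
print(train_step(sample_audio_frames=[0.1] * 8, actual_text="la da",
                 dictionary=dictionary, model=model))
```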
7. The method of claim 6, wherein the obtaining dictionary data in the initial audio recognition model, performing a phoneme conversion process based on the actual text data, the dictionary data, and the sample pitch data to obtain a sample phoneme string, comprises:
acquiring the dictionary data in the initial audio recognition model, performing phoneme conversion processing on the actual text data based on the dictionary data, and determining an initial phoneme string corresponding to the sample pitch data; the initial phoneme string carries a first tone;
acquiring a tone change rule matched with the audio type of the sample audio data, and determining, in the tone change rule, a tone change parameter corresponding to the initial phoneme string based on a pitch frequency interval to which the initial phoneme string belongs;
and changing the first tone to a second tone based on the tone change parameter, and determining the initial phoneme string having the second tone as the sample phoneme string.
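The tone change in claim 7 could look roughly like the sketch below: the initial phoneme string's pitch frequency interval selects a tone change parameter from a rule table, which shifts the first tone to a second tone. The interval boundaries, the rule table and the tone encoding are all assumptions made for the example.

```python
# Toy tone change: pick a shift from an assumed rule table by pitch interval, apply it to the tone.
def apply_tone_change(phoneme_string, first_tone, mean_pitch_hz):
    rules = [((0, 200), +1), ((200, 400), 0), ((400, 10_000), -1)]   # assumed rule table
    shift = next(s for (lo, hi), s in rules if lo <= mean_pitch_hz < hi)
    second_tone = max(1, min(5, first_tone + shift))                 # clamp to tones 1..5
    return [f"{p}{second_tone}" for p in phoneme_string]             # phoneme string with second tone

print(apply_tone_change(["n", "i", "h", "ao"], first_tone=3, mean_pitch_hz=260.0))
```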
8. The method of claim 6, wherein the sample text data comprises raw text data and lyric text data;
training the initial audio recognition model based on the sample text data, the actual text data and the predicted text data to obtain a music audio recognition model, wherein the training comprises the following steps:
determining a first model loss of an initial language model in the initial audio recognition model based on the original text data and the lyric text data, and training the initial language model based on the first model loss to obtain a business language model;
Determining a second model loss of an initial acoustic model in the initial audio recognition model based on the actual text data and the predicted text data, and training the initial acoustic model based on the second model loss to obtain a business acoustic model;
and taking an initial audio recognition model comprising the business language model and the business acoustic model as the music audio recognition model.
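To make claim 8's two training branches concrete, the toy sketch below computes a first loss for the language model (a negative log-likelihood stand-in) and a second loss for the acoustic model (a crude token-error rate between actual and predicted text); both loss choices and all numbers are illustrative assumptions rather than the losses used in the application.

```python
# Toy illustration of two independent training losses that are then packaged into one model.
import math

def language_model_loss(predicted_prob_of_reference):
    """First model loss: negative log-likelihood of the reference lyric text."""
    return -math.log(max(predicted_prob_of_reference, 1e-12))

def acoustic_model_loss(actual_text, predicted_text):
    """Second model loss: fraction of mismatched tokens between actual and predicted text."""
    actual, predicted = actual_text.split(), predicted_text.split()
    errors = sum(a != p for a, p in zip(actual, predicted)) + abs(len(actual) - len(predicted))
    return errors / max(len(actual), 1)

first_loss = language_model_loss(0.31)                                    # toy LM probability
second_loss = acoustic_model_loss("twinkle twinkle little star", "twinkle little star")
music_audio_recognition_model = {"language_model": "trained", "acoustic_model": "trained"}
print(first_loss, second_loss, music_audio_recognition_model)
```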
9. A data processing apparatus, comprising:
a dry sound data acquisition module, configured to acquire music dry sound data in music to be identified, and to respectively extract music rhythm data and music audio frame data from the music dry sound data;
a music state alignment module, configured to perform state alignment processing on the music audio frame data based on the music rhythm data and phoneme state parameters to obtain a phoneme start-stop time set associated with N phonemes; N is a positive integer;
a feature probability determining module, configured to determine a music acoustic feature probability of a music phoneme sequence associated with the N phonemes based on music acoustic features corresponding to the music audio frame data and the phoneme start-stop time set;
and a text data determining module, configured to acquire M candidate texts corresponding to the music phoneme sequence based on dictionary data for the music to be identified, and to determine music text data corresponding to the music to be identified from the M candidate texts based on the music acoustic feature probability and text sequence probabilities respectively corresponding to the M candidate texts; M is a positive integer.
10. A data processing apparatus, comprising:
a sample audio acquisition module, configured to, when sample data comprising sample audio data and sample text data is acquired, respectively extract sample rhythm data, sample audio frame data and sample pitch data from sample dry sound data in the sample audio data; the sample audio data carries a sample tag; the sample tag is used for representing actual text data corresponding to the sample audio data;
a phoneme string acquisition module, configured to acquire dictionary data in an initial audio recognition model, and to perform phoneme conversion processing based on the actual text data, the dictionary data and the sample pitch data to obtain a sample phoneme string;
a sample state alignment module, configured to perform state alignment processing on the sample audio frame data based on the sample rhythm data and phoneme state parameters to obtain a sample start-stop time set associated with the sample phoneme string;
a sample probability determining module, configured to determine a sample acoustic feature probability of a sample phoneme sequence associated with the sample phoneme string based on sample acoustic features corresponding to the sample audio frame data and the sample start-stop time set;
a predicted text acquisition module, configured to acquire predicted text data corresponding to the sample phoneme sequence based on the sample text data, the dictionary data and the sample acoustic feature probability of the sample phoneme sequence;
and a model training module, configured to train the initial audio recognition model based on the sample text data, the actual text data and the predicted text data to obtain a music audio recognition model; the music audio recognition model is used for predicting music text data of music to be recognized.
11. A computer device, comprising: a processor, a memory, and a network interface;
the processor is connected to the memory and the network interface; the network interface is configured to provide data communication functions; the memory is configured to store a computer program; and the processor is configured to invoke the computer program to cause the computer device to perform the method of any one of claims 1-8.
12. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, the computer program being adapted to be loaded and executed by a processor to cause a computer device having the processor to perform the method of any one of claims 1-8.
13. A computer program product, wherein the computer program product comprises a computer program stored in a computer-readable storage medium, the computer program being adapted to be read and executed by a processor to cause a computer device having the processor to perform the steps of the method of any one of claims 1-8.
CN202211027484.XA | Filed 2022-08-25 | Data processing method, device, equipment, storage medium and program product | Status: Pending | Publication: CN117672184A

Priority Applications (1)

Application Number: CN202211027484.XA | Priority Date: 2022-08-25 | Filing Date: 2022-08-25 | Title: Data processing method, device, equipment, storage medium and program product

Publications (1)

Publication Number: CN117672184A | Publication Date: 2024-03-08

Family

ID=90077415

Family Applications (1)

Application Number: CN202211027484.XA | Title: Data processing method, device, equipment, storage medium and program product | Priority Date: 2022-08-25 | Filing Date: 2022-08-25

Country Status (1)

CN (1) CN117672184A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination