CN117727306A - Pickup translation method, device and storage medium based on original voiceprint features - Google Patents


Info

Publication number
CN117727306A
CN117727306A (application CN202311773493.8A)
Authority
CN
China
Prior art keywords
original
translation
translated
phonemes
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311773493.8A
Other languages
Chinese (zh)
Inventor
郑晓辉
牟欣语
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Runhengyi Technology Co ltd
Original Assignee
Qingdao Runhengyi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Runhengyi Technology Co ltd
Priority to CN202311773493.8A
Publication of CN117727306A
Legal status: Pending


Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a pickup translation method, device, and storage medium based on original voiceprint features, and relates to the technical field of speech recognition. The method comprises: acquiring native audio; dividing the native audio to obtain a plurality of original phonemes and their corresponding order; obtaining several categories of voiceprint features for each original phoneme; performing semantic recognition on the native audio to obtain original text; translating the original text into translated semantic text; performing phoneme fitting on the translated semantic text to obtain a plurality of translated phonemes and their corresponding order; and correcting the translated phonemes according to the original phonemes, their corresponding order, and the corresponding categories of voiceprint features to obtain translated audio. By recognizing and extracting the speaker's original voiceprint features, the invention corrects the speech translation result and renders it with the speaker's own vocal color.

Description

Pickup translation method, device and storage medium based on original voiceprint features
Technical Field
The invention belongs to the technical field of speech recognition, and particularly relates to a pickup translation method, device, and storage medium based on original voiceprint features.
Background
Today, with ever-deepening globalization, cross-language communication has become a daily need. To meet this need, speech translation technology has developed rapidly. A conventional speech translation process typically includes four steps: speech signal acquisition, speech recognition (converting speech to text), text translation, and speech synthesis (converting the translated text back to speech). Although the prior art has made significant advances in the accuracy of speech recognition and machine translation, some limitations remain.
Existing speech translation systems focus on the textual content of the speech and often ignore the rich non-linguistic information carried by the voice. This not only leaves the translation result devoid of the speaker's vocal emotion, but also reduces the accuracy of recognition and translation.
Disclosure of Invention
The invention aims to provide a pickup translation method, device, and storage medium based on original voiceprint features, which correct the speech translation result and render it with the speaker's vocal color by recognizing and extracting the speaker's original voiceprint features.
In order to solve the above technical problems, the invention is realized by the following technical solutions:
The invention provides a pickup translation method based on original voiceprint features, comprising the following steps:
acquiring native audio;
dividing the native audio to obtain a plurality of original phonemes and a corresponding order;
obtaining several categories of voiceprint features for each original phoneme, wherein the categories of voiceprint features include spectral features, formant features, and/or sound intensity features;
performing semantic recognition on the native audio to obtain original text;
translating the original text into translated semantic text;
performing phoneme fitting on the translated semantic text to obtain a plurality of translated phonemes and a corresponding order;
and correcting the translated phonemes according to the original phonemes, the corresponding order, and the corresponding categories of voiceprint features to obtain translated audio.
The invention also discloses a pickup translation method based on original voiceprint features, comprising the following steps:
acquiring and storing an audio stream in real time;
performing noise reduction and filtering on the audio stream to obtain a human voice stream;
obtaining blank periods in the human voice stream;
intercepting the human voice stream between blank periods as native audio;
and modifying the native audio according to the above pickup translation method to obtain translated audio.
The invention also discloses a pickup translation method based on original voiceprint features, comprising the following steps:
receiving translated audio;
playing the translated audio.
The invention also discloses a device, which is characterized by comprising,
the microphone is used for recording and obtaining primary audio;
a translation unit for translating the native audio into translated audio;
And the loudspeaker is used for playing the translated audio.
The present invention also discloses a storage medium comprising,
the storage medium stores at least one command, at least one program, a code set or an instruction set, and the at least one command, the at least one program, the code set or the instruction set is loaded and executed by a processor to realize a pickup translation method based on the native voiceprint features.
The invention records the speaker's native audio through the microphone, recognizes and extracts the speaker's original voiceprint features through the translation unit, and finally plays the translated audio through the loudspeaker. In this process, the speaker's vocal characteristics are imparted to the speech synthesized after translation, so that the speech translation result is not only corrected but also rendered with the speaker's vocal color.
Of course, any single product embodying the invention need not achieve all of the above advantages at the same time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic diagram of functional modules and information flow of a pickup translation apparatus according to an embodiment of the present invention;
FIG. 2 is a first step flowchart of an embodiment of the pickup translation method based on original voiceprint features according to the present invention;
FIG. 3 is a second step flowchart of an embodiment of the pickup translation method based on original voiceprint features according to the present invention;
FIG. 4 is a third step flowchart of an embodiment of the pickup translation method based on original voiceprint features according to the present invention;
FIG. 5 is a flowchart of step S7 according to an embodiment of the present invention;
FIG. 6 is a first flowchart of step S76 according to an embodiment of the present invention;
FIG. 7 is a flowchart of step S762 according to an embodiment of the present invention;
FIG. 8 is a second flowchart of step S76 according to an embodiment of the present invention;
FIG. 9 is a flowchart of step S77 according to an embodiment of the present invention;
in the drawings, the list of components represented by the various numbers is as follows:
1-microphone, 2-translation unit, 3-speaker.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like herein are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
Speech translation is the process of converting spoken input in one language into spoken output in another language. It combines speech recognition and machine translation technologies so that people can communicate across languages by voice. However, prior-art simultaneous interpretation produces its output with electronically fitted synthetic voices and cannot restore the voice of the real speaker. In view of this, the present invention provides the following.
Referring to fig. 1 to 4, the present invention provides a pickup translation apparatus functionally divided into a microphone 1, a translation unit 2, and a speaker 3. In use, the microphone 1 records and obtains the native audio, the translation unit 2 translates the native audio into translated audio, and the speaker 3 plays the translated audio. This is of course only a brief description of each functional module; each is described in detail below.
First, the microphone 1 may perform step S011 to acquire and store the audio stream in real time; alternatively, the audio stream may be stored by a dedicated storage module instead. The translation unit 2 may then perform step S012 to apply noise reduction and filtering to the audio stream to obtain a human voice stream. Step S013 may next be performed to obtain the blank periods in the human voice stream. Step S014 may next be performed to intercept the human voice stream between blank periods as native audio. Step S015 may then be performed to modify the native audio to obtain translated audio.
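A minimal sketch of this front end (steps S011 to S014) is given below. The band-pass range, frame length, and energy threshold are illustrative assumptions for an energy-based blank-period detector, not values taken from the patent:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def to_voice_stream(audio, sr=16000):
    """S012: band-pass the raw stream to a rough human-voice band."""
    sos = butter(4, [80, 4000], btype="bandpass", fs=sr, output="sos")
    return sosfilt(sos, audio)

def blank_periods(voice, sr=16000, frame_ms=30, energy_thresh=1e-4, min_blank_s=0.4):
    """S013: return (start, end) sample indices of blank (low-energy) periods."""
    frame = int(sr * frame_ms / 1000)
    n_frames = len(voice) // frame
    energies = np.array([np.mean(voice[i * frame:(i + 1) * frame] ** 2) for i in range(n_frames)])
    silent = energies < energy_thresh
    blanks, start = [], None
    for i, s in enumerate(silent):
        if s and start is None:
            start = i
        elif not s and start is not None:
            if (i - start) * frame >= min_blank_s * sr:
                blanks.append((start * frame, i * frame))
            start = None
    if start is not None:
        blanks.append((start * frame, n_frames * frame))
    return blanks

def native_audio_segments(voice, blanks):
    """S014: intercept the voice stream between blank periods as native audio."""
    bounds = [0] + [b for blank in blanks for b in blank] + [len(voice)]
    segments = [voice[bounds[i]:bounds[i + 1]] for i in range(0, len(bounds) - 1, 2)]
    return [seg for seg in segments if len(seg) > 0]
```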
The speaker 3 may be provided as a single unit together with the microphone 1 and the translation unit 2, or the speaker 3 may be provided separately; for example, a plurality of speakers 3 may be connected to the translation unit 2 by wired or wireless means. The speaker 3 may then perform step S021 to receive the translated audio obtained by the pickup translation method based on original voiceprint features described herein, and finally may perform step S022 to play the translated audio.
In the process of translating the native audio into the translated audio, the translation unit 2 may first perform step S1 to obtain the native audio. Step S2 may then be performed to divide the native audio into a plurality of original phonemes and a corresponding order. Step S3 may then be performed to obtain several categories of voiceprint features for each original phoneme. Voiceprint features are biological features of a speech signal that are unique to an individual, similar to a fingerprint or iris. Each person's voice has unique characteristics; the categories of voiceprint features here include spectral features, formant features, and/or sound intensity features.
Spectral features are representations of a sound signal in the frequency domain; they describe the energy distribution and spectral shape of the sound signal at different frequencies. The spectral features of a voiceprint are unique to each person because they are shaped by the vocal tract, including the throat, lips, and nasal cavity.
Formant features are an important acoustic feature in voiceprint analysis and describe the formant distribution of a sound signal in the frequency domain. Formants are prominent peaks of higher spectral intensity in the sound signal; they reflect the resonant frequencies produced as the sound passes through the resonant cavities of the vocal tract, such as the vocal cords, throat, and mouth.
Sound intensity features describe the intensity (or volume) of the sound signal; they reflect its energy or amplitude level. Sound intensity features are commonly used in sound processing, audio analysis, voiceprint recognition, and related fields.
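As one possible realization of step S3, the sketch below extracts the three categories of voiceprint features for a single original phoneme using librosa; MFCCs, LPC-root formant estimates, and RMS energy are assumed stand-ins for the spectral, formant, and sound intensity features, and are not prescribed by the patent:

```python
import numpy as np
import librosa

def voiceprint_features(phoneme_audio, sr=16000):
    """Return one phoneme's spectral, formant, and intensity features (step S3)."""
    phoneme_audio = np.asarray(phoneme_audio, dtype=float)
    feats = {}
    # Spectral features: energy distribution over frequency (here, mean of 13 MFCCs).
    feats["spectral"] = librosa.feature.mfcc(y=phoneme_audio, sr=sr, n_mfcc=13).mean(axis=1)
    # Formant features: resonant peaks estimated from LPC polynomial roots.
    lpc = librosa.lpc(phoneme_audio, order=12)
    roots = [r for r in np.roots(lpc) if np.imag(r) > 0]
    formants = sorted(np.angle(roots) * sr / (2 * np.pi))
    feats["formants"] = np.array((formants + [0.0] * 4)[:4])   # rough F1-F4, zero-padded
    # Sound intensity features: overall energy / amplitude level.
    feats["intensity"] = float(librosa.feature.rms(y=phoneme_audio).mean())
    return feats
```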
Step S4 may be performed to carry out semantic recognition on the native audio to obtain the original text. Step S5 may then be performed to translate the original text into the translated semantic text. Rule-based machine translation (RBMT) may be used in this process: this approach relies on manually written translation rules and grammar rules. It parses the source-language text into a grammatical structure and then generates the target-language text according to predefined rules. However, this approach requires a great deal of manual work and expertise, and may not be flexible enough for complex language structures and expressions. Neural machine translation (NMT) may also be used: this approach uses a deep neural network model for translation, mapping source-language text directly to target-language text with an end-to-end model trained for the task. NMT performs better on long sentences and complex grammatical structures and captures context more effectively, but it typically requires a large amount of training data and computing resources for training and inference.
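For step S5, an off-the-shelf NMT model can be called directly; in the sketch below the Helsinki-NLP/opus-mt-zh-en checkpoint is only an illustrative choice, not the model used by the patent:

```python
from transformers import pipeline

# Chinese-to-English translation as an example language pair.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-zh-en")

def translate_text(original_text: str) -> str:
    """S5: translate the original text into the translated semantic text."""
    return translator(original_text, max_length=512)[0]["translation_text"]
```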
Step S6 may be performed to carry out phoneme fitting on the translated semantic text to obtain a plurality of translated phonemes and a corresponding order. Text-to-Speech (TTS) technology may be used here. TTS technology converts text input into audible speech output. Many TTS tools and services are now available, both online and offline, including open-source libraries and commercial products. These tools typically provide a variety of speech synthesis models and voice styles, from which the corresponding speech can be generated for the input text.
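A minimal sketch of step S6, using the g2p_en grapheme-to-phoneme converter to obtain a translated-phoneme sequence for English translated text; a full TTS front end would additionally predict phoneme durations:

```python
from g2p_en import G2p

g2p = G2p()

def translated_phonemes(translated_text: str):
    """S6: return the translated phonemes and their order as an indexed list."""
    phones = [p for p in g2p(translated_text) if p.strip()]
    return list(enumerate(phones))   # [(order, phoneme), ...]
```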
Finally, step S7 may be executed to correct the translated phonemes according to the original phonemes, the corresponding order, and the corresponding categories of voiceprint features to obtain the translated audio. This process must be implemented in conjunction with the speaker's voiceprint features, as described in detail below.
Referring to fig. 5, in order to correct the translated phonemes, step S71 may first be executed in implementing step S7 to semantically segment the original text into a plurality of original text segments and a corresponding order. Step S72 may be performed to semantically segment the translated semantic text into a plurality of translated text segments and a corresponding order. Step S73 may next be performed to obtain the original phonemes and their order for each original text segment. Step S74 may be performed to obtain the translated phonemes and their order for each translated text segment. Step S75 may then be executed to semantically match the translated text segments against the original text segments according to the translation correspondence between the original text and the translated semantic text, obtaining several pairs of original and translated text segments with the same meaning. Step S76 may then be executed to correct the translated phonemes of each translated text segment according to the several categories of voiceprint features of the original phonemes of the corresponding original text segment, obtaining the translated speech segment corresponding to that translated text segment. Finally, step S77 may be performed to merge the translated speech segments in the order of the translated text segments to obtain the translated audio.
Referring to fig. 9, in order to improve the smoothness of the merged translated audio, step S771 may first be executed in implementing step S77 to obtain the tone features of the native audio. Tone features refer to the pitch variation pattern in speech; they reflect the pitch differences between syllables or phonemes in the speech signal. Tone is an important feature of spoken language and can convey word meaning, mood, emotion, and other information. Step S772 may then be performed to assign the tone features of the native audio to the plurality of translated speech segments merged in the order of the translated text segments, obtaining the translated audio. Adjusting the tone features improves the fluency of the translated audio, so that it better matches the speaker's actual speaking state and mood.
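One plausible realization of steps S771 and S772 is to extract the fundamental-frequency (F0) contour of the native audio and impose it on the merged translated segments with the WORLD vocoder (pyworld), as sketched below; this is an assumed method, not the one fixed by the patent:

```python
import numpy as np
import pyworld as pw

def transfer_tone(native_audio, translated_audio, sr=16000):
    native = np.asarray(native_audio, dtype=np.float64)
    translated = np.asarray(translated_audio, dtype=np.float64)
    # S771: tone features = F0 contour of the native audio.
    native_f0, _ = pw.harvest(native, sr)
    # Decompose the merged translated audio so its F0 can be replaced.
    f0, t = pw.harvest(translated, sr)
    sp = pw.cheaptrick(translated, f0, t, sr)
    ap = pw.d4c(translated, f0, t, sr)
    # S772: stretch the native F0 contour onto the translated timeline and resynthesize.
    stretched_f0 = np.interp(
        np.linspace(0, 1, len(f0)), np.linspace(0, 1, len(native_f0)), native_f0
    )
    return pw.synthesize(stretched_f0, sp, ap, sr)
```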
To supplement the implementation of steps S71 to S77 described above, source code for part of the functional modules is provided, with explanatory comments in the annotations. To avoid disclosing trade secrets, portions of the data that do not affect the implementation of the scheme have been desensitized, as follows.
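The listing itself is not reproduced here; the following is a minimal sketch of the framework it describes, in which every *_fn helper is a hypothetical stand-in supplied by the caller rather than the desensitized source code:

```python
# High-level framework for steps S71-S77.
def build_translated_audio(original_text, translated_text,
                           segment_fn, original_phoneme_fn, translated_phoneme_fn,
                           match_fn, correct_fn, merge_fn):
    original_segments = segment_fn(original_text)          # S71: segments in order
    translated_segments = segment_fn(translated_text)      # S72
    original_phones = [original_phoneme_fn(seg) for seg in original_segments]        # S73
    translated_phones = [translated_phoneme_fn(seg) for seg in translated_segments]  # S74
    # S75: pairs of (original index, translated index) with the same meaning.
    pairs = match_fn(original_segments, translated_segments)
    # S76: correct each translated segment's phonemes with the matched original
    # segment's voiceprint features, producing translated speech segments.
    speech_segments = [
        (t, correct_fn(original_phones[o], translated_phones[t])) for o, t in pairs
    ]
    # S77: merge the speech segments in the order of the translated text segments.
    speech_segments.sort(key=lambda item: item[0])
    return merge_fn([seg for _, seg in speech_segments])
```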
The above code is a high-level framework describing the overall processing flow from text to translated audio. The flow is as follows:
First, the original text and the translated text are semantically segmented to obtain text segments and their order, and a corresponding phoneme sequence is then obtained for each text segment. The original and translated text segments are then matched to ensure semantic consistency. The translated phonemes are corrected using the voiceprint features of the original phonemes, and the corrected translated phonemes are combined in order to generate the translated audio.
Implementing this flow requires specific algorithms for several complex steps, including semantic segmentation, phoneme extraction, text matching, and voice feature adjustment, which are reduced to framework functions in the above code. In practice, each function would involve complex algorithms and possibly deep learning models.
Referring to fig. 6, because a speaker may talk for a long time, a conversation may be divided into many pairs of matched original and translated text segments, and their voiceprint features are correlated. In view of this, for each pair of semantically matched original and translated text segments, step S761 may first be executed in implementing step S76 to vectorize each category of voiceprint features of the original phonemes corresponding to the original text segment, obtaining a voiceprint feature vector for each original phoneme of the original text segment. Step S762 may be executed next to select a number of feature original phonemes according to these voiceprint feature vectors and to obtain a duration proportionality coefficient for each feature original phoneme. Step S763 may be performed to obtain the order of the feature original phonemes according to the order of all the original phonemes corresponding to the original text segment. Step S764 may next obtain the total duration of all the translated phonemes corresponding to the translated text segment. Step S765 may then be executed to divide that total duration according to the duration proportionality coefficients of the feature original phonemes, arranged in the order of the feature original phonemes, so as to obtain the translated phonemes corresponding to each feature original phoneme. Finally, step S766 may be executed to assign the several categories of voiceprint features of each feature original phoneme to its corresponding translated phonemes and then combine them to obtain the translated speech segment corresponding to the translated text segment.
Referring to fig. 8, of course not every translated text segment has an original text segment with the same meaning. Therefore, step S767 also needs to be executed to determine whether there is a translated text segment for which no corresponding original text segment has been semantically matched. If there is, step S768 may be executed to take the original text segment corresponding to an adjacent, already matched translated text segment as the semantically matched original text segment for that translated text segment, and then steps S762 to S766 are carried out as before.
To supplement the implementation of steps S761 to S766 described above, source code for part of the functional modules is provided, with explanatory comments in the annotations.
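As above, the following is a minimal sketch of steps S761 to S766 rather than the desensitized listing. The per-phoneme dictionary layout (keys "duration" and "features") and the injected select_fn are assumptions; select_fn corresponds to the clustering sketch given further below:

```python
import numpy as np

def vectorize_features(phoneme):
    """S761: concatenate every category of voiceprint feature into one vector."""
    f = phoneme["features"]
    return np.concatenate([np.atleast_1d(f["spectral"]),
                           np.atleast_1d(f["formants"]),
                           np.atleast_1d(f["intensity"])])

def correct_translated_phonemes(original_phonemes, translated_phonemes, select_fn):
    """Steps S761-S766 for one matched pair of original and translated text segments."""
    vectors = [vectorize_features(p) for p in original_phonemes]               # S761
    feature_idx, coeffs = select_fn(vectors, original_phonemes)                # S762
    feature_idx, coeffs = zip(*sorted(zip(feature_idx, coeffs)))               # S763
    total = sum(p["duration"] for p in translated_phonemes)                    # S764
    # S765: split the translated timeline in proportion to the duration coefficients.
    boundaries = np.cumsum(coeffs) / sum(coeffs) * total
    corrected, elapsed, t = [], 0.0, 0
    for idx, end in zip(feature_idx, boundaries):
        # S766: translated phonemes inside this share inherit the feature
        # original phoneme's voiceprint features.
        while t < len(translated_phonemes) and elapsed < end:
            phone = dict(translated_phonemes[t])
            phone["features"] = original_phonemes[idx]["features"]
            corrected.append(phone)
            elapsed += phone["duration"]
            t += 1
    # Any remainder left by rounding goes to the last feature original phoneme.
    for rest in translated_phonemes[t:]:
        phone = dict(rest)
        phone["features"] = original_phonemes[feature_idx[-1]]["features"]
        corrected.append(phone)
    return corrected
```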
The above code applies the phoneme features of the original text segment to the phonemes of the translated text segment. The voiceprint features of the original phonemes are first vectorized; feature phonemes are then selected and their duration scaling coefficients calculated; these coefficients are applied to the duration allocation of the translated phonemes; and the voiceprint features of the feature phonemes of the original text segment are assigned to the corresponding translated phonemes. The code is essentially a framework: some functions, such as the vectorization of voiceprint features and the assignment of phoneme features, must be filled in according to the actual voiceprint feature extraction and processing algorithms used in a practical application.
Referring to fig. 7, since an original text segment corresponds to a large number of original phonemes, in order to increase translation speed without greatly degrading the translated sound, step S7621 may first be executed for each original text segment in implementing step S762, selecting a number of marked voiceprint feature vectors from all the voiceprint feature vectors. Step S7622 may be performed next to calculate the vector difference between each marked voiceprint feature vector and each unmarked voiceprint feature vector. Step S7623 may be performed to classify each unmarked voiceprint feature vector into the same mark group as the marked voiceprint feature vector for which the modulus of their vector difference is smallest. Step S7624 may be performed to calculate the mean vector of all marked and unmarked voiceprint feature vectors within each mark group. Step S7625 may then select, within each mark group, the marked or unmarked voiceprint feature vector whose vector difference from the mean vector has the smallest modulus as the updated marked voiceprint feature vector. Step S7626 may be performed next to determine whether the marked voiceprint feature vector of any mark group has changed. If it has, steps S7622 to S7626 are repeated, continually updating the mark groups and the marked voiceprint feature vectors; if it has not, step S7627 may be executed to take the original phonemes corresponding to the marked voiceprint feature vectors as the feature original phonemes. Finally, step S7628 may be executed to take the ratio between the accumulated durations of the original phonemes corresponding to the marked and unmarked voiceprint feature vectors in each mark group as the duration proportionality coefficient of the corresponding feature original phoneme.
To supplement the implementation of steps S7621 to S7628 described above, source code for part of the functional modules is provided, with explanatory comments in the annotations.
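Again as a stand-in for the desensitized listing, the sketch below implements steps S7621 to S7628 as a k-medoids-style grouping; the number of mark groups k and the iteration limit are illustrative defaults, not values fixed by the patent:

```python
import numpy as np

def select_feature_phonemes(vectors, original_phonemes, k=3, max_iter=50):
    """Return (feature_idx, duration_coeffs): indices of the marked ("feature")
    original phonemes and each group's duration proportionality coefficient."""
    vectors = np.asarray(vectors, dtype=float)
    k = min(k, len(vectors))
    markers = list(range(k))                                   # S7621: initial marked vectors
    for _ in range(max_iter):
        # S7622/S7623: group every vector with the nearest marked vector.
        dists = np.linalg.norm(vectors[:, None, :] - vectors[markers][None, :, :], axis=2)
        groups = dists.argmin(axis=1)
        new_markers = []
        for g in range(k):
            members = np.where(groups == g)[0]
            if len(members) == 0:                              # keep the old marker if a group empties
                new_markers.append(markers[g])
                continue
            mean = vectors[members].mean(axis=0)               # S7624: group mean vector
            # S7625: the member closest to the mean becomes the updated marker.
            new_markers.append(int(members[np.argmin(
                np.linalg.norm(vectors[members] - mean, axis=1))]))
        if new_markers == markers:                             # S7626/S7627: markers unchanged
            break
        markers = new_markers
    # S7628: duration proportionality coefficient = share of accumulated duration per group.
    group_durations = np.zeros(k)
    for i, g in enumerate(groups):
        group_durations[g] += original_phonemes[i]["duration"]
    duration_coeffs = (group_durations / group_durations.sum()).tolist()
    return [int(m) for m in markers], duration_coeffs
```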
The above code groups a set of phonemes by an iterative algorithm, selects the phoneme closest to each group's mean vector as a feature original phoneme, and calculates the duration proportionality coefficient of the phonemes in each group. In this way, a set of representative phonemes and corresponding duration ratios is obtained, which facilitates the subsequent sound analysis and synthesis operations.
The invention also discloses a storage medium storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the pickup translation method based on original voiceprint features described above.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
It will also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by hardware that performs the corresponding functions or acts, such as circuits or ASICs (Application Specific Integrated Circuits), or by combinations of hardware and software, such as firmware.
Although the invention is described herein in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
The embodiments of the present application have been described above, the foregoing description is exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A pickup translation method based on original voiceprint features, characterized by comprising the following steps:
acquiring native audio;
dividing the native audio to obtain a plurality of original phonemes and a corresponding order;
obtaining several categories of voiceprint features for each original phoneme, wherein the categories of voiceprint features include spectral features, formant features, and/or sound intensity features;
performing semantic recognition on the native audio to obtain original text;
translating the original text into translated semantic text;
performing phoneme fitting on the translated semantic text to obtain a plurality of translated phonemes and a corresponding order;
and correcting the translated phonemes according to the original phonemes, the corresponding order, and the corresponding categories of voiceprint features to obtain translated audio.
2. The method of claim 1, wherein the step of correcting the translated phonemes according to the original phonemes, the corresponding order, and the corresponding categories of voiceprint features to obtain translated audio comprises:
semantically segmenting the original text to obtain a plurality of original text segments and a corresponding order;
semantically segmenting the translated semantic text to obtain a plurality of translated text segments and a corresponding order;
obtaining a plurality of original phonemes and a corresponding order for each original text segment;
obtaining a plurality of translated phonemes and a corresponding order for each translated text segment;
semantically matching the translated text segments according to the translation correspondence between the original text and the translated semantic text, to obtain several pairs of original text segments and translated text segments with the same meaning;
correcting the plurality of translated phonemes corresponding to each translated text segment according to the several categories of voiceprint features of the plurality of original phonemes corresponding to the matched original text segment, to obtain the translated speech segment corresponding to the translated text segment;
and merging the translated speech segments in the order of the translated text segments to obtain translated audio.
3. The method of claim 2, wherein the step of correcting the plurality of translated phonemes corresponding to the translated text segment according to the several categories of voiceprint features of the plurality of original phonemes corresponding to the original text segment, to obtain the translated speech segment corresponding to the translated text segment, comprises:
for each pair of semantically matched original and translated text segments,
vectorizing each category of voiceprint features of each original phoneme corresponding to the original text segment to obtain a voiceprint feature vector for each original phoneme of the original text segment,
selecting a number of feature original phonemes according to the voiceprint feature vectors of the original phonemes corresponding to the original text segment, and obtaining a duration proportionality coefficient for each feature original phoneme,
obtaining the order of the feature original phonemes according to the order of all the original phonemes corresponding to the original text segment,
obtaining the total duration of all the translated phonemes corresponding to the translated text segment,
dividing the total duration of the translated phonemes corresponding to the translated text segment according to the duration proportionality coefficients of the feature original phonemes arranged in the order of the feature original phonemes, to obtain the plurality of translated phonemes corresponding to each feature original phoneme,
and assigning the several categories of voiceprint features of each feature original phoneme to its corresponding plurality of translated phonemes, and combining them to obtain the translated speech segment corresponding to the translated text segment.
4. The method of claim 3, wherein the step of selecting a number of feature original phonemes according to the voiceprint feature vectors of the original phonemes corresponding to the original text segment and obtaining a duration proportionality coefficient for each feature original phoneme comprises:
for each original text segment,
selecting a plurality of marked voiceprint feature vectors from all the voiceprint feature vectors;
calculating the vector difference between each marked voiceprint feature vector and each unmarked voiceprint feature vector;
classifying each unmarked voiceprint feature vector into the same mark group as the marked voiceprint feature vector for which the modulus of their vector difference is smallest;
calculating the mean vector of all marked and unmarked voiceprint feature vectors within each mark group;
selecting, within each mark group, the marked or unmarked voiceprint feature vector whose vector difference from the mean vector has the smallest modulus as the updated marked voiceprint feature vector;
determining whether the marked voiceprint feature vector of the mark group has changed;
if so, returning to continue updating the mark groups and the marked voiceprint feature vectors;
if not, taking the original phonemes corresponding to the marked voiceprint feature vectors as the feature original phonemes;
and taking the ratio between the accumulated durations of the original phonemes corresponding to the marked or unmarked voiceprint feature vectors in each mark group as the duration proportionality coefficient of the corresponding feature original phoneme.
5. The method of claim 3, wherein the step of correcting the translated phonemes according to the original phonemes, the corresponding order, and the corresponding categories of voiceprint features to obtain translated audio further comprises:
determining whether there is a translated text segment for which no corresponding original text segment has been semantically matched;
if not, performing no processing;
if so, for the translated text segment for which no corresponding original text segment has been semantically matched, taking the original text segment corresponding to an adjacent semantically matched translated text segment as the semantically matched original text segment for that translated text segment.
6. The method of claim 2, wherein the step of merging the translated speech segments in the order of the translated text segments to obtain translated audio comprises:
obtaining tone features of the native audio;
and assigning the tone features of the native audio to the plurality of translated speech segments merged in the order of the translated text segments, to obtain the translated audio.
7. A pickup translation method based on original voiceprint features, characterized by comprising the following steps:
acquiring and storing an audio stream in real time;
performing noise reduction and filtering on the audio stream to obtain a human voice stream;
obtaining blank periods in the human voice stream;
intercepting the human voice stream between blank periods as native audio;
and modifying the native audio according to the pickup translation method based on original voiceprint features of any one of claims 1 to 6 to obtain translated audio.
8. A pickup translation method based on original voiceprint features, characterized by comprising the following steps:
receiving translated audio obtained by the pickup translation method based on original voiceprint features of any one of claims 1 to 6;
and playing the translated audio.
9. A pickup translation apparatus, characterized by comprising:
a microphone for recording and obtaining native audio;
a translation unit for translating the native audio into translated audio according to the pickup translation method based on original voiceprint features of any one of claims 1 to 6;
and a loudspeaker for playing the translated audio.
10. A storage medium, characterized in that the storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the pickup translation method based on original voiceprint features of any one of claims 1 to 6.
CN202311773493.8A 2023-12-21 2023-12-21 Pickup translation method, device and storage medium based on original voiceprint features Pending CN117727306A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311773493.8A CN117727306A (en) 2023-12-21 2023-12-21 Pickup translation method, device and storage medium based on original voiceprint features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311773493.8A CN117727306A (en) 2023-12-21 2023-12-21 Pickup translation method, device and storage medium based on original voiceprint features

Publications (1)

Publication Number Publication Date
CN117727306A 2024-03-19

Family

ID=90210465

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311773493.8A Pending CN117727306A (en) 2023-12-21 2023-12-21 Pickup translation method, device and storage medium based on original voiceprint features

Country Status (1)

Country Link
CN (1) CN117727306A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6029131A (en) * 1996-06-28 2000-02-22 Digital Equipment Corporation Post processing timing of rhythm in synthetic speech
CN109410924A (en) * 2017-08-14 2019-03-01 三星电子株式会社 Recognition methods and identification equipment
CN110008481A (en) * 2019-04-10 2019-07-12 南京魔盒信息科技有限公司 Translated speech generation method, device, computer equipment and storage medium
US20190354592A1 (en) * 2018-05-16 2019-11-21 Sharat Chandra Musham Automated systems and methods for providing bidirectional parallel language recognition and translation processing with machine speech production for two users simultaneously to enable gapless interactive conversational communication
CN112562733A (en) * 2020-12-10 2021-03-26 平安普惠企业管理有限公司 Media data processing method and device, storage medium and computer equipment
CN113421571A (en) * 2021-06-22 2021-09-21 云知声智能科技股份有限公司 Voice conversion method and device, electronic equipment and storage medium
CN113889105A (en) * 2021-09-29 2022-01-04 北京搜狗科技发展有限公司 Voice translation method and device for voice translation
US20230013777A1 (en) * 2021-07-16 2023-01-19 Google Llc Robust Direct Speech-to-Speech Translation
CN116935851A (en) * 2022-04-02 2023-10-24 青岛海尔多媒体有限公司 Method and device for voice conversion, voice conversion system and storage medium


Similar Documents

Publication Publication Date Title
Zhang et al. Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching
CN112017644B (en) Sound transformation system, method and application
CN112184858B (en) Virtual object animation generation method and device based on text, storage medium and terminal
CN112562721A (en) Video translation method, system, device and storage medium
CN107657017A (en) Method and apparatus for providing voice service
CN110223705A (en) Phonetics transfer method, device, equipment and readable storage medium storing program for executing
CN110148394B (en) Singing voice synthesizing method, singing voice synthesizing device, computer equipment and storage medium
WO2019214047A1 (en) Method and apparatus for establishing voice print model, computer device, and storage medium
CN107731228A (en) The text conversion method and device of English voice messaging
CN111179905A (en) Rapid dubbing generation method and device
CN109637551A (en) Phonetics transfer method, device, equipment and storage medium
CN110853616A (en) Speech synthesis method, system and storage medium based on neural network
CN110010136A (en) The training and text analyzing method, apparatus, medium and equipment of prosody prediction model
CN115565540B (en) Invasive brain-computer interface Chinese pronunciation decoding method
US20220157329A1 (en) Method of converting voice feature of voice
Karpov An automatic multimodal speech recognition system with audio and video information
US6546369B1 (en) Text-based speech synthesis method containing synthetic speech comparisons and updates
CN116564269A (en) Voice data processing method and device, electronic equipment and readable storage medium
CN112509550A (en) Speech synthesis model training method, speech synthesis device and electronic equipment
KR20190135853A (en) Method and system of text to multiple speech
CN114283822A (en) Many-to-one voice conversion method based on gamma pass frequency cepstrum coefficient
CN116895273B (en) Output method and device for synthesized audio, storage medium and electronic device
CN113035169A (en) Voice synthesis method and system capable of training personalized tone library on line
CN116092475B (en) Stuttering voice editing method and system based on context-aware diffusion model
Howell Confusion modelling for lip-reading

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination